Post 8: Support Vector Machines (SVMs) — with a crystal property example

Module 2 · Core Algorithms

Learn how SVMs find the widest possible gap between classes, why support vectors alone define the boundary, and how the kernel trick extends SVMs to non-linear crystal property classification — without ever computing high-dimensional coordinates explicitly.

Abderrahmane REGGAD · June 16, 2026

📐 Algorithm: Support Vector Machine (SVM)

🎯 Key Idea: Maximum margin hyperplane

🔬 Materials Target: Crystal symmetry classification

🌀 Key Trick: Kernel function (RBF, polynomial)

Decision trees and random forests (Post 7) partition feature space with axis-aligned rectangles. Support Vector Machines (SVMs) take a fundamentally different approach: they search for the single hyperplane (line in 2D, plane in 3D, hyperplane in higher dimensions) that separates the classes with the maximum possible margin — the widest gap between the boundary and the nearest training points of each class. This maximum-margin principle gives SVMs excellent generalisation even with few training samples, making them particularly well-suited to materials science datasets, where DFT-labelled compounds are expensive to compute.

📐

What we will classify

Twelve inorganic compounds labelled by crystal symmetry: cubic (rock-salt / fluorite / perovskite) vs. non-cubic (rutile / wurtzite / corundum / NiAs-type), using the tolerance factor τ, electronegativity difference ΔEN, and average coordination number CN as features.

1. The Maximum-Margin Hyperplane

Suppose our training data are linearly separable in feature space — there exists at least one hyperplane that correctly separates all positive from all negative examples. Many such hyperplanes exist; which one should we choose? The SVM selects the one that maximises the margin — the perpendicular distance from the hyperplane to the nearest training point on each side.

📐 Hyperplane equation

w · x + b = 0

w = weight vector (normal to the hyperplane)
b = bias (offset from origin)
Margin = 2 / ‖w‖

Positive class: w · x + b ≥ +1
Negative class: w · x + b ≤ −1
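
To make the geometry concrete, here is a minimal NumPy sketch that evaluates w · x + b and the margin 2 / ‖w‖ for a toy hyperplane. The values of `w`, `b` and the three points are hypothetical, chosen only so that each point sits exactly on a margin boundary.

```python
import numpy as np

# Hypothetical hyperplane parameters, chosen only for illustration
w = np.array([0.4, 0.4])
b = -1.4

# Toy points: two from the negative class, one from the positive class
X = np.array([[0.0, 1.0], [1.0, 0.0], [3.0, 3.0]])
y = np.array([-1, -1, 1])

scores = X @ w + b                                # w · x + b for each point
margin = 2.0 / np.linalg.norm(w)                  # gap between the two margin planes
distances = np.abs(scores) / np.linalg.norm(w)    # perpendicular distance to the hyperplane

print("scores:   ", scores.round(4))
print("margin:   ", round(margin, 4))
print("distances:", distances.round(4))
```

Each point lies at distance margin / 2 from the hyperplane, which is what the conditions w · x + b = ±1 express.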

💡

Why maximise the margin?

A larger margin leaves more room for error: points far from the boundary are classified with high confidence, and small perturbations of the data are less likely to flip a label. Margin-based bounds from VC theory tie a wider margin to a tighter upper bound on generalisation error, even for small training sets. This is one reason SVMs often hold up well against logistic regression when labelled data are scarce (as in DFT databases).

2. The Optimisation Problem

Maximising the margin 2/‖w‖ is equivalent to minimising ½‖w‖² subject to the constraint that every training point is correctly classified and lies on or outside the margin boundaries.

🔧 Hard-margin SVM (primal form)

minimise ½ ‖w‖²
subject to yᵢ(w · xᵢ + b) ≥ 1 for all i

yᵢ ∈ {−1, +1} = class labels
Solved by quadratic programming (QP)

🔧 Soft-margin SVM (real data — with slack ξ)

minimise ½ ‖w‖² + C · Σᵢ ξᵢ
subject to yᵢ(w · xᵢ + b) ≥ 1 − ξᵢ, ξᵢ ≥ 0

ξᵢ = slack variable (how much point i violates the margin)
C = regularisation parameter: large C → tight margin (low bias, high variance)
small C → wide margin (higher bias, lower variance)
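
The slack variables are easy to compute by hand once w and b are fixed: at the optimum, ξᵢ = max(0, 1 − yᵢ(w · xᵢ + b)). The sketch below evaluates the slack and the soft-margin objective for a hypothetical hyperplane and four toy points; none of these numbers come from the crystal dataset.

```python
import numpy as np

# Hypothetical hyperplane, penalty and toy points, for illustration only
w = np.array([1.0, 0.0])
b = 0.0
C = 1.0

X = np.array([
    [2.0, 0.0],    # correct side, outside the margin
    [0.5, 1.0],    # correct side, but inside the margin
    [-0.5, 0.0],   # wrong side of the boundary
    [-3.0, 1.0],   # correct side, outside the margin
])
y = np.array([1, 1, 1, -1])

# Slack: xi_i = max(0, 1 - y_i (w · x_i + b))
xi = np.maximum(0.0, 1.0 - y * (X @ w + b))

# Soft-margin objective: 1/2 ||w||^2 + C * sum(xi)
objective = 0.5 * (w @ w) + C * xi.sum()

print("slack per point:", xi)
print("objective:", objective)
```

Only the second and third points have non-zero slack. Increasing `C` makes those violations more expensive relative to the ½‖w‖² term.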

⚠️

Choosing C is critical

In materials datasets with noisy DFT labels (PBE vs. HSE band gaps, finite k-point sampling errors), a small C is preferred — it tolerates a few misclassified training points in exchange for a wider, more robust margin. Always tune C with cross-validation, not on the training set.
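
The effect of C on the margin can be checked directly. The sketch below fits a linear SVM with a small and a large C on synthetic, overlapping two-class data (generated with `make_blobs`, not real materials data) and reports the margin 2 / ‖w‖ and the number of support vectors in each case.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic, overlapping two-class data (illustrative only)
X, y = make_blobs(n_samples=60, centers=[[-1.0, 0.0], [1.0, 0.0]],
                  cluster_std=1.0, random_state=0)

results = {}
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)      # margin width 2 / ||w||
    n_sv = int(clf.n_support_.sum())              # total number of support vectors
    results[C] = (margin, n_sv)
    print(f"C={C:<6} margin={margin:.3f} support vectors={n_sv}")
```

The small-C model should report the wider margin. Which value generalises better is a question for cross-validation, not for this training-set comparison.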

3. Support Vectors — The Points That Matter

The optimal hyperplane is determined entirely by a small subset of training points, the support vectors: those that lie exactly on the margin boundaries (|w · xᵢ + b| = 1) and, in the soft-margin case, any points that fall inside the margin or on the wrong side of it. All other training points can be removed without changing the decision boundary.
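
The claim that the other points can be dropped is easy to test numerically. The sketch below uses a small, linearly separable toy set (hypothetical coordinates) and a very large C to approximate the hard-margin case, then refits on the support vectors alone and compares the two hyperplanes.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (illustrative only)
X = np.array([[0, 0], [0, 1], [1, 0], [-1, -1],
              [3, 3], [3, 4], [4, 3], [5, 5]], dtype=float)
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

# A very large C approximates the hard-margin SVM on separable data
full = SVC(kernel="linear", C=1e6).fit(X, y)
sv_idx = full.support_                     # indices of the support vectors

# Refit using only the support vectors
reduced = SVC(kernel="linear", C=1e6).fit(X[sv_idx], y[sv_idx])

print("support vector indices:", sv_idx)
print("all points:      w =", full.coef_.ravel().round(3),
      "b =", full.intercept_.round(3))
print("support vectors: w =", reduced.coef_.ravel().round(3),
      "b =", reduced.intercept_.round(3))
```

Both fits should agree up to the solver tolerance, even though the second one sees only a fraction of the data.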

🔬

Physical meaning for crystal classification

Support vectors are the "borderline" compounds — those whose crystal structure is closest to ambiguous (e.g. a near-cubic distorted perovskite with tolerance factor τ ≈ 0.88, right at the cubic/non-cubic boundary). They are the most informative compounds in the training set; adding more points far from the boundary does not improve the model.

4. The Kernel Trick — Non-linear Boundaries Without the Cost

Most crystal property relationships are non-linear. Fortunately, the SVM dual formulation only ever requires inner products xᵢ · xⱼ between training points — never the coordinates themselves. The kernel trick replaces this inner product with a kernel function K(xᵢ, xⱼ) that implicitly computes the inner product in a (potentially infinite-dimensional) feature space.
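
A small worked case shows what "implicitly" means here. For 2D inputs, the homogeneous degree-2 polynomial kernel K(x, z) = (x · z)² equals the ordinary inner product of the explicit features φ(x) = (x₁², √2·x₁x₂, x₂²). The helper names `phi` and `poly_kernel` below are illustrative, not part of any library.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2D input (illustrative helper)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: K(x, z) = (x · z)^2."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = phi(x) @ phi(z)     # inner product in the 3D feature space
implicit = poly_kernel(x, z)   # same quantity, computed in the original 2D space

print("explicit feature-space product:", explicit)
print("kernel value:                  ", implicit)
```

The kernel never builds φ(x). For the RBF kernel the corresponding feature space is infinite-dimensional, so the explicit route is not even available.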

🌀 Common kernel functions

Linear: K(xᵢ, xⱼ) = xᵢ · xⱼ
Polynomial: K(xᵢ, xⱼ) = (γ xᵢ · xⱼ + r)^d
RBF / Gaussian: K(xᵢ, xⱼ) = exp(−γ ‖xᵢ − xⱼ‖²)
Sigmoid: K(xᵢ, xⱼ) = tanh(γ xᵢ · xⱼ + r)

γ controls how far the influence of a single training point reaches.
Small γ → broad influence (smoother boundary).
Large γ → local influence (wiggly boundary, risk of overfitting).
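
To get a feel for γ, the following sketch evaluates the RBF kernel between a reference point and three neighbours at increasing distance, for a small and a large γ. The points and γ values are arbitrary illustrations. The result is cross-checked against `sklearn.metrics.pairwise.rbf_kernel`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf(x, z, gamma):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([0.0, 0.0])
neighbours = np.array([[0.5, 0.0], [1.0, 0.0], [2.0, 0.0]])

kernel_values = {}
for gamma in (0.1, 10.0):
    kernel_values[gamma] = np.array([rbf(x, n, gamma) for n in neighbours])
    print(f"gamma={gamma:<5}", kernel_values[gamma].round(4))

# Cross-check against scikit-learn's implementation for gamma = 0.1
reference = rbf_kernel(x.reshape(1, -1), neighbours, gamma=0.1).ravel()
```

With γ = 0.1 even the farthest neighbour still has a sizeable kernel value, whereas with γ = 10 the similarity is essentially zero beyond the nearest point.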

🧪

Which kernel for crystal properties?

The RBF kernel is the default starting point: it can represent a wide range of smooth boundary shapes and works well with standardised features. Polynomial kernels (degree 2–3) are good when interaction terms between features matter (e.g. τ × ΔEN). For materials fingerprints (Coulomb matrix, SOAP descriptors), specialised kernels such as the smooth-overlap-of-atomic-positions (SOAP) kernel often outperform generic ones.

5. Our Dataset — Crystal Symmetry Classification

Twelve compounds labelled by crystal symmetry: Cubic (+1) or Non-cubic (−1). Features: Goldschmidt-like tolerance factor τ, electronegativity difference ΔEN, and average coordination number CN.

Compound	τ (tolerance factor)	ΔEN	CN	Structure	Label
NaCl	1.00	1.78	6	Rock-salt	Cubic
MgO	0.99	2.13	6	Rock-salt	Cubic
BaTiO₃	1.06	1.89	6	Perovskite	Cubic
SrTiO₃	1.00	1.97	6	Perovskite	Cubic
CaF₂	1.03	2.98	8	Fluorite	Cubic
FeO	0.97	1.61	6	Rock-salt	Cubic
TiO₂	0.78	1.90	6	Rutile	Non-cubic
ZnO	0.71	1.79	4	Wurtzite	Non-cubic
Al₂O₃	0.65	2.03	6	Corundum	Non-cubic
FeS	0.72	0.43	6	NiAs-type	Non-cubic
MnO	0.96	1.55	6	Rock-salt	Cubic
CrO₂	0.74	1.54	6	Rutile	Non-cubic

🔬

Tolerance factor and crystal symmetry

The Goldschmidt tolerance factor τ = (r_A + r_O) / [√2 (r_B + r_O)] predicts perovskite stability: τ ≈ 1.0 → ideal cubic, τ < 0.9 → distorted/non-cubic. Combined with ΔEN and CN, it forms a physically motivated feature set that the SVM can use to find a non-linear boundary between cubic and non-cubic crystal families.
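
As a quick check of the formula, the sketch below computes τ for the two perovskites in the table. The ionic radii are Shannon values in Å, taken here as assumptions (A-site in 12-fold coordination, Ti⁴⁺ and O²⁻ in 6-fold coordination). The function name `tolerance_factor` is illustrative.

```python
import math

def tolerance_factor(r_a, r_b, r_x):
    """Goldschmidt tolerance factor for an ABX3 perovskite (radii in Å)."""
    return (r_a + r_x) / (math.sqrt(2) * (r_b + r_x))

# Assumed Shannon ionic radii in Å
R_O = 1.40     # O2-, 6-coordinate
R_TI = 0.605   # Ti4+, 6-coordinate
R_SR = 1.44    # Sr2+, 12-coordinate
R_BA = 1.61    # Ba2+, 12-coordinate

tau_srtio3 = tolerance_factor(R_SR, R_TI, R_O)
tau_batio3 = tolerance_factor(R_BA, R_TI, R_O)

print(f"SrTiO3: tau = {tau_srtio3:.2f}")
print(f"BaTiO3: tau = {tau_batio3:.2f}")
```

Under these assumed radii, both values round to the τ entries listed for SrTiO₃ and BaTiO₃ in the dataset table.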

6. Python Implementation

Linear SVM

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report
from sklearn.model_selection import LeaveOneOut, cross_val_score
import numpy as np

# Features: [τ, ΔEN, CN]   Labels: +1 = Cubic, -1 = Non-cubic
X = np.array([
    [1.00,1.78,6], [0.99,2.13,6], [1.06,1.89,6], [1.00,1.97,6],
    [1.03,2.98,8], [0.97,1.61,6], [0.78,1.90,6], [0.71,1.79,4],
    [0.65,2.03,6], [0.72,0.43,6], [0.96,1.55,6], [0.74,1.54,6]
])
y = np.array([1,1,1,1,1,1,-1,-1,-1,-1,1,-1])

# Linear SVM: always standardise features before SVM!
lin_svm = make_pipeline(StandardScaler(), SVC(kernel='linear', C=1.0))
lin_svm.fit(X, y)

# Training-set report (optimistic: the model has seen these points)
print(classification_report(y, lin_svm.predict(X)))

# Number of support vectors
print("Support vectors per class:", lin_svm[-1].n_support_)

RBF Kernel SVM + Hyperparameter Tuning

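# Reuses X, y, SVC, StandardScaler and make_pipeline from the linear SVM block above
from sklearn.model_selection import GridSearchCV, LeaveOneOut

# Compound names, in the same row order as X
names = ["NaCl", "MgO", "BaTiO3", "SrTiO3", "CaF2", "FeO",
         "TiO2", "ZnO", "Al2O3", "FeS", "MnO", "CrO2"]

# Grid search over C and γ
param_grid = {
    'svc__C':     [0.1, 1, 10, 100],
    'svc__gamma': ['scale', 'auto', 0.1, 1.0]
}

rbf_pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))

# Leave-one-out CV: each compound is held out once
gs = GridSearchCV(rbf_pipe, param_grid, cv=LeaveOneOut(), scoring='accuracy')
gs.fit(X, y)

print(f"Best C={gs.best_params_['svc__C']}, γ={gs.best_params_['svc__gamma']}")
print(f"Best LOO accuracy: {gs.best_score_:.2f}")

# Decision function (signed margin scores; positive = Cubic)
best_svm = gs.best_estimator_
scores = best_svm.decision_function(X)
for name, score in zip(names, scores):
    print(f"  {name:<8} margin score = {score:+.3f}")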

7. SVM vs Other Models — When to Use SVM

Criterion	Logistic Reg.	Random Forest	SVM (RBF)
Small dataset (< 100)	OK	OK	Best
High-dimensional features	OK	Slow	Excellent
Non-linear boundary	No	Yes	Yes (kernel)
Probability output	Native	Native	Requires Platt scaling
Interpretability	Coefficients	Feature importance	Support vectors only
Scaling needed	Yes	No	Yes (critical)
Large dataset (> 10k)	Fast	Fast	Slow (O(n²)–O(n³))

⚠️

Always standardise before SVM

The SVM margin is a Euclidean distance in feature space, so it is sensitive to the scale of each feature. In this dataset τ spans only 0.65 to 1.06 while CN spans 4 to 8, so without scaling the CN axis dominates the distance calculation and the SVM effectively ignores τ. StandardScaler is not optional for SVM; it is mandatory.
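
The imbalance is easy to quantify. The sketch below takes the τ and CN columns from the dataset table and measures how much of the squared Euclidean distance between NaCl and ZnO comes from τ, before and after `StandardScaler`. The helper `tau_share` is illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# τ and CN columns from the dataset table, in table order
tau = np.array([1.00, 0.99, 1.06, 1.00, 1.03, 0.97,
                0.78, 0.71, 0.65, 0.72, 0.96, 0.74])
cn = np.array([6, 6, 6, 6, 8, 6, 6, 4, 6, 6, 6, 6], dtype=float)
X = np.column_stack([tau, cn])

def tau_share(a, b):
    """Fraction of the squared Euclidean distance between a and b due to τ."""
    d2 = (a - b) ** 2
    return d2[0] / d2.sum()

nacl, zno = 0, 7                          # row indices of NaCl and ZnO
X_std = StandardScaler().fit_transform(X)

share_raw = tau_share(X[nacl], X[zno])
share_std = tau_share(X_std[nacl], X_std[zno])

print(f"τ share of squared distance, raw features:          {share_raw:.1%}")
print(f"τ share of squared distance, standardised features: {share_std:.1%}")
```

On the raw features τ accounts for only a few percent of the distance between these two compounds. After standardisation the two features contribute on a comparable scale.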

8. Multi-class SVM — One-vs-One Strategy

SVMs are inherently binary classifiers. For multi-class problems (e.g. Metal / Semiconductor / Insulator), scikit-learn's SVC uses one-vs-one (OvO): it trains one binary SVM for every pair of classes (3 classes → 3 SVMs) and takes the majority vote. Alternatively, one-vs-rest (OvR) trains one SVM per class against all others.

🔀 Multi-class strategies

One-vs-One (OvO): n(n−1)/2 classifiers → 3 SVMs for 3 classes
One-vs-Rest (OvR): n classifiers → 3 SVMs for 3 classes

Predict: OvO → majority vote across all pairwise SVMs
OvR → class with highest decision function score
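
With three classes both strategies happen to need three classifiers, so the sketch below uses four synthetic classes (generated with `make_blobs`, illustrative only) to make the counts differ: n(n−1)/2 = 6 pairwise scores versus n = 4 per-class scores. Note that `SVC` always trains pairwise (OvO) models internally; `decision_function_shape="ovr"` only changes the shape of the returned scores.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic four-class data (illustrative only)
X, y = make_blobs(n_samples=80, centers=4, random_state=0)

ovo = SVC(kernel="rbf", decision_function_shape="ovo").fit(X, y)
ovr = SVC(kernel="rbf", decision_function_shape="ovr").fit(X, y)

n_classes = len(ovo.classes_)
n_ovo_scores = ovo.decision_function(X).shape[1]   # one score per pair of classes
n_ovr_scores = ovr.decision_function(X).shape[1]   # one score per class

print("classes:", n_classes)
print("OvO score columns:", n_ovo_scores)
print("OvR-shaped score columns:", n_ovr_scores)
```

For a true one-vs-rest fit, with one independently trained SVM per class, scikit-learn provides `sklearn.multiclass.OneVsRestClassifier` as a wrapper.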

Quick Check

1. What are "support vectors" in an SVM?

A. All training points used to compute the hyperplane
B. The training points closest to the decision boundary — the ones that define the margin
C. The feature weights w of the hyperplane
D. Vectors that point in the direction of maximum variance

2. In a soft-margin SVM, what happens when you set C to a very large value?

A. The margin becomes very wide and the model underfits
B. The model tries hard to classify all training points correctly — narrow margin, risk of overfitting
C. The kernel function is deactivated
D. The number of support vectors increases to include all training points

3. Why must features always be standardised before training an SVM?

A. SVM requires integer-valued features
B. Standardisation converts the data to probabilities
C. The SVM margin is measured in Euclidean distance — features on larger scales would dominate and make other features irrelevant
D. StandardScaler removes outliers that would become support vectors

