Learn how SVMs find the widest possible gap between classes, why support vectors alone define the boundary, and how the kernel trick extends SVMs to non-linear crystal property classification — without ever computing high-dimensional coordinates explicitly.
Support Vector Machine (SVM)
Maximum margin hyperplane
Crystal symmetry classification
Kernel function (RBF, polynomial)
Decision trees and random forests (Post 7) partition feature space with axis-aligned rectangles. Support Vector Machines (SVMs) take a fundamentally different approach: they search for the single hyperplane (line in 2D, plane in 3D, hyperplane in higher dimensions) that separates the classes with the maximum possible margin — the widest gap between the boundary and the nearest training points of each class. This maximum-margin principle gives SVMs excellent generalisation even with few training samples, making them particularly well-suited to materials science datasets, where DFT-labelled compounds are expensive to compute.
Twelve inorganic compounds (oxides, halides, and one sulfide) labelled by crystal symmetry group: cubic (rock-salt / fluorite / perovskite) vs. non-cubic (rutile / wurtzite / corundum / NiAs-type), using tolerance factor τ, electronegativity difference ΔEN, and average coordination number CN as features.
1. The Maximum-Margin Hyperplane
Suppose our training data are linearly separable in feature space — there exists at least one hyperplane that correctly separates all positive from all negative examples. Many such hyperplanes exist; which one should we choose? The SVM selects the one that maximises the margin — the perpendicular distance from the hyperplane to the nearest training point on each side.
w = weight vector (normal to the hyperplane)
b = bias (offset from origin)
Margin = 2 / ‖w‖
Positive class: w · x + b ≥ +1
Negative class: w · x + b ≤ −1
A larger margin leaves more room for error: points far from the boundary keep their predicted label even if their features are perturbed slightly. Margin-based bounds from VC theory tie a wider margin to a tighter upper bound on generalisation error, even for small training sets. This helps explain why SVMs often remain competitive with, or outperform, logistic regression when labelled data are scarce (as in DFT databases).
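The margin formula can be checked with a few lines of arithmetic. The sketch below uses a hypothetical 2D hyperplane (the values of `w`, `b`, and the two points are illustrative assumptions, not taken from the crystal dataset) and computes the margin width 2 / ‖w‖ together with the signed distance of one point on each margin boundary.

```python
import numpy as np

# Hypothetical 2D hyperplane, chosen so that the norm of w is exactly 5.
w = np.array([3.0, 4.0])
b = -5.0

def signed_distance(x, w, b):
    """Signed perpendicular distance from point x to the hyperplane w.x + b = 0."""
    return (np.dot(w, x) + b) / np.linalg.norm(w)

# Distance between the two margin boundaries w.x + b = +1 and w.x + b = -1.
margin_width = 2.0 / np.linalg.norm(w)

x_pos = np.array([2.0, 0.0])  # w.x + b = +1, lies on the positive margin boundary
x_neg = np.array([0.0, 1.0])  # w.x + b = -1, lies on the negative margin boundary

print("margin width:", margin_width)
print("signed distance of x_pos:", signed_distance(x_pos, w, b))
print("signed distance of x_neg:", signed_distance(x_neg, w, b))
```

Each boundary point sits at half the margin width from the hyperplane, one on each side, which is where the factor 2 in the numerator comes from.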
2. The Optimisation Problem
Maximising the margin 2/‖w‖ is equivalent to minimising ½‖w‖² subject to the constraint that all training points are correctly classified and lie outside the margin.
Hard margin: minimise ½‖w‖² over w and b
subject to yᵢ(w · xᵢ + b) ≥ 1 for all i
yᵢ ∈ {−1, +1} = class labels
Solved by quadratic programming (QP)
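To make the hard-margin problem concrete, here is a minimal sketch that hands it to a general-purpose constrained optimiser. The toy data (`X_toy`, `y_toy`) are hypothetical, and SciPy's SLSQP method stands in for a dedicated QP solver; production SVM libraries solve the dual problem with specialised algorithms instead.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy data: two points per class, symmetric about the origin.
X_toy = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y_toy = np.array([1.0, 1.0, -1.0, -1.0])

def objective(params):
    """Half the squared norm of w; params = [w1, w2, b]."""
    w = params[:2]
    return 0.5 * np.dot(w, w)

def margin_constraints(params):
    """Each entry must be >= 0, i.e. y_i (w.x_i + b) >= 1."""
    w, b = params[:2], params[2]
    return y_toy * (X_toy @ w + b) - 1.0

result = minimize(
    objective,
    x0=np.zeros(3),
    method="SLSQP",
    constraints=[{"type": "ineq", "fun": margin_constraints}],
)
w_opt, b_opt = result.x[:2], result.x[2]

print("w =", w_opt, " b =", b_opt)
print("margin width =", 2.0 / np.linalg.norm(w_opt))
```

For this symmetric layout the closest opposite-class points are (1, 1) and (−1, −1), so the margin width should come out close to the distance between them, 2√2.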
Soft margin: minimise ½‖w‖² + C Σᵢ ξᵢ over w, b and ξ
subject to yᵢ(w · xᵢ + b) ≥ 1 − ξᵢ, ξᵢ ≥ 0 for all i
ξᵢ = slack variable (how much point i violates the margin)
C = regularisation parameter: large C → tight margin (low bias, high variance)
small C → wide margin (higher bias, lower variance)
In materials datasets with noisy DFT labels (PBE vs. HSE band gaps, finite k-point sampling errors), a smaller C is often the safer choice: it tolerates a few misclassified training points in exchange for a wider, more robust margin. Always tune C with cross-validation, not on the training set.
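The effect of C can be seen directly on the twelve compounds from this post. The sketch below is an illustration only: the three C values are arbitrary choices, and the features are standardised up front so that the margin width 2 / ‖w‖ is measured in scaled units.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Features: [tau, delta_EN, CN]; labels: +1 = cubic, -1 = non-cubic (from the post's table)
X = np.array([
    [1.00, 1.78, 6], [0.99, 2.13, 6], [1.06, 1.89, 6], [1.00, 1.97, 6],
    [1.03, 2.98, 8], [0.97, 1.61, 6], [0.78, 1.90, 6], [0.71, 1.79, 4],
    [0.65, 2.03, 6], [0.72, 0.43, 6], [0.96, 1.55, 6], [0.74, 1.54, 6],
])
y = np.array([1, 1, 1, 1, 1, 1, -1, -1, -1, -1, 1, -1])
X_std = StandardScaler().fit_transform(X)

margins = {}
for C in [0.01, 1.0, 100.0]:  # arbitrary illustrative values
    svm = SVC(kernel="linear", C=C).fit(X_std, y)
    margins[C] = 2.0 / np.linalg.norm(svm.coef_)
    print(f"C={C:<6} margin width={margins[C]:.3f}  "
          f"support vectors={svm.n_support_.sum()}")
```

The smallest C should give the widest margin, typically with more training points ending up as support vectors inside it.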
3. Support Vectors — The Points That Matter
The optimal hyperplane is determined entirely by a small subset of training points, the support vectors. In the hard-margin case these lie exactly on the margin boundaries (|w · xᵢ + b| = 1); in the soft-margin case they also include points inside the margin or on the wrong side of it (ξᵢ > 0). All other training points can be removed without changing the decision boundary.
Support vectors are the "borderline" compounds — those whose crystal structure is closest to ambiguous (e.g. a near-cubic distorted perovskite with tolerance factor τ ≈ 0.88, right at the cubic/non-cubic boundary). They are the most informative compounds in the training set; adding more points far from the boundary does not improve the model.
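The claim that only the support vectors matter can be checked numerically. The sketch below fits a linear SVM on all twelve compounds, then refits on the support vectors alone and compares the two weight vectors. It assumes a large C (here 1e3) as a stand-in for the hard-margin case, and scales the features once so that both fits share the same coordinates.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Features: [tau, delta_EN, CN]; labels: +1 = cubic, -1 = non-cubic (from the post's table)
X = np.array([
    [1.00, 1.78, 6], [0.99, 2.13, 6], [1.06, 1.89, 6], [1.00, 1.97, 6],
    [1.03, 2.98, 8], [0.97, 1.61, 6], [0.78, 1.90, 6], [0.71, 1.79, 4],
    [0.65, 2.03, 6], [0.72, 0.43, 6], [0.96, 1.55, 6], [0.74, 1.54, 6],
])
y = np.array([1, 1, 1, 1, 1, 1, -1, -1, -1, -1, 1, -1])
X_std = StandardScaler().fit_transform(X)

# Large C approximates the hard-margin SVM (an assumption made for this demo).
full = SVC(kernel="linear", C=1e3, tol=1e-6).fit(X_std, y)
sv_idx = full.support_  # row indices of the support vectors

# Refit using only the support vectors, in the same scaled coordinates.
reduced = SVC(kernel="linear", C=1e3, tol=1e-6).fit(X_std[sv_idx], y[sv_idx])

sv_scores = np.abs(full.decision_function(X_std[sv_idx]))
print("support vector rows:", sv_idx)
print("|w.x + b| at support vectors:", sv_scores)
print("w from all 12 points:       ", full.coef_.ravel())
print("w from support vectors only:", reduced.coef_.ravel())
```

Up to solver tolerance, the two weight vectors should agree and the support vectors should sit at |w · x + b| = 1.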
4. The Kernel Trick — Non-linear Boundaries Without the Cost
Most crystal property relationships are non-linear. Fortunately, the SVM dual formulation only ever requires inner products xᵢ · xⱼ between training points — never the coordinates themselves. The kernel trick replaces this inner product with a kernel function K(xᵢ, xⱼ) that implicitly computes the inner product in a (potentially infinite-dimensional) feature space.
Polynomial: K(xᵢ, xⱼ) = (γ xᵢ · xⱼ + r)^d
RBF / Gaussian: K(xᵢ, xⱼ) = exp(−γ ‖xᵢ − xⱼ‖²)
Sigmoid: K(xᵢ, xⱼ) = tanh(γ xᵢ · xⱼ + r)
γ controls how far the influence of a single training point reaches.
Small γ → broad influence (smoother boundary).
Large γ → local influence (wiggly boundary, risk of overfitting).
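A small numerical check makes both ideas tangible. The sketch below takes the polynomial kernel with d = 2, γ = 1, r = 1 in two dimensions, writes out the matching 6-dimensional feature map by hand, and confirms that the kernel value equals the explicit inner product. It then evaluates the RBF kernel for two points at unit distance with a small and a large γ. The test points and γ values are arbitrary illustrative choices.

```python
import numpy as np

def poly_kernel(x, z):
    """Degree-2 polynomial kernel with gamma = 1 and r = 1, computed in the original 2D space."""
    return (np.dot(x, z) + 1.0) ** 2

def explicit_map(x):
    """Explicit 6D feature map whose ordinary inner product reproduces poly_kernel."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

def rbf_kernel_value(x, z, gamma):
    """RBF kernel exp(-gamma * squared Euclidean distance)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([0.5, -1.0])
z = np.array([2.0, 0.3])

k_implicit = poly_kernel(x, z)                          # never leaves 2D
k_explicit = np.dot(explicit_map(x), explicit_map(z))   # computed in 6D
print("kernel value:", k_implicit, " explicit inner product:", k_explicit)

# Two points at unit distance: small gamma keeps them similar, large gamma does not.
p, q = np.array([0.0, 0.0]), np.array([1.0, 0.0])
k_small_gamma = rbf_kernel_value(p, q, gamma=0.1)
k_large_gamma = rbf_kernel_value(p, q, gamma=10.0)
print("RBF, gamma=0.1:", k_small_gamma, " gamma=10:", k_large_gamma)
```

For the RBF kernel no finite explicit map exists, which is why the implicit form is the only practical route there.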
The RBF kernel is a common default starting point: it assumes only that the boundary is smooth and works well with standardised features. Polynomial kernels (degree 2–3) are good when interaction terms between features matter (e.g. τ × ΔEN). For materials fingerprints (Coulomb matrix, SOAP descriptors), specialised kernels like the smooth-overlap-of-atomic-positions (SOAP) kernel often outperform generic ones.
5. Our Dataset — Crystal Symmetry Classification
Twelve compounds labelled by crystal symmetry: Cubic (+1) or Non-cubic (−1). Features: Goldschmidt-like tolerance factor τ, electronegativity difference ΔEN, and average coordination number CN.
| Compound | τ (tolerance factor) | ΔEN | CN | Structure | Label |
|---|---|---|---|---|---|
| NaCl | 1.00 | 1.78 | 6 | Rock-salt | Cubic |
| MgO | 0.99 | 2.13 | 6 | Rock-salt | Cubic |
| BaTiO₃ | 1.06 | 1.89 | 6 | Perovskite | Cubic |
| SrTiO₃ | 1.00 | 1.97 | 6 | Perovskite | Cubic |
| CaF₂ | 1.03 | 2.98 | 8 | Fluorite | Cubic |
| FeO | 0.97 | 1.61 | 6 | Rock-salt | Cubic |
| TiO₂ | 0.78 | 1.90 | 6 | Rutile | Non-cubic |
| ZnO | 0.71 | 1.79 | 4 | Wurtzite | Non-cubic |
| Al₂O₃ | 0.65 | 2.03 | 6 | Corundum | Non-cubic |
| FeS | 0.72 | 0.43 | 6 | NiAs-type | Non-cubic |
| MnO | 0.96 | 1.55 | 6 | Rock-salt | Cubic |
| CrO₂ | 0.74 | 1.54 | 6 | Rutile | Non-cubic |
The Goldschmidt tolerance factor τ = (rA + rO) / [√2 (rB + rO)] predicts perovskite stability: τ ≈ 1.0 → ideal cubic, τ < 0.9 → distorted/non-cubic. Combined with ΔEN and CN, it forms a physically motivated feature set that the SVM can use to find a non-linear boundary between cubic and non-cubic crystal families.
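The tolerance factor formula is simple enough to compute by hand. The sketch below is a minimal helper; the ionic radii are assumed approximate Shannon values in Å and are not given in this post, so treat them as illustrative inputs.

```python
import math

def tolerance_factor(r_a, r_b, r_o):
    """Goldschmidt tolerance factor: (rA + rO) / [sqrt(2) * (rB + rO)]."""
    return (r_a + r_o) / (math.sqrt(2.0) * (r_b + r_o))

# Assumed approximate Shannon ionic radii in angstroms (not taken from the post).
R_O = 1.40    # O2-
R_TI = 0.605  # Ti4+, 6-coordinate
R_SR = 1.44   # Sr2+, 12-coordinate
R_BA = 1.61   # Ba2+, 12-coordinate

tau_srtio3 = tolerance_factor(R_SR, R_TI, R_O)
tau_batio3 = tolerance_factor(R_BA, R_TI, R_O)

print(f"SrTiO3: tau = {tau_srtio3:.2f}")
print(f"BaTiO3: tau = {tau_batio3:.2f}")
```

With these radii both values round to the τ entries listed for SrTiO₃ and BaTiO₃ in the table above.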
6. Python Implementation
Linear SVM
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report
from sklearn.model_selection import LeaveOneOut, cross_val_score
import numpy as np
# Features: [τ, ΔEN, CN] Labels: +1 = Cubic, -1 = Non-cubic
X = np.array([
    [1.00, 1.78, 6], [0.99, 2.13, 6], [1.06, 1.89, 6], [1.00, 1.97, 6],
    [1.03, 2.98, 8], [0.97, 1.61, 6], [0.78, 1.90, 6], [0.71, 1.79, 4],
    [0.65, 2.03, 6], [0.72, 0.43, 6], [0.96, 1.55, 6], [0.74, 1.54, 6],
])
y = np.array([1,1,1,1,1,1,-1,-1,-1,-1,1,-1])
# Linear SVM — always standardise features before SVM!
lin_svm = make_pipeline(StandardScaler(), SVC(kernel='linear', C=1.0))
lin_svm.fit(X, y)
print(classification_report(y, lin_svm.predict(X)))
# Number of support vectors
print("Support vectors per class:", lin_svm[-1].n_support_)
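The classification report above is computed on the training data, so it says little about generalisation. A hedged alternative for a dataset this small is leave-one-out cross-validation, sketched below as a self-contained block using the same features and labels. The scaler sits inside the pipeline so that it is refit on each training fold.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Features: [tau, delta_EN, CN]; labels: +1 = cubic, -1 = non-cubic (from the post's table)
X = np.array([
    [1.00, 1.78, 6], [0.99, 2.13, 6], [1.06, 1.89, 6], [1.00, 1.97, 6],
    [1.03, 2.98, 8], [0.97, 1.61, 6], [0.78, 1.90, 6], [0.71, 1.79, 4],
    [0.65, 2.03, 6], [0.72, 0.43, 6], [0.96, 1.55, 6], [0.74, 1.54, 6],
])
y = np.array([1, 1, 1, 1, 1, 1, -1, -1, -1, -1, 1, -1])

lin_svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))

# One fold per compound: train on 11, test on the held-out one.
loo_scores = cross_val_score(lin_svm, X, y, cv=LeaveOneOut())
print(f"LOO accuracy: {loo_scores.mean():.2f} over {len(loo_scores)} folds")
```

With only twelve compounds each fold changes the accuracy by about eight percentage points, so the number is a rough indication rather than a precise estimate.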
RBF Kernel SVM + Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
# Compound names in the same row order as X (taken from the dataset table)
names = ["NaCl", "MgO", "BaTiO3", "SrTiO3", "CaF2", "FeO",
         "TiO2", "ZnO", "Al2O3", "FeS", "MnO", "CrO2"]
# Grid search over C and γ
param_grid = {
    'svc__C': [0.1, 1, 10, 100],
    'svc__gamma': ['scale', 'auto', 0.1, 1.0]
}
rbf_pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
gs = GridSearchCV(rbf_pipe, param_grid, cv=LeaveOneOut(), scoring='accuracy')
gs.fit(X, y)
print(f"Best C={gs.best_params_['svc__C']}, γ={gs.best_params_['svc__gamma']}")
print(f"Best LOO accuracy: {gs.best_score_:.2f}")
# Decision function (signed margin scores; positive = cubic side)
best_svm = gs.best_estimator_
scores = best_svm.decision_function(X)
for name, score in zip(names, scores):
    print(f"  {name:<8} margin score = {score:+.3f}")
7. SVM vs Other Models — When to Use SVM
| Criterion | Logistic Reg. | Random Forest | SVM (RBF) |
|---|---|---|---|
| Small dataset (< 100) | OK | OK | Best |
| High-dimensional features | OK | Slow | Excellent |
| Non-linear boundary | No | Yes | Yes (kernel) |
| Probability output | Native | Native | Requires Platt scaling |
| Interpretability | Coefficients | Feature importance | Support vectors only |
| Scaling needed | Yes | No | Yes (critical) |
| Large dataset (> 10k) | Fast | Fast | Slow (O(n²)–O(n³)) |
The SVM margin is measured with the Euclidean norm ‖w‖, so it depends on the scale of every feature. Here τ spans only 0.65–1.06 while CN spans 4–8, so without scaling the CN axis dominates the distance calculation and the SVM effectively ignores τ. Treat StandardScaler (or an equivalent scaler) as a required step of any SVM pipeline, not an optional one.
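The scale mismatch is easy to quantify on the post's own feature matrix. The sketch below prints the raw range of each feature, then confirms that StandardScaler brings every column to zero mean and unit standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Features: [tau, delta_EN, CN] (from the post's table)
X = np.array([
    [1.00, 1.78, 6], [0.99, 2.13, 6], [1.06, 1.89, 6], [1.00, 1.97, 6],
    [1.03, 2.98, 8], [0.97, 1.61, 6], [0.78, 1.90, 6], [0.71, 1.79, 4],
    [0.65, 2.03, 6], [0.72, 0.43, 6], [0.96, 1.55, 6], [0.74, 1.54, 6],
])

raw_range = X.max(axis=0) - X.min(axis=0)
print("raw range  [tau, dEN, CN]:", raw_range)

X_std = StandardScaler().fit_transform(X)
print("scaled mean:", X_std.mean(axis=0).round(6))
print("scaled std: ", X_std.std(axis=0).round(6))
```

In raw units the CN column varies roughly ten times more than τ, which is the imbalance the scaler removes.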
8. Multi-class SVM — One-vs-One Strategy
SVMs are inherently binary classifiers. For multi-class problems (Metal / Semiconductor / Insulator), scikit-learn's SVC uses one-vs-one (OvO): it trains one binary SVM for every pair of classes (3 classes → 3 SVMs) and takes the majority vote. Alternatively, one-vs-rest (OvR) trains one SVM per class against all others.
One-vs-One (OvO): n(n−1)/2 classifiers → 3 SVMs for 3 classes
One-vs-Rest (OvR): n classifiers → 3 SVMs for 3 classes
Predict: OvO → majority vote across all pairwise SVMs
OvR → class with highest decision function score
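With three classes both strategies happen to need three classifiers, so the difference only shows with more classes. The sketch below uses a hypothetical four-class synthetic dataset from `make_blobs` (not materials data) and counts the binary problems each strategy produces: 6 pairwise scores for OvO against 4 estimators for OvR via scikit-learn's `OneVsRestClassifier` wrapper.

```python
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical synthetic data: 4 classes, 20 samples each, 2 features.
X_mc, y_mc = make_blobs(n_samples=80, centers=4, random_state=0)
X_mc = StandardScaler().fit_transform(X_mc)

# One-vs-one: SVC trains n(n-1)/2 = 6 pairwise classifiers for 4 classes.
ovo = SVC(kernel="rbf", decision_function_shape="ovo").fit(X_mc, y_mc)
ovo_scores = ovo.decision_function(X_mc)
print("OvO decision scores per sample:", ovo_scores.shape[1])

# One-vs-rest: one binary SVC per class, 4 in total.
ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X_mc, y_mc)
print("OvR binary classifiers:", len(ovr.estimators_))
```

Note that `decision_function_shape` only changes the shape of the reported scores; SVC trains pairwise classifiers internally either way, which is why the explicit wrapper is used for the OvR case.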
Quick Check
1. What are "support vectors" in an SVM?
2. In a soft-margin SVM, what happens when you set C to a very large value?
3. Why must features always be standardised before training an SVM?