AI & Machine Learning for Materials Sciences

Post 8: Support Vector Machines (SVMs) — with a crystal property example

Learn how SVMs find the widest possible gap between classes, why support vectors alone define the boundary, and how the kernel trick extends SVMs to non-linear crystal property classification — without ever computing high-dimensional coordinates explicitly.

Module 2
📐 Algorithm: Support Vector Machine (SVM)
🎯 Key Idea: Maximum margin hyperplane
🔬 Materials Target: Crystal symmetry classification
🌀 Key Trick: Kernel function (RBF, polynomial)

Decision trees and random forests (Post 7) partition feature space with axis-aligned rectangles. Support Vector Machines (SVMs) take a fundamentally different approach: they search for the single hyperplane (line in 2D, plane in 3D, hyperplane in higher dimensions) that separates the classes with the maximum possible margin — the widest gap between the boundary and the nearest training points of each class. This maximum-margin principle gives SVMs excellent generalisation even with few training samples, making them particularly well-suited to materials science datasets, where DFT-labelled compounds are expensive to compute.

📐
What we will classify

Twelve inorganic compounds (oxides, halides, and a sulfide) labelled by crystal symmetry class: cubic (rock-salt / fluorite / perovskite) vs. non-cubic (rutile / wurtzite / corundum / NiAs-type), using tolerance factor τ, electronegativity difference ΔEN, and average coordination number CN as features.

1. The Maximum-Margin Hyperplane

Suppose our training data are linearly separable in feature space — there exists at least one hyperplane that correctly separates all positive from all negative examples. Many such hyperplanes exist; which one should we choose? The SVM selects the one that maximises the margin — the perpendicular distance from the hyperplane to the nearest training point on each side.

📐 Hyperplane equation
w · x + b = 0

w = weight vector (normal to the hyperplane)
b = bias (offset from origin)
Margin = 2 / ‖w‖

Positive class: w · x + b ≥ +1
Negative class: w · x + b ≤ −1
💡
Why maximise the margin?

A larger margin leaves more room for error: points far from the boundary are classified with high confidence, and small perturbations of the features do not flip the prediction. Margin-based bounds from VC theory tie a larger margin to a tighter upper bound on generalisation error, even for small training sets. This is one reason SVMs are often competitive with, and can outperform, logistic regression when labelled data are scarce (as in DFT databases).
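The relation Margin = 2 / ‖w‖ can be checked numerically. The sketch below uses a hypothetical four-point toy dataset (not the crystal data) and approximates the hard-margin case with a very large C, which is an assumption of this example rather than part of the definition.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: negatives sit at x1 = 0, positives at x1 = 2,
# so the widest separating gap is the strip 0 < x1 < 2.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y_toy = np.array([-1, -1, 1, 1])

# A very large C approximates the hard-margin SVM on separable data.
clf = SVC(kernel="linear", C=1e6).fit(X_toy, y_toy)

w = clf.coef_[0]        # normal vector of the hyperplane
b = clf.intercept_[0]   # bias term
margin = 2.0 / np.linalg.norm(w)

# y_i (w · x_i + b) should be at least 1 for every training point.
functional_margins = y_toy * (X_toy @ w + b)

print("w =", w, " b =", b)
print("margin width =", margin)
print("y_i (w · x_i + b) =", functional_margins)
```

The fitted margin width should come out close to the geometric gap of 2 between the two columns of points, with every training point satisfying the constraint up to solver tolerance.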

2. The Optimisation Problem

Maximising the margin 2/‖w‖ is equivalent to minimising ½‖w‖² subject to the constraint that all training points are correctly classified and lie outside the margin.

🔧 Hard-margin SVM (primal form)
minimise   ½ ‖w‖²
subject to   yᵢ(w · xᵢ + b) ≥ 1   for all i

yᵢ ∈ {−1, +1} = class labels
Solved by quadratic programming (QP)
🔧 Soft-margin SVM (real data — with slack ξ)
minimise   ½ ‖w‖² + C · Σᵢ ξᵢ
subject to   yᵢ(w · xᵢ + b) ≥ 1 − ξᵢ, ξᵢ ≥ 0

ξᵢ = slack variable (how much point i violates the margin)
C = regularisation parameter: large C → tight margin (low bias, high variance)
                          small C → wide margin (higher bias, lower variance)
⚠️
Choosing C is critical

In materials datasets with noisy DFT labels (PBE vs. HSE band gaps, finite k-point sampling errors), a small C is preferred — it tolerates a few misclassified training points in exchange for a wider, more robust margin. Always tune C with cross-validation, not on the training set.
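To see the effect of C concretely, the sketch below refits a linear SVM on the twelve-compound feature matrix from this post for three values of C and reports the margin width 2 / ‖w‖ (in standardised feature units) together with the number of support vectors. The three C values are arbitrary illustrations, not recommendations.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Features: [τ, ΔEN, CN]; labels: +1 = cubic, -1 = non-cubic (same data as the post)
X = np.array([
    [1.00, 1.78, 6], [0.99, 2.13, 6], [1.06, 1.89, 6], [1.00, 1.97, 6],
    [1.03, 2.98, 8], [0.97, 1.61, 6], [0.78, 1.90, 6], [0.71, 1.79, 4],
    [0.65, 2.03, 6], [0.72, 0.43, 6], [0.96, 1.55, 6], [0.74, 1.54, 6],
])
y = np.array([1, 1, 1, 1, 1, 1, -1, -1, -1, -1, 1, -1])

margin_widths = {}
for C in (0.01, 1.0, 100.0):
    model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=C)).fit(X, y)
    svc = model[-1]
    # Margin width in the standardised feature space
    width = 2.0 / np.linalg.norm(svc.coef_)
    margin_widths[C] = width
    print(f"C={C:<6} margin width={width:.3f} "
          f"support vectors={int(svc.n_support_.sum())}")
```

Smaller C should give a wider margin and typically more support vectors, since more points are allowed to sit inside the margin.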

3. Support Vectors — The Points That Matter

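The optimal hyperplane is determined entirely by a small subset of training points, the support vectors. In the hard-margin case these are the points that lie exactly on the margin boundaries (|w · xᵢ + b| = 1); in the soft-margin case they also include any points inside the margin or on the wrong side of the boundary (ξᵢ > 0). All other training points can be removed without changing the decision boundary.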

🔬
Physical meaning for crystal classification

Support vectors are the "borderline" compounds — those whose crystal structure is closest to ambiguous (e.g. a near-cubic distorted perovskite with tolerance factor τ ≈ 0.88, right at the cubic/non-cubic boundary). They are the most informative compounds in the training set; adding more points far from the boundary does not improve the model.
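The claim that non-support vectors can be dropped is easy to test. The sketch below fits a linear SVM on the post's twelve compounds, refits on the support vectors alone, and compares the two weight vectors. Scaling is done once up front (an assumption of this example) so that both fits see identical coordinates.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Features: [τ, ΔEN, CN]; labels: +1 = cubic, -1 = non-cubic (same data as the post)
X = np.array([
    [1.00, 1.78, 6], [0.99, 2.13, 6], [1.06, 1.89, 6], [1.00, 1.97, 6],
    [1.03, 2.98, 8], [0.97, 1.61, 6], [0.78, 1.90, 6], [0.71, 1.79, 4],
    [0.65, 2.03, 6], [0.72, 0.43, 6], [0.96, 1.55, 6], [0.74, 1.54, 6],
])
y = np.array([1, 1, 1, 1, 1, 1, -1, -1, -1, -1, 1, -1])

# Standardise once so the full and reduced fits share the same feature space.
X_std = StandardScaler().fit_transform(X)

full = SVC(kernel="linear", C=1.0).fit(X_std, y)
sv_idx = full.support_  # row indices of the support vectors

# Refit using only the support vectors.
reduced = SVC(kernel="linear", C=1.0).fit(X_std[sv_idx], y[sv_idx])

print("support vector rows:", sv_idx)
print("w (all 12 points):     ", full.coef_[0])
print("w (support vectors only):", reduced.coef_[0])
```

The two weight vectors should agree up to solver tolerance, which is the numerical counterpart of the statement above.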

4. The Kernel Trick — Non-linear Boundaries Without the Cost

Most crystal property relationships are non-linear. Fortunately, the SVM dual formulation only ever requires inner products xᵢ · xⱼ between training points — never the coordinates themselves. The kernel trick replaces this inner product with a kernel function K(xᵢ, xⱼ) that implicitly computes the inner product in a (potentially infinite-dimensional) feature space.
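A small worked case makes the trick tangible. For the degree-2 polynomial kernel with γ = 1 and r = 0, K(x, z) = (x · z)², and for 2-D inputs the matching explicit feature map is φ(x) = (x₁², √2·x₁x₂, x₂²). The sketch below checks that both routes give the same number; the two input vectors are hypothetical values chosen only for illustration.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-D input vector."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2.0) * x1 * x2, x2**2])

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel with gamma = 1 and r = 0."""
    return float(np.dot(x, z)) ** 2

# Two hypothetical 2-D feature vectors
x = np.array([1.00, 1.78])
z = np.array([0.78, 1.90])

explicit = float(np.dot(phi(x), phi(z)))  # inner product in the mapped 3-D space
implicit = poly2_kernel(x, z)             # same value, computed in the original 2-D space

print("explicit phi(x) · phi(z):", explicit)
print("kernel K(x, z):          ", implicit)
```

Here the mapped space has only three dimensions, so the saving is small; for the RBF kernel the mapped space is infinite-dimensional and the kernel route is the only practical one.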

🌀 Common kernel functions
Linear:      K(xᵢ, xⱼ) = xᵢ · xⱼ
Polynomial:   K(xᵢ, xⱼ) = (γ xᵢ · xⱼ + r)^d
RBF / Gaussian: K(xᵢ, xⱼ) = exp(−γ ‖xᵢ − xⱼ‖²)
Sigmoid:     K(xᵢ, xⱼ) = tanh(γ xᵢ · xⱼ + r)

γ controls how far the influence of a single training point reaches.
Small γ → broad influence (smoother boundary).
Large γ → local influence (wiggly boundary, risk of overfitting).
🧪
Which kernel for crystal properties?

The RBF kernel is the default starting point: it assumes little about the shape of the boundary and works well with standardised features. Polynomial kernels (degree 2–3) are a good choice when interaction terms between features matter (e.g. τ × ΔEN). For structural materials fingerprints (Coulomb matrix, SOAP descriptors), specialised kernels such as the smooth-overlap-of-atomic-positions (SOAP) kernel often outperform generic ones.
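The role of γ in the RBF formula can be seen directly from a kernel matrix. The sketch below computes K by hand for three hypothetical points on a line, cross-checks it against scikit-learn's `rbf_kernel`, and compares a small and a large γ; the γ values are arbitrary.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Three hypothetical points at distances 1 and 3 from the first one
pts = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])

def rbf_by_hand(A, gamma):
    """K[i, j] = exp(-gamma * ||A[i] - A[j]||^2), written out explicitly."""
    sq_dists = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

kernels = {}
for gamma in (0.1, 10.0):
    K = rbf_by_hand(pts, gamma)
    # Cross-check against the library implementation
    assert np.allclose(K, rbf_kernel(pts, gamma=gamma))
    kernels[gamma] = K
    print(f"gamma={gamma}")
    print(np.round(K, 4))
```

With small γ the off-diagonal entries stay close to 1, so every point influences its neighbours; with large γ they collapse towards 0 and each point only "sees" itself, which is what produces wiggly, overfitted boundaries.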

5. Our Dataset — Crystal Symmetry Classification

Twelve compounds labelled by crystal symmetry: Cubic (+1) or Non-cubic (−1). Features: Goldschmidt-like tolerance factor τ, electronegativity difference ΔEN, and average coordination number CN.

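| Compound | τ (tolerance factor) | ΔEN | CN | Structure | Label |
|---|---|---|---|---|---|
| NaCl | 1.00 | 1.78 | 6 | Rock-salt | Cubic |
| MgO | 0.99 | 2.13 | 6 | Rock-salt | Cubic |
| BaTiO₃ | 1.06 | 1.89 | 6 | Perovskite | Cubic |
| SrTiO₃ | 1.00 | 1.97 | 6 | Perovskite | Cubic |
| CaF₂ | 1.03 | 2.98 | 8 | Fluorite | Cubic |
| FeO | 0.97 | 1.61 | 6 | Rock-salt | Cubic |
| TiO₂ | 0.78 | 1.90 | 6 | Rutile | Non-cubic |
| ZnO | 0.71 | 1.79 | 4 | Wurtzite | Non-cubic |
| Al₂O₃ | 0.65 | 2.03 | 6 | Corundum | Non-cubic |
| FeS | 0.72 | 0.43 | 6 | NiAs-type | Non-cubic |
| MnO | 0.96 | 1.55 | 6 | Rock-salt | Cubic |
| CrO₂ | 0.74 | 1.54 | 6 | Rutile | Non-cubic |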
🔬
Tolerance factor and crystal symmetry

The Goldschmidt tolerance factor τ = (rA + rO) / [√2 (rB + rO)] predicts perovskite stability: τ ≈ 1.0 → ideal cubic, τ < 0.9 → distorted/non-cubic. Combined with ΔEN and CN, it forms a physically motivated feature set that the SVM can use to find a non-linear boundary between cubic and non-cubic crystal families.
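As a worked instance of the formula, the sketch below evaluates τ for SrTiO₃. The ionic radii are assumptions of this example (commonly quoted Shannon radii in ångström: Sr²⁺ 12-coordinate 1.44, Ti⁴⁺ 6-coordinate 0.605, O²⁻ 1.40), not values given in this post.

```python
import math

def goldschmidt_tolerance(r_a, r_b, r_o):
    """Goldschmidt tolerance factor: (rA + rO) / [sqrt(2) * (rB + rO)]."""
    return (r_a + r_o) / (math.sqrt(2.0) * (r_b + r_o))

# Assumed Shannon ionic radii in angstrom (not taken from the post):
# Sr2+ (12-coordinate) = 1.44, Ti4+ (6-coordinate) = 0.605, O2- = 1.40
tau_srtio3 = goldschmidt_tolerance(r_a=1.44, r_b=0.605, r_o=1.40)
print(f"tau(SrTiO3) = {tau_srtio3:.3f}")
```

Under these assumed radii the result rounds to the τ = 1.00 listed for SrTiO₃ in the dataset table, consistent with its ideal cubic perovskite structure.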

6. Python Implementation

Linear SVM

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report
from sklearn.model_selection import LeaveOneOut, cross_val_score
import numpy as np

# Features: [τ, ΔEN, CN] Labels: +1 = Cubic, -1 = Non-cubic
X = np.array([
    [1.00,1.78,6], [0.99,2.13,6], [1.06,1.89,6], [1.00,1.97,6],
    [1.03,2.98,8], [0.97,1.61,6], [0.78,1.90,6], [0.71,1.79,4],
    [0.65,2.03,6], [0.72,0.43,6], [0.96,1.55,6], [0.74,1.54,6]
])
y = np.array([1,1,1,1,1,1,-1,-1,-1,-1,1,-1])

# Linear SVM — always standardise features before SVM!
lin_svm = make_pipeline(StandardScaler(), SVC(kernel='linear', C=1.0))
lin_svm.fit(X, y)
print(classification_report(y, lin_svm.predict(X)))

# Number of support vectors
print("Support vectors per class:", lin_svm[-1].n_support_)
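The classification report above is computed on the training data, so it is an optimistic estimate. A hedged alternative for such a small dataset is leave-one-out cross-validation, sketched below as a self-contained block that redefines the same X and y. Because the scaler sits inside the pipeline, it is refitted on each training fold and the held-out compound never leaks into the scaling statistics.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Features: [τ, ΔEN, CN]; labels: +1 = cubic, -1 = non-cubic (same data as above)
X = np.array([
    [1.00, 1.78, 6], [0.99, 2.13, 6], [1.06, 1.89, 6], [1.00, 1.97, 6],
    [1.03, 2.98, 8], [0.97, 1.61, 6], [0.78, 1.90, 6], [0.71, 1.79, 4],
    [0.65, 2.03, 6], [0.72, 0.43, 6], [0.96, 1.55, 6], [0.74, 1.54, 6],
])
y = np.array([1, 1, 1, 1, 1, 1, -1, -1, -1, -1, 1, -1])

pipe = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))

# One fold per compound: train on 11, test on the one left out.
loo_scores = cross_val_score(pipe, X, y, cv=LeaveOneOut())
print(f"LOO accuracy: {loo_scores.mean():.2f} over {len(loo_scores)} folds")
```

With only twelve compounds each fold moves the accuracy by about 8 percentage points, so the number is a rough sanity check rather than a precise estimate.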

RBF Kernel SVM + Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV

# Grid search over C and γ
param_grid = {
    'svc__C': [0.1, 1, 10, 100],
    'svc__gamma': ['scale', 'auto', 0.1, 1.0]
}
rbf_pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
gs = GridSearchCV(rbf_pipe, param_grid, cv=LeaveOneOut(), scoring='accuracy')
gs.fit(X, y)

print(f"Best C={gs.best_params_['svc__C']}, γ={gs.best_params_['svc__gamma']}")
print(f"Best LOO accuracy: {gs.best_score_:.2f}")

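# Decision function (signed margin scores; positive = cubic side)
# Compound names in the same row order as X
names = ["NaCl", "MgO", "BaTiO3", "SrTiO3", "CaF2", "FeO",
         "TiO2", "ZnO", "Al2O3", "FeS", "MnO", "CrO2"]
best_svm = gs.best_estimator_
scores = best_svm.decision_function(X)
for name, score in zip(names, scores):
    print(f" {name:<8} margin score = {score:+.3f}")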

7. SVM vs Other Models — When to Use SVM

| Criterion | Logistic Reg. | Random Forest | SVM (RBF) |
|---|---|---|---|
| Small dataset (< 100) | OK | OK | Best |
| High-dimensional features | OK | Slow | Excellent |
| Non-linear boundary | No | Yes | Yes (kernel) |
| Probability output | Native | Native | Requires Platt scaling |
| Interpretability | Coefficients | Feature importance | Support vectors only |
| Scaling needed | Yes | No | Yes (critical) |
| Large dataset (> 10k) | Fast | Fast | Slow (O(n²)–O(n³)) |
⚠️
Always standardise before SVM

The SVM margin is measured as Euclidean distance in feature space, so it depends on the scale of each feature. Here τ spans only 0.65–1.06 while CN spans 4–8, so without scaling the CN axis dominates the distance calculation and τ contributes very little. Treat StandardScaler as mandatory for SVMs, not optional.
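The scale mismatch is visible in the raw numbers. The sketch below prints the per-feature standard deviation of the post's feature matrix before and after `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Features: [τ, ΔEN, CN] for the twelve compounds in the post
X = np.array([
    [1.00, 1.78, 6], [0.99, 2.13, 6], [1.06, 1.89, 6], [1.00, 1.97, 6],
    [1.03, 2.98, 8], [0.97, 1.61, 6], [0.78, 1.90, 6], [0.71, 1.79, 4],
    [0.65, 2.03, 6], [0.72, 0.43, 6], [0.96, 1.55, 6], [0.74, 1.54, 6],
])

raw_std = X.std(axis=0)                      # spread of each raw feature
X_std = StandardScaler().fit_transform(X)    # zero mean, unit variance per column

for label, before, after in zip(["tau", "dEN", "CN"], raw_std, X_std.std(axis=0)):
    print(f"{label:<4} std before = {before:.3f}   std after = {after:.3f}")
```

Before scaling, CN varies several times more than τ in absolute terms; after scaling, all three features contribute to distances on an equal footing.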

8. Multi-class SVM — One-vs-One Strategy

SVMs are inherently binary classifiers. For multi-class problems (Metal / Semiconductor / Insulator), scikit-learn uses one-vs-one (OvO): it trains one binary SVM for every pair of classes (3 classes → 3 SVMs) and takes the majority vote. Alternatively, one-vs-rest (OvR) trains one SVM per class against all others.

🔀 Multi-class strategies
One-vs-One (OvO): n(n−1)/2 classifiers  →  3 SVMs for 3 classes
One-vs-Rest (OvR): n classifiers  →  3 SVMs for 3 classes

Predict: OvO → majority vote across all pairwise SVMs
          OvR → class with highest decision function score
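With three classes both strategies happen to need three classifiers, so the sketch below uses four synthetic classes (generated with `make_blobs`, purely hypothetical data) to make the difference visible: OvO needs 4·3/2 = 6 pairwise models, OvR needs 4.

```python
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Hypothetical 4-class dataset, used only to count classifiers
X_mc, y_mc = make_blobs(n_samples=80, centers=4, random_state=0)

# SVC trains pairwise (OvO) models internally; asking for the 'ovo'
# decision function exposes one column per pair of classes.
ovo = SVC(kernel="rbf", decision_function_shape="ovo").fit(X_mc, y_mc)
n_ovo = ovo.decision_function(X_mc).shape[1]

# OneVsRestClassifier wraps a binary SVC and fits one model per class.
ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X_mc, y_mc)
n_ovr = len(ovr.estimators_)

print("OvO binary classifiers:", n_ovo)
print("OvR binary classifiers:", n_ovr)
```

OvO trains more models, but each sees only the samples from two classes, which often keeps the total training cost comparable for kernel SVMs.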
📐
App 8 — SVM Explorer: Margin, Kernel & Support Vectors
Adjust C, kernel type and γ interactively. Watch the decision boundary, margin width, and highlighted support vectors update live on our 12-compound crystal symmetry dataset.

Quick Check

1. What are "support vectors" in an SVM?

  • A. All training points used to compute the hyperplane
  • B. The training points closest to the decision boundary — the ones that define the margin
  • C. The feature weights w of the hyperplane
  • D. Vectors that point in the direction of maximum variance

2. In a soft-margin SVM, what happens when you set C to a very large value?

  • A. The margin becomes very wide and the model underfits
  • B. The model tries hard to classify all training points correctly — narrow margin, risk of overfitting
  • C. The kernel function is deactivated
  • D. The number of support vectors increases to include all training points

3. Why must features always be standardised before training an SVM?

  • A. SVM requires integer-valued features
  • B. Standardisation converts the data to probabilities
  • C. The SVM margin is measured in Euclidean distance — features on larger scales would dominate and make other features irrelevant
  • D. StandardScaler removes outliers that would become support vectors