Build your first ML model from scratch: understand the math, fit a line to DFT data, and interpret every coefficient physically.
Linear Regression
Supervised Regression
Band gap Eg (eV)
Lattice constant, ΔEN, Valence
You now know that supervised learning maps inputs to outputs using labelled examples. Linear regression is the simplest possible instance of this idea — and surprisingly powerful for materials science. Before building graph neural networks or random forests, every computational scientist should understand why a line fitted to DFT data can already tell you something physically meaningful.
The electronic band gap Eg (eV) of binary semiconductors from two simple crystal features: the lattice constant a (Å) and the electronegativity difference ΔEN between cation and anion.
1. The Core Idea — A Function that Maps Structure to Property
In Post 4 we defined supervised learning as learning a function f(x) ≈ y from labelled examples. Linear regression is the specific case where that function is a weighted sum of the input features.
ŷ = predicted property (e.g. band gap in eV)
x₁…xₙ = feature values (lattice constant, ΔEN, valence electrons…)
θ₀ = bias / intercept
θ₁…θₙ = weights (coefficients) — one per feature
For our two-feature band gap model this becomes:
a = lattice constant (Å)
ΔEN = electronegativity difference (Pauling scale)
We expect θ₁ < 0 (larger lattice → weaker bonding → narrower gap) and θ₂ > 0 (more ionic character → wider gap). If the fitted coefficients violate these signs, the model is learning noise — a key sanity check.
2. How the Model Learns — Minimising the Cost Function
The model learns by adjusting the weights θ until the predictions are as close as possible to the DFT values. We measure this closeness with the Mean Squared Error (MSE).
N = number of training examples
yᵢ = DFT value (ground truth)
ŷᵢ = model prediction
Goal: find θ that minimises J(θ)
For linear regression, there is an exact analytical solution — no iterative training required. This is called the Normal Equation:
X = design matrix (N × (n+1), includes a column of 1s for the bias)
y = vector of DFT labels (N × 1)
θ* = optimal weights (no iterations needed!)
If you have many features (n > ~10,000), computing (XᵀX)⁻¹ becomes computationally expensive. In that case, use iterative methods such as gradient descent instead. For small materials datasets (hundreds of compounds), the Normal Equation is perfect.
3. Gradient Descent — Learning Step by Step
For larger problems, or when you want to understand how neural networks learn, the iterative method gradient descent is fundamental. The idea: compute how much each weight contributes to the error, and nudge it in the direction that reduces the error.
η (eta) = learning rate — how large a step to take
∂J/∂θⱼ = gradient of the cost w.r.t. weight j
For MSE:
∂J/∂θⱼ = (2/N) · Σᵢ (ŷᵢ − yᵢ) · xᵢⱼ
Too large: the algorithm overshoots and diverges (J increases). Too small: convergence is painfully slow. Typical starting values: η = 0.01 or 0.001. For materials datasets, try 0.01 and monitor the loss curve.
- Initialise all weights θ to zero (or small random values)
- Compute predictions ŷ = Xθ for all training examples
- Compute the MSE cost J(θ)
- Compute gradients ∂J/∂θ for each weight
- Update all weights: θ ← θ − η · ∇J
- Repeat steps 2–5 until J stops decreasing (convergence)
4. Our Training Dataset — DFT Band Gaps of Binary Semiconductors
We use a small curated dataset of 12 binary compounds. All band gap values are from DFT-PBE calculations (standard functional; note that PBE systematically underestimates Eg, but the trends are correct for learning purposes).
| Compound | Type | a (Å) | ΔEN (Pauling) | Valence e⁻ | Eg DFT (eV) |
|---|---|---|---|---|---|
| ZnO | II-VI oxide | 3.25 | 1.79 | 12 | 3.44 |
| ZnS | II-VI | 5.41 | 0.93 | 12 | 2.10 |
| ZnSe | II-VI | 5.67 | 0.64 | 12 | 1.29 |
| ZnTe | II-VI | 6.10 | 0.15 | 12 | 0.74 |
| GaAs | III-V | 5.65 | 0.40 | 13 | 1.42 |
| GaP | III-V | 5.45 | 0.23 | 13 | 1.57 |
| GaN | III-V nitride | 3.19 | 1.23 | 13 | 3.51 |
| InP | III-V | 5.87 | 0.11 | 13 | 0.50 |
| InAs | III-V | 6.06 | 0.17 | 13 | 0.18 |
| Si | IV-IV | 5.43 | 0.00 | 14 | 1.12 |
| Ge | IV-IV | 5.66 | 0.00 | 14 | 0.00 |
| AlN | III-V nitride | 3.11 | 1.43 | 13 | 4.20 |
Lattice constant a encodes the bond length and hence orbital overlap. Electronegativity difference ΔEN encodes the ionicity of the bond. Both are accessible from composition alone — no DFT calculation required. This is what makes linear regression useful: you can screen millions of hypothetical compounds in milliseconds.
5. A Worked Example — Computing θ* by Hand
Using only the first 5 data points (ZnO, ZnS, GaAs, Si, InP) and two features (a, ΔEN), let us apply the Normal Equation step by step.
Step 1 — Build the design matrix X and label vector y
[1, 5.41, 0.93], [2.10],
[1, 5.65, 0.40], [1.42],
[1, 5.43, 0.00], [1.12],
[1, 5.87, 0.11]] [0.50]]
Column 1: all-ones vector for bias θ₀
Column 2: lattice constant a
Column 3: electronegativity diff ΔEN
Step 2 — Apply the Normal Equation θ* = (XᵀX)⁻¹ Xᵀy
After computing the matrix products (shown fully in the interactive app), the fitted coefficients are approximately:
θ₁ = −1.14 (lattice constant coefficient, Å⁻¹)
θ₂ = +1.29 (ΔEN coefficient, eV/Pauling unit)
Prediction: Êg = 7.81 − 1.14·a + 1.29·ΔEN
θ₁ = −1.14 means: adding 1 Å to the lattice constant decreases the predicted band gap by 1.14 eV. This is physically correct — larger bond distances weaken orbital overlap and reduce the bonding–antibonding splitting. θ₂ = +1.29 means: more ionic bonds (larger ΔEN) → wider gap. Also physically correct!
6. Evaluating the Model — R², MAE, and the Parity Plot
After fitting, we need to measure how good the model is. Three metrics are standard in materials science regression:
| Metric | Formula | Ideal value | What it tells you |
|---|---|---|---|
| R² score | 1 − Σ(yᵢ−ŷᵢ)² / Σ(yᵢ−ȳ)² | 1.0 (max) | Fraction of variance explained. R² = 0 means the model is no better than predicting the mean. |
| MAE | (1/N) Σ|yᵢ − ŷᵢ| | 0 (min) | Average absolute error in the same units as the target (eV). Easy to interpret physically. |
| RMSE | √(MSE) | 0 (min) | Penalises large errors more than MAE. Good for detecting outliers. |
For our 5-point example, the two-feature linear model achieves R² ≈ 0.91 and MAE ≈ 0.22 eV. This is a reasonable first result — far from perfect (a neural network on a large dataset might reach MAE < 0.05 eV), but interpretable and fast to train.
Plot DFT values (x-axis) vs. predicted values (y-axis). Perfect model = points on the y = x diagonal. Systematic bias appears as a tilted cluster. Outliers (convergence errors, wrong structure) jump out immediately. This is your most important diagnostic tool.
7. Underfitting and Overfitting — the Bias–Variance Trade-off
Every ML model faces a fundamental tension: it must be complex enough to capture real patterns (low bias) but not so complex that it memorises noise (low variance).
| Problem | Symptom | Cause | Fix |
|---|---|---|---|
| Underfitting (high bias) | High train error AND high test error | Model too simple (e.g. only 1 feature for a complex property) | Add more features, use a more powerful model |
| Overfitting (high variance) | Low train error BUT high test error | Too many features relative to number of training examples | Reduce features, add regularisation (Ridge/Lasso), get more data |
| Good fit | Low train error AND low test error | Right model complexity for the available data | Monitor both curves; use cross-validation |
8. Regularisation — Ridge and Lasso for Materials Descriptors
When you have many features (dozens of structural descriptors) but limited DFT data, overfitting is a real risk. Regularisation adds a penalty term to the cost function that discourages large weights.
λ = regularisation strength (hyperparameter; tune by cross-validation)
Effect: shrinks all weights toward zero, preventing extreme values
Effect: drives many weights exactly to zero → automatic feature selection
Useful when you have 50 descriptors but suspect only 5 actually matter
9. Python Implementation — From Scratch and with scikit-learn
From scratch using NumPy
# Training data: [a (Å), ΔEN]
X_raw = np.array([
[3.25, 1.79], # ZnO
[5.41, 0.93], # ZnS
[5.65, 0.40], # GaAs
[5.43, 0.00], # Si
[5.87, 0.11], # InP
])
y = np.array([3.44, 2.10, 1.42, 1.12, 0.50]) # Eg DFT (eV)
# Add bias column (column of 1s)
N = X_raw.shape[0]
X = np.hstack([np.ones((N, 1)), X_raw])
# Normal Equation: θ* = (XᵀX)⁻¹ Xᵀy
theta = np.linalg.inv(X.T @ X) @ X.T @ y
print(f"θ₀={theta[0]:.3f}, θ₁={theta[1]:.3f}, θ₂={theta[2]:.3f}")
# Predict and evaluate
y_pred = X @ theta
mae = np.mean(np.abs(y - y_pred))
ss_res = np.sum((y - y_pred)**2)
ss_tot = np.sum((y - np.mean(y))**2)
r2 = 1 - ss_res / ss_tot
print(f"R²={r2:.3f}, MAE={mae:.3f} eV")
Using scikit-learn (production workflow)
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score, mean_absolute_error
import numpy as np
# Data (using full 12-compound dataset)
X = np.array([
[3.25,1.79],[5.41,0.93],[5.67,0.64],[6.10,0.15],
[5.65,0.40],[5.45,0.23],[3.19,1.23],[5.87,0.11],
[6.06,0.17],[5.43,0.00],[5.66,0.00],[3.11,1.43]
])
y = np.array([3.44,2.10,1.29,0.74,1.42,1.57,
3.51,0.50,0.18,1.12,0.00,4.20])
# Fit standard linear regression
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)
print(f"R²={r2_score(y, y_pred):.3f}")
print(f"MAE={mean_absolute_error(y, y_pred):.3f} eV")
print(f"Coefficients: a={model.coef_[0]:.3f}, ΔEN={model.coef_[1]:.3f}")
# Cross-validation (leave-one-out for small datasets)
cv_scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(f"CV R² = {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
10. Limitations — When Linear Regression Is Not Enough
| Limitation | Why it matters in materials science | Solution |
|---|---|---|
| Linearity assumption | Band gap vs. lattice constant is not truly linear across all families | Polynomial features, kernel methods, neural networks |
| Feature engineering required | You must hand-craft descriptors; cannot learn from raw crystal structure | Graph neural networks (CGCNN, ALIGNN) learn features automatically |
| No uncertainty estimate | Cannot report confidence intervals for predictions | Bayesian linear regression, Gaussian process regression |
| Extrapolation risk | Predicting a material far outside the training distribution is unreliable | Always check applicability domain; flag predictions outside training range |
Quick Check
1. In linear regression, what does the weight θ₂ = +1.29 for ΔEN tell you physically?
2. You have 8 features and 9 training examples. Which regularisation technique should you prefer?
3. Your linear regression gives training R² = 0.98 but test R² = 0.41. This is most likely: