Header Ads Widget

AI & Machine Learning for Materials Sciences

Last Posts

10/recent/ticker-posts

Post 5: Linear Regression — Predicting Band Gap from Simple Features.

Build your first ML model from scratch: understand the math, fit a line to DFT data, and interpret every coefficient physically.

📐
Algorithm

Linear Regression

🎯
Task

Supervised Regression

🔬
Materials Target

Band gap Eg (eV)

📊
Features

Lattice constant, ΔEN, Valence

You now know that supervised learning maps inputs to outputs using labelled examples. Linear regression is the simplest possible instance of this idea — and surprisingly powerful for materials science. Before building graph neural networks or random forests, every computational scientist should understand why a line fitted to DFT data can already tell you something physically meaningful.

🎯
What we will predict

The electronic band gap Eg (eV) of binary semiconductors from two simple crystal features: the lattice constant a (Å) and the electronegativity difference ΔEN between cation and anion.

1. The Core Idea — A Function that Maps Structure to Property

In Post 4 we defined supervised learning as learning a function f(x) ≈ y from labelled examples. Linear regression is the specific case where that function is a weighted sum of the input features.

📐 Linear regression model
ŷ = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ

ŷ = predicted property (e.g. band gap in eV)
x₁…xₙ = feature values (lattice constant, ΔEN, valence electrons…)
θ₀ = bias / intercept
θ₁…θₙ = weights (coefficients) — one per feature

For our two-feature band gap model this becomes:

🔬 Band gap model
Êg = θ₀ + θ₁ · a + θ₂ · ΔEN

a = lattice constant (Å)
ΔEN = electronegativity difference (Pauling scale)
💡
Physical intuition

We expect θ₁ < 0 (larger lattice → weaker bonding → narrower gap) and θ₂ > 0 (more ionic character → wider gap). If the fitted coefficients violate these signs, the model is learning noise — a key sanity check.

2. How the Model Learns — Minimising the Cost Function

The model learns by adjusting the weights θ until the predictions are as close as possible to the DFT values. We measure this closeness with the Mean Squared Error (MSE).

📉 Cost function (MSE)
J(θ) = (1/N) · Σᵢ (ŷᵢ − yᵢ)²

N = number of training examples
yᵢ = DFT value (ground truth)
ŷᵢ = model prediction
Goal: find θ that minimises J(θ)

For linear regression, there is an exact analytical solution — no iterative training required. This is called the Normal Equation:

🎯 Normal equation (analytical solution)
θ* = (XᵀX)⁻¹ Xᵀy

X = design matrix (N × (n+1), includes a column of 1s for the bias)
y = vector of DFT labels (N × 1)
θ* = optimal weights (no iterations needed!)
⚠️
When the Normal Equation fails

If you have many features (n > ~10,000), computing (XᵀX)⁻¹ becomes computationally expensive. In that case, use iterative methods such as gradient descent instead. For small materials datasets (hundreds of compounds), the Normal Equation is perfect.

3. Gradient Descent — Learning Step by Step

For larger problems, or when you want to understand how neural networks learn, the iterative method gradient descent is fundamental. The idea: compute how much each weight contributes to the error, and nudge it in the direction that reduces the error.

🔄 Gradient descent update rule
θⱼ ← θⱼ − η · (∂J/∂θⱼ)

η (eta) = learning rate — how large a step to take
∂J/∂θⱼ = gradient of the cost w.r.t. weight j

For MSE:
∂J/∂θⱼ = (2/N) · Σᵢ (ŷᵢ − yᵢ) · xᵢⱼ
🎓
The learning rate η matters enormously

Too large: the algorithm overshoots and diverges (J increases). Too small: convergence is painfully slow. Typical starting values: η = 0.01 or 0.001. For materials datasets, try 0.01 and monitor the loss curve.

  • Initialise all weights θ to zero (or small random values)
  • Compute predictions ŷ = Xθ for all training examples
  • Compute the MSE cost J(θ)
  • Compute gradients ∂J/∂θ for each weight
  • Update all weights: θ ← θ − η · ∇J
  • Repeat steps 2–5 until J stops decreasing (convergence)

4. Our Training Dataset — DFT Band Gaps of Binary Semiconductors

We use a small curated dataset of 12 binary compounds. All band gap values are from DFT-PBE calculations (standard functional; note that PBE systematically underestimates Eg, but the trends are correct for learning purposes).

Compound Type a (Å) ΔEN (Pauling) Valence e⁻ Eg DFT (eV)
ZnOII-VI oxide3.251.79123.44
ZnSII-VI5.410.93122.10
ZnSeII-VI5.670.64121.29
ZnTeII-VI6.100.15120.74
GaAsIII-V5.650.40131.42
GaPIII-V5.450.23131.57
GaNIII-V nitride3.191.23133.51
InPIII-V5.870.11130.50
InAsIII-V6.060.17130.18
SiIV-IV5.430.00141.12
GeIV-IV5.660.00140.00
AlNIII-V nitride3.111.43134.20
🔬
Why these features?

Lattice constant a encodes the bond length and hence orbital overlap. Electronegativity difference ΔEN encodes the ionicity of the bond. Both are accessible from composition alone — no DFT calculation required. This is what makes linear regression useful: you can screen millions of hypothetical compounds in milliseconds.

5. A Worked Example — Computing θ* by Hand

Using only the first 5 data points (ZnO, ZnS, GaAs, Si, InP) and two features (a, ΔEN), let us apply the Normal Equation step by step.

Step 1 — Build the design matrix X and label vector y

📋 Design matrix X (5×3) and labels y (5×1)
X = [[1, 3.25, 1.79], y = [[3.44],
[1, 5.41, 0.93], [2.10],
[1, 5.65, 0.40], [1.42],
[1, 5.43, 0.00], [1.12],
[1, 5.87, 0.11]] [0.50]]

Column 1: all-ones vector for bias θ₀
Column 2: lattice constant a
Column 3: electronegativity diff ΔEN

Step 2 — Apply the Normal Equation θ* = (XᵀX)⁻¹ Xᵀy

After computing the matrix products (shown fully in the interactive app), the fitted coefficients are approximately:

✅ Fitted coefficients
θ₀ = 7.81 (intercept)
θ₁ = −1.14 (lattice constant coefficient, Å⁻¹)
θ₂ = +1.29 (ΔEN coefficient, eV/Pauling unit)

Prediction: Êg = 7.81 − 1.14·a + 1.29·ΔEN
Physical interpretation check

θ₁ = −1.14 means: adding 1 Å to the lattice constant decreases the predicted band gap by 1.14 eV. This is physically correct — larger bond distances weaken orbital overlap and reduce the bonding–antibonding splitting. θ₂ = +1.29 means: more ionic bonds (larger ΔEN) → wider gap. Also physically correct!

6. Evaluating the Model — R², MAE, and the Parity Plot

After fitting, we need to measure how good the model is. Three metrics are standard in materials science regression:

MetricFormulaIdeal valueWhat it tells you
R² score 1 − Σ(yᵢ−ŷᵢ)² / Σ(yᵢ−ȳ)² 1.0 (max) Fraction of variance explained. R² = 0 means the model is no better than predicting the mean.
MAE (1/N) Σ|yᵢ − ŷᵢ| 0 (min) Average absolute error in the same units as the target (eV). Easy to interpret physically.
RMSE √(MSE) 0 (min) Penalises large errors more than MAE. Good for detecting outliers.

For our 5-point example, the two-feature linear model achieves R² ≈ 0.91 and MAE ≈ 0.22 eV. This is a reasonable first result — far from perfect (a neural network on a large dataset might reach MAE < 0.05 eV), but interpretable and fast to train.

⚠️
Always use a parity plot

Plot DFT values (x-axis) vs. predicted values (y-axis). Perfect model = points on the y = x diagonal. Systematic bias appears as a tilted cluster. Outliers (convergence errors, wrong structure) jump out immediately. This is your most important diagnostic tool.

7. Underfitting and Overfitting — the Bias–Variance Trade-off

Every ML model faces a fundamental tension: it must be complex enough to capture real patterns (low bias) but not so complex that it memorises noise (low variance).

ProblemSymptomCauseFix
Underfitting (high bias) High train error AND high test error Model too simple (e.g. only 1 feature for a complex property) Add more features, use a more powerful model
Overfitting (high variance) Low train error BUT high test error Too many features relative to number of training examples Reduce features, add regularisation (Ridge/Lasso), get more data
Good fit Low train error AND low test error Right model complexity for the available data Monitor both curves; use cross-validation

8. Regularisation — Ridge and Lasso for Materials Descriptors

When you have many features (dozens of structural descriptors) but limited DFT data, overfitting is a real risk. Regularisation adds a penalty term to the cost function that discourages large weights.

🔒 Ridge regression (L2 penalty)
J_ridge(θ) = MSE + λ · Σⱼ θⱼ²

λ = regularisation strength (hyperparameter; tune by cross-validation)
Effect: shrinks all weights toward zero, preventing extreme values
✂️ Lasso regression (L1 penalty)
J_lasso(θ) = MSE + λ · Σⱼ |θⱼ|

Effect: drives many weights exactly to zero → automatic feature selection
Useful when you have 50 descriptors but suspect only 5 actually matter

9. Python Implementation — From Scratch and with scikit-learn

From scratch using NumPy

import numpy as np

# Training data: [a (Å), ΔEN]
X_raw = np.array([
    [3.25, 1.79], # ZnO
    [5.41, 0.93], # ZnS
    [5.65, 0.40], # GaAs
    [5.43, 0.00], # Si
    [5.87, 0.11], # InP
])
y = np.array([3.44, 2.10, 1.42, 1.12, 0.50]) # Eg DFT (eV)

# Add bias column (column of 1s)
N = X_raw.shape[0]
X = np.hstack([np.ones((N, 1)), X_raw])

# Normal Equation: θ* = (XᵀX)⁻¹ Xᵀy
theta = np.linalg.inv(X.T @ X) @ X.T @ y
print(f"θ₀={theta[0]:.3f}, θ₁={theta[1]:.3f}, θ₂={theta[2]:.3f}")

# Predict and evaluate
y_pred = X @ theta
mae = np.mean(np.abs(y - y_pred))
ss_res = np.sum((y - y_pred)**2)
ss_tot = np.sum((y - np.mean(y))**2)
r2 = 1 - ss_res / ss_tot
print(f"R²={r2:.3f}, MAE={mae:.3f} eV")

Using scikit-learn (production workflow)

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score, mean_absolute_error
import numpy as np

# Data (using full 12-compound dataset)
X = np.array([
    [3.25,1.79],[5.41,0.93],[5.67,0.64],[6.10,0.15],
    [5.65,0.40],[5.45,0.23],[3.19,1.23],[5.87,0.11],
    [6.06,0.17],[5.43,0.00],[5.66,0.00],[3.11,1.43]
])
y = np.array([3.44,2.10,1.29,0.74,1.42,1.57,
              3.51,0.50,0.18,1.12,0.00,4.20])

# Fit standard linear regression
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)
print(f"R²={r2_score(y, y_pred):.3f}")
print(f"MAE={mean_absolute_error(y, y_pred):.3f} eV")
print(f"Coefficients: a={model.coef_[0]:.3f}, ΔEN={model.coef_[1]:.3f}")

# Cross-validation (leave-one-out for small datasets)
cv_scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(f"CV R² = {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

10. Limitations — When Linear Regression Is Not Enough

LimitationWhy it matters in materials scienceSolution
Linearity assumption Band gap vs. lattice constant is not truly linear across all families Polynomial features, kernel methods, neural networks
Feature engineering required You must hand-craft descriptors; cannot learn from raw crystal structure Graph neural networks (CGCNN, ALIGNN) learn features automatically
No uncertainty estimate Cannot report confidence intervals for predictions Bayesian linear regression, Gaussian process regression
Extrapolation risk Predicting a material far outside the training distribution is unreliable Always check applicability domain; flag predictions outside training range
Interactive Companion App — Linear Regression Explorer
Fit the model live, adjust features, visualise the parity plot, and watch gradient descent converge — all with real DFT data.
Open App →

Quick Check

1. In linear regression, what does the weight θ₂ = +1.29 for ΔEN tell you physically?

  • A. Increasing ΔEN decreases the band gap
  • B. Each unit increase in ΔEN raises the predicted band gap by 1.29 eV
  • C. The model has an R² of 1.29
  • D. There are 1.29 training examples per feature

2. You have 8 features and 9 training examples. Which regularisation technique should you prefer?

  • A. No regularisation — the Normal Equation is exact
  • B. Gradient descent with a large learning rate
  • C. Lasso or Ridge — to prevent overfitting with so few examples
  • D. Add more polynomial features first

3. Your linear regression gives training R² = 0.98 but test R² = 0.41. This is most likely:

  • A. Underfitting — model too simple
  • B. Overfitting — model memorised training noise
  • C. A correct result — always expect this gap
  • D. A data normalisation error
Foundations Linear Regression Band Gap Supervised Learning scikit-learn DFT