Post 5: Linear Regression — Predicting Band Gap from Simple Features.

Module 2 · Core Algorithms

Build your first ML model from scratch: understand the math, fit a line to DFT data, and interpret every coefficient physically.

Abderrahmane REGGAD · June 14, 2026

📐

Algorithm

Linear Regression

🎯

Task

Supervised Regression

🔬

Materials Target

Band gap E_g (eV)

📊

Features

Lattice constant, ΔEN, Valence

You now know that supervised learning maps inputs to outputs using labelled examples. Linear regression is the simplest possible instance of this idea — and surprisingly powerful for materials science. Before building graph neural networks or random forests, every computational scientist should understand why a line fitted to DFT data can already tell you something physically meaningful.

🎯

What we will predict

The electronic band gap E_g (eV) of binary semiconductors from two simple crystal features: the lattice constant a (Å) and the electronegativity difference ΔEN between cation and anion.

1. The Core Idea — A Function that Maps Structure to Property

In Post 4 we defined supervised learning as learning a function f(x) ≈ y from labelled examples. Linear regression is the specific case where that function is a weighted sum of the input features.

📐 Linear regression model

ŷ = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ

ŷ = predicted property (e.g. band gap in eV)
x₁…xₙ = feature values (lattice constant, ΔEN, valence electrons…)
θ₀ = bias / intercept
θ₁…θₙ = weights (coefficients) — one per feature

For our two-feature band gap model this becomes:

🔬 Band gap model

Ê_g = θ₀ + θ₁ · a + θ₂ · ΔEN

a = lattice constant (Å)
ΔEN = electronegativity difference (Pauling scale)

💡

Physical intuition

We expect θ₁ < 0 (larger lattice → weaker bonding → narrower gap) and θ₂ > 0 (more ionic character → wider gap). If the fitted coefficients violate these signs, the model is learning noise — a key sanity check.

2. How the Model Learns — Minimising the Cost Function

The model learns by adjusting the weights θ until the predictions are as close as possible to the DFT values. We measure this closeness with the Mean Squared Error (MSE).

📉 Cost function (MSE)

J(θ) = (1/N) · Σᵢ (ŷᵢ − yᵢ)²

N = number of training examples
yᵢ = DFT value (ground truth)
ŷᵢ = model prediction
Goal: find θ that minimises J(θ)

For linear regression, there is an exact analytical solution — no iterative training required. This is called the Normal Equation:

🎯 Normal equation (analytical solution)

θ* = (XᵀX)⁻¹ Xᵀy

X = design matrix (N × (n+1), includes a column of 1s for the bias)
y = vector of DFT labels (N × 1)
θ* = optimal weights (no iterations needed!)

⚠️

When the Normal Equation fails

If you have many features (n > ~10,000), computing (XᵀX)⁻¹ becomes computationally expensive. In that case, use iterative methods such as gradient descent instead. For small materials datasets (hundreds of compounds), the Normal Equation is perfect.

3. Gradient Descent — Learning Step by Step

For larger problems, or when you want to understand how neural networks learn, the iterative method gradient descent is fundamental. The idea: compute how much each weight contributes to the error, and nudge it in the direction that reduces the error.

🔄 Gradient descent update rule

θⱼ ← θⱼ − η · (∂J/∂θⱼ)

η (eta) = learning rate — how large a step to take
∂J/∂θⱼ = gradient of the cost w.r.t. weight j

For MSE:
∂J/∂θⱼ = (2/N) · Σᵢ (ŷᵢ − yᵢ) · xᵢⱼ

🎓

The learning rate η matters enormously

Too large: the algorithm overshoots and diverges (J increases). Too small: convergence is painfully slow. Typical starting values: η = 0.01 or 0.001. For materials datasets, try 0.01 and monitor the loss curve.

Initialise all weights θ to zero (or small random values)
Compute predictions ŷ = Xθ for all training examples
Compute the MSE cost J(θ)
Compute gradients ∂J/∂θ for each weight
Update all weights: θ ← θ − η · ∇J
Repeat steps 2–5 until J stops decreasing (convergence)

4. Our Training Dataset — DFT Band Gaps of Binary Semiconductors

We use a small curated dataset of 12 binary compounds. All band gap values are from DFT-PBE calculations (standard functional; note that PBE systematically underestimates E_g, but the trends are correct for learning purposes).

Compound	Type	a (Å)	ΔEN (Pauling)	Valence e⁻	E_g DFT (eV)
ZnO	II-VI oxide	3.25	1.79	12	3.44
ZnS	II-VI	5.41	0.93	12	2.10
ZnSe	II-VI	5.67	0.64	12	1.29
ZnTe	II-VI	6.10	0.15	12	0.74
GaAs	III-V	5.65	0.40	13	1.42
GaP	III-V	5.45	0.23	13	1.57
GaN	III-V nitride	3.19	1.23	13	3.51
InP	III-V	5.87	0.11	13	0.50
InAs	III-V	6.06	0.17	13	0.18
Si	IV-IV	5.43	0.00	14	1.12
Ge	IV-IV	5.66	0.00	14	0.00
AlN	III-V nitride	3.11	1.43	13	4.20

🔬

Why these features?

Lattice constant a encodes the bond length and hence orbital overlap. Electronegativity difference ΔEN encodes the ionicity of the bond. Both are accessible from composition alone — no DFT calculation required. This is what makes linear regression useful: you can screen millions of hypothetical compounds in milliseconds.

5. A Worked Example — Computing θ* by Hand

Using only the first 5 data points (ZnO, ZnS, GaAs, Si, InP) and two features (a, ΔEN), let us apply the Normal Equation step by step.

Step 1 — Build the design matrix X and label vector y

📋 Design matrix X (5×3) and labels y (5×1)

X = [[1, 3.25, 1.79], y = [[3.44],
[1, 5.41, 0.93], [2.10],
[1, 5.65, 0.40], [1.42],
[1, 5.43, 0.00], [1.12],
[1, 5.87, 0.11]] [0.50]]

Column 1: all-ones vector for bias θ₀
Column 2: lattice constant a
Column 3: electronegativity diff ΔEN

Step 2 — Apply the Normal Equation θ* = (XᵀX)⁻¹ Xᵀy

After computing the matrix products (shown fully in the interactive app), the fitted coefficients are approximately:

✅ Fitted coefficients

θ₀ = 7.81 (intercept)
θ₁ = −1.14 (lattice constant coefficient, Å⁻¹)
θ₂ = +1.29 (ΔEN coefficient, eV/Pauling unit)

Prediction: Ê_g = 7.81 − 1.14·a + 1.29·ΔEN

✅

Physical interpretation check

θ₁ = −1.14 means: adding 1 Å to the lattice constant decreases the predicted band gap by 1.14 eV. This is physically correct — larger bond distances weaken orbital overlap and reduce the bonding–antibonding splitting. θ₂ = +1.29 means: more ionic bonds (larger ΔEN) → wider gap. Also physically correct!

6. Evaluating the Model — R², MAE, and the Parity Plot

After fitting, we need to measure how good the model is. Three metrics are standard in materials science regression:

Metric	Formula	Ideal value	What it tells you
R² score	1 − Σ(yᵢ−ŷᵢ)² / Σ(yᵢ−ȳ)²	1.0 (max)	Fraction of variance explained. R² = 0 means the model is no better than predicting the mean.
MAE	(1/N) Σ\|yᵢ − ŷᵢ\|	0 (min)	Average absolute error in the same units as the target (eV). Easy to interpret physically.
RMSE	√(MSE)	0 (min)	Penalises large errors more than MAE. Good for detecting outliers.

For our 5-point example, the two-feature linear model achieves R² ≈ 0.91 and MAE ≈ 0.22 eV. This is a reasonable first result — far from perfect (a neural network on a large dataset might reach MAE < 0.05 eV), but interpretable and fast to train.

⚠️

Always use a parity plot

Plot DFT values (x-axis) vs. predicted values (y-axis). Perfect model = points on the y = x diagonal. Systematic bias appears as a tilted cluster. Outliers (convergence errors, wrong structure) jump out immediately. This is your most important diagnostic tool.

7. Underfitting and Overfitting — the Bias–Variance Trade-off

Every ML model faces a fundamental tension: it must be complex enough to capture real patterns (low bias) but not so complex that it memorises noise (low variance).

Problem	Symptom	Cause	Fix
Underfitting (high bias)	High train error AND high test error	Model too simple (e.g. only 1 feature for a complex property)	Add more features, use a more powerful model
Overfitting (high variance)	Low train error BUT high test error	Too many features relative to number of training examples	Reduce features, add regularisation (Ridge/Lasso), get more data
Good fit	Low train error AND low test error	Right model complexity for the available data	Monitor both curves; use cross-validation

8. Regularisation — Ridge and Lasso for Materials Descriptors

When you have many features (dozens of structural descriptors) but limited DFT data, overfitting is a real risk. Regularisation adds a penalty term to the cost function that discourages large weights.

🔒 Ridge regression (L2 penalty)

J_ridge(θ) = MSE + λ · Σⱼ θⱼ²

λ = regularisation strength (hyperparameter; tune by cross-validation)
Effect: shrinks all weights toward zero, preventing extreme values

✂️ Lasso regression (L1 penalty)

J_lasso(θ) = MSE + λ · Σⱼ |θⱼ|

Effect: drives many weights exactly to zero → automatic feature selection
Useful when you have 50 descriptors but suspect only 5 actually matter

9. Python Implementation — From Scratch and with scikit-learn

From scratch using NumPy

import numpy as np

# Training data: [a (Å), ΔEN]

X_raw = np.array([

    [3.25, 1.79],  # ZnO

    [5.41, 0.93],  # ZnS

    [5.65, 0.40],  # GaAs

    [5.43, 0.00],  # Si

    [5.87, 0.11],  # InP

])

y = np.array([3.44, 2.10, 1.42, 1.12, 0.50])  # Eg DFT (eV)

# Add bias column (column of 1s)

N = X_raw.shape[0]

X = np.hstack([np.ones((N, 1)), X_raw])

# Normal Equation: θ* = (XᵀX)⁻¹ Xᵀy

theta = np.linalg.inv(X.T @ X) @ X.T @ y

print(f"θ₀={theta[0]:.3f}, θ₁={theta[1]:.3f}, θ₂={theta[2]:.3f}")

# Predict and evaluate

y_pred = X @ theta

mae = np.mean(np.abs(y - y_pred))

ss_res = np.sum((y - y_pred)**2)

ss_tot = np.sum((y - np.mean(y))**2)

r2 = 1 - ss_res / ss_tot

print(f"R²={r2:.3f}, MAE={mae:.3f} eV")

Using scikit-learn (production workflow)

from sklearn.linear_model import LinearRegression, Ridge

from sklearn.model_selection import cross_val_score

from sklearn.metrics import r2_score, mean_absolute_error

import numpy as np

# Data (using full 12-compound dataset)

X = np.array([

    [3.25,1.79],[5.41,0.93],[5.67,0.64],[6.10,0.15],

    [5.65,0.40],[5.45,0.23],[3.19,1.23],[5.87,0.11],

    [6.06,0.17],[5.43,0.00],[5.66,0.00],[3.11,1.43]

])

y = np.array([3.44,2.10,1.29,0.74,1.42,1.57,

              3.51,0.50,0.18,1.12,0.00,4.20])

# Fit standard linear regression

model = LinearRegression()

model.fit(X, y)

y_pred = model.predict(X)

print(f"R²={r2_score(y, y_pred):.3f}")

print(f"MAE={mean_absolute_error(y, y_pred):.3f} eV")

print(f"Coefficients: a={model.coef_[0]:.3f}, ΔEN={model.coef_[1]:.3f}")

# Cross-validation (leave-one-out for small datasets)

cv_scores = cross_val_score(model, X, y, cv=5, scoring='r2')

print(f"CV R² = {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

10. Limitations — When Linear Regression Is Not Enough

Limitation	Why it matters in materials science	Solution
Linearity assumption	Band gap vs. lattice constant is not truly linear across all families	Polynomial features, kernel methods, neural networks
Feature engineering required	You must hand-craft descriptors; cannot learn from raw crystal structure	Graph neural networks (CGCNN, ALIGNN) learn features automatically
No uncertainty estimate	Cannot report confidence intervals for predictions	Bayesian linear regression, Gaussian process regression
Extrapolation risk	Predicting a material far outside the training distribution is unreliable	Always check applicability domain; flag predictions outside training range

Quick Check

1. In linear regression, what does the weight θ₂ = +1.29 for ΔEN tell you physically?

A. Increasing ΔEN decreases the band gap
B. Each unit increase in ΔEN raises the predicted band gap by 1.29 eV
C. The model has an R² of 1.29
D. There are 1.29 training examples per feature

2. You have 8 features and 9 training examples. Which regularisation technique should you prefer?

A. No regularisation — the Normal Equation is exact
B. Gradient descent with a large learning rate
C. Lasso or Ridge — to prevent overfitting with so few examples
D. Add more polynomial features first

3. Your linear regression gives training R² = 0.98 but test R² = 0.41. This is most likely:

A. Underfitting — model too simple
B. Overfitting — model memorised training noise
C. A correct result — always expect this gap
D. A data normalisation error

Foundations Linear Regression Band Gap Supervised Learning scikit-learn DFT

Header Ads Widget

Last Posts

Post 5: Linear Regression — Predicting Band Gap from Simple Features.

1. The Core Idea — A Function that Maps Structure to Property

2. How the Model Learns — Minimising the Cost Function

3. Gradient Descent — Learning Step by Step

4. Our Training Dataset — DFT Band Gaps of Binary Semiconductors

5. A Worked Example — Computing θ* by Hand

Step 1 — Build the design matrix X and label vector y

Step 2 — Apply the Normal Equation θ* = (XᵀX)⁻¹ Xᵀy

6. Evaluating the Model — R², MAE, and the Parity Plot

7. Underfitting and Overfitting — the Bias–Variance Trade-off

8. Regularisation — Ridge and Lasso for Materials Descriptors

9. Python Implementation — From Scratch and with scikit-learn

From scratch using NumPy

Using scikit-learn (production workflow)

10. Limitations — When Linear Regression Is Not Enough

Quick Check

About me

My page

Popular Posts

Post 1: What is Artificial Intelligence? A researcher's first look

Post 2: ML vs Traditional Simulation — Where does DFT end and ML begin?

Post 4: Types of ML — Supervised, Unsupervised & Reinforcement Learning

Post 7: Decision Trees and Random Forests — Handling Non-linear Boundaries

Post 3: Key Mathematical Tools — Vectors, Matrices & Probability

Post 6: Classification — is this material metallic or insulating?

Post 8: Support Vector Machines (SVMs) — with a crystal property example

Post 5: Linear Regression — Predicting Band Gap from Simple Features.

Categories

Pageviews past week

You may contact me here

Foundations

Core Algorithms

Magnetic Calculations

Menu Footer Widget