AI & Machine Learning for Materials Sciences

Post 6: Classification — is this material metallic or insulating?

Post 6 — Classification: Is This Material a Metal or Insulator?

Your first binary classifier: logistic regression, the sigmoid function, and the decision boundary — applied to transition-metal compounds.

Module 2

🔀 Algorithm: Logistic Regression
🎯 Task: Supervised Classification
Materials Target: Metal vs. Insulator (Eg = 0 or > 0)
📊 Features: ΔEN, n_d (d-electron count)

In Post 5 we trained a linear regression model to predict the band gap Eg as a continuous value. But materials scientists often need a simpler answer first: is this compound conducting or not? This yes/no question is a classification problem, one of the most common tasks in supervised machine learning.

🎯 What we will classify

Twelve transition-metal compounds as Metal (Eg = 0 eV) or Insulator (Eg > 0 eV) using two crystal-chemistry features: the electronegativity difference ΔEN and the formal d-electron count n_d.

1. From Regression to Classification — A Shift in Output

Linear regression outputs a real number ŷ ∈ ℝ. Classification instead outputs a class label: 0 or 1. The simplest way to go from one to the other is to pass the linear combination of features through a function that squashes any real number into the range between 0 and 1, and to interpret the result as a probability.

| Model | Output | Loss function | Decision rule |
| --- | --- | --- | --- |
| Linear regression | ŷ ∈ ℝ (any real number) | MSE: (ŷ − y)² | n/a |
| Logistic regression | P(metal) ∈ [0, 1] | Binary cross-entropy | If P ≥ 0.5 → Metal; else → Insulator |

🔬 Why not just threshold the regression output?

You could set a threshold on the predicted Eg, but a regression model is trained to minimise the error in Eg everywhere, not to place each compound on the correct side of Eg = 0, and its output carries no notion of class probability. Logistic regression is purpose-built for this: it outputs a probability and is trained directly on the class labels.

2. The Sigmoid Function — Turning Scores into Probabilities

The key ingredient is the sigmoid (logistic) function σ(z), which maps any real-valued score z to a probability between 0 and 1.

🔀 The sigmoid function
σ(z) = 1 / (1 + e^(−z))

z → −∞ : σ(z) → 0 (strongly predicts Insulator)
z = 0 : σ(z) = 0.5 (maximum uncertainty, decision boundary)
z → +∞ : σ(z) → 1 (strongly predicts Metal)
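
The limits listed above can be checked numerically. The following is a minimal sketch using NumPy; the helper name `sigmoid` and the two-branch evaluation (chosen here only to avoid overflow in `exp` for large negative scores) are illustrative assumptions, not code from the companion app.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), evaluated without overflow."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so the textbook form is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use the equivalent form exp(z) / (1 + exp(z)).
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

scores = np.array([-50.0, -2.0, 0.0, 2.0, 50.0])
probs = sigmoid(scores)
for z, p in zip(scores, probs):
    label = "Metal" if p >= 0.5 else "Insulator"
    print(f"z = {z:+6.1f}   sigma(z) = {p:.4f}   -> {label}")
```

Note the symmetry σ(−z) = 1 − σ(z): the scores −2 and +2 give probabilities that sum to 1.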

The full logistic regression model for our two-feature classifier is:

⚡ Logistic regression model
z = θ₀ + θ₁ · ΔEN + θ₂ · n_d
P(Metal | x) = σ(z) = 1 / (1 + e^(−z))

θ₀ = bias (intercept)
θ₁ = weight on electronegativity difference ΔEN
θ₂ = weight on d-electron count n_d
n_d = number of formal d-electrons (0–10)
💡 Physical intuition for the features

A large ΔEN means a more ionic bond, and ionic compounds tend to be insulators (θ₁ < 0 expected). A partially filled d-shell (intermediate n_d) often correlates with metallic behaviour, but Mott insulators challenge this simple picture. This is exactly the kind of complexity a richer model must learn.
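
To make the model concrete, here is a small sketch that evaluates P(Metal | x) for two feature vectors. The function name `p_metal` and the weight values are hypothetical, picked only so that θ₁ < 0 as the intuition above suggests; they are not fitted values.

```python
import math

def p_metal(delta_en, n_d, theta0, theta1, theta2):
    """P(Metal | x) = sigma(theta0 + theta1 * delta_en + theta2 * n_d)."""
    z = theta0 + theta1 * delta_en + theta2 * n_d
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for illustration only (NOT fitted to any data).
theta = (1.0, -3.0, 0.3)

# A low-delta_EN compound with six d-electrons vs. a high-delta_EN d0 compound.
p_covalent = p_metal(0.43, 6, *theta)
p_ionic = p_metal(1.54, 0, *theta)

print(f"P(Metal | dEN=0.43, n_d=6) = {p_covalent:.3f}")
print(f"P(Metal | dEN=1.54, n_d=0) = {p_ionic:.3f}")
```

With a negative θ₁, increasing ΔEN at fixed d-electron count always lowers the predicted metal probability, which is the ionic-bond intuition expressed in code.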

3. The Decision Boundary — Where the Model Is Maximally Uncertain

The model predicts Metal when σ(z) ≥ 0.5, which happens exactly when z ≥ 0. Setting z = 0 gives the equation of the decision boundary:

📐 Decision boundary in feature space
θ₀ + θ₁ · ΔEN + θ₂ · n_d = 0

Rearranging for n_d (assuming θ₂ ≠ 0):
n_d = −(θ₀ + θ₁ · ΔEN) / θ₂

This is a straight line in (ΔEN, n_d) space separating Metal from Insulator predictions.

Logistic regression is therefore a linear classifier: its decision boundary is always a hyperplane (a line in 2D). This is a strength (interpretable, fast) and a limitation (cannot learn curved boundaries without feature engineering).
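
As a quick sanity check on the boundary equation, the sketch below computes the d-electron count on the boundary for a few values of ΔEN and confirms that the score z vanishes there. The function name `boundary_n_d` and the weights are hypothetical, used only for illustration.

```python
def boundary_n_d(delta_en, theta0, theta1, theta2):
    """d-electron count on the decision boundary (z = 0) for a given delta_en."""
    if theta2 == 0:
        # The boundary is then a vertical line and cannot be written as n_d(delta_en).
        raise ValueError("theta2 must be non-zero to solve for n_d")
    return -(theta0 + theta1 * delta_en) / theta2

# Hypothetical weights for illustration only (NOT fitted to any data).
theta0, theta1, theta2 = 1.0, -3.0, 0.3

residuals = []
for delta_en in (0.4, 1.0, 1.5):
    n_d = boundary_n_d(delta_en, theta0, theta1, theta2)
    z = theta0 + theta1 * delta_en + theta2 * n_d  # should vanish on the boundary
    residuals.append(z)
    print(f"dEN = {delta_en:.1f}  ->  boundary n_d = {n_d:.2f}  (z = {z:+.1e})")
```

Because θ₂ > 0 in this example, points above the line (larger d-electron count at the same ΔEN) have z > 0 and are predicted Metal; points below it are predicted Insulator.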

4. How the Model Learns — Binary Cross-Entropy

MSE is a poor choice for classification: composed with the sigmoid, it gives a cost surface that is non-convex in θ, with flat regions where gradient descent can stall. Instead we use binary cross-entropy, which is convex for logistic regression and measures how surprised the model is by the true label.

📉 Binary cross-entropy loss
J(θ) = −(1/N) · Σᵢ [ yᵢ log(ŷᵢ) + (1−yᵢ) log(1−ŷᵢ) ]

yᵢ = true label (1 = Metal, 0 = Insulator)
ŷᵢ = P(Metal | xᵢ) from the sigmoid

When yᵢ = 1: loss = −log(ŷᵢ) → penalises low confidence in Metal
When yᵢ = 0: loss = −log(1−ŷᵢ) → penalises high confidence in Metal
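
The asymmetry of the penalty is easy to see numerically. This is a minimal sketch of the loss above; the function name `binary_cross_entropy` and the clipping constant `eps` (added so that log(0) never occurs) are assumptions of this example.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy J for labels in {0, 1} and predicted P(Metal)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities away from exactly 0 and 1 to keep the logs finite.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    losses = y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob)
    return float(-np.mean(losses))

y = [1, 0]  # one Metal, one Insulator

j_confident_right = binary_cross_entropy(y, [0.9, 0.1])
j_undecided = binary_cross_entropy(y, [0.5, 0.5])
j_confident_wrong = binary_cross_entropy(y, [0.1, 0.9])

print(f"confident and right: J = {j_confident_right:.3f}")
print(f"undecided (0.5):     J = {j_undecided:.3f}")
print(f"confident and wrong: J = {j_confident_wrong:.3f}")
```

An undecided model pays log 2 per example, while a confidently wrong prediction is penalised far more heavily than a confidently right one is rewarded.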
⚠️ No closed-form solution

Unlike linear regression, logistic regression has no Normal Equation. Setting the gradient of the cross-entropy cost to zero gives equations that are non-linear in θ and have no closed-form solution, so we must minimise iteratively with gradient descent or Newton's method. In practice, scikit-learn uses the L-BFGS solver by default, which typically converges in far fewer iterations than vanilla gradient descent.

5. Gradient Descent for Logistic Regression

The gradient of the cross-entropy loss with respect to each weight has a surprisingly clean form — identical in structure to the MSE gradient for linear regression:

🔄 Gradient update rule
∂J/∂θⱼ = (1/N) · Σᵢ (ŷᵢ − yᵢ) · xᵢⱼ

θⱼ ← θⱼ − η · ∂J/∂θⱼ

η = learning rate (typical: 0.01–0.5 for standardised features)
Repeat until convergence (J stops decreasing).
  1. Standardise features: subtract mean, divide by standard deviation
  2. Initialise θ = [0, 0, 0]
  3. Compute z = Xθ and ŷ = σ(z) for all training examples
  4. Compute cross-entropy loss J(θ)
  5. Compute gradient ∇J and update all weights θ ← θ − η · ∇J
  6. Repeat steps 3–5 until J converges (typically 1,000–3,000 iterations)
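
Before trusting a hand-written training loop, it can help to verify the gradient formula with a finite-difference check. The sketch below does this on a small synthetic design matrix; the data, the test point `theta`, and the helper names are all made up for this example and are unrelated to the compounds in this post.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(theta, X, y):
    """Binary cross-entropy J(theta)."""
    p = sigmoid(X @ theta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(theta, X, y):
    """Analytic gradient: (1/N) * X^T (y_hat - y)."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

def numerical_gradient(theta, X, y, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = h
        g[j] = (loss(theta + step, X, y) - loss(theta - step, X, y)) / (2 * h)
    return g

# Synthetic data: a bias column plus two random features (illustration only).
rng = np.random.default_rng(0)
X = np.hstack([np.ones((8, 1)), rng.normal(size=(8, 2))])
y = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=float)
theta = np.array([0.2, -0.5, 0.3])

max_diff = np.max(np.abs(gradient(theta, X, y) - numerical_gradient(theta, X, y)))
print(f"largest analytic vs numerical gradient difference: {max_diff:.2e}")
```

If the two gradients disagree by more than rounding error, the bug is usually a missing 1/N factor or a sign flip in (ŷ − y).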

6. Our Training Dataset — Transition-Metal Compounds

We use 12 transition-metal chalcogenides and oxides whose metallic or insulating character is well established experimentally. The two features (ΔEN, n_d) are computed from standard atomic tables; no DFT is required.

| Compound | ΔEN (Pauling) | n_d | Class | Physical note |
| --- | --- | --- | --- | --- |
| FeS | 0.43 | 6 | Metal | Pyrrhotite-type conductor |
| FeS₂ | 0.43 | 6 | Insulator | Pyrite, band gap ~0.95 eV |
| NiO | 1.40 | 8 | Insulator | Classic Mott insulator |
| NiS | 0.43 | 8 | Metal | Millerite-type metal |
| MnTe | 0.61 | 5 | Insulator | Antiferromagnetic semiconductor |
| MnS | 0.93 | 5 | Insulator | Alabandite, ~3 eV gap |
| CrO₂ | 1.54 | 2 | Metal | Half-metal, fully spin-polarised |
| Cr₂O₃ | 1.54 | 3 | Insulator | Chromia, ~3.4 eV gap |
| CoO | 1.40 | 7 | Insulator | Mott-Hubbard insulator |
| CoS₂ | 0.43 | 7 | Metal | Itinerant ferromagnet |
| TiO | 1.54 | 2 | Metal | NaCl-type metallic oxide |
| TiO₂ | 1.54 | 0 | Insulator | Rutile, ~3.0 eV gap |

🧪 Why do FeS and FeS₂ share the same ΔEN and n_d?

This is the deep challenge of this dataset: FeS (metal) and FeS₂ (insulator) have identical formal features. FeS₂ forms S₂²⁻ dimers that open a gap through molecular-orbital effects, physics our two scalar features cannot encode. Because the two points coincide in feature space, any model built on these features gives them the same prediction, so at least one of them is always misclassified. This illustrates a hard limit of feature-engineered classifiers.
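
Feature collisions of this kind can be detected before any model is trained. The sketch below groups the twelve compounds by their feature pair, reports groups with conflicting labels, and derives an upper bound on the accuracy any classifier using only these two features could reach. The variable names are illustrative; the feature values are those of this post's dataset.

```python
from collections import defaultdict

# name: (delta_EN, n_d, label) with 1 = Metal, 0 = Insulator
data = {
    "FeS":   (0.43, 6, 1), "FeS2":  (0.43, 6, 0), "NiO":  (1.40, 8, 0),
    "NiS":   (0.43, 8, 1), "MnTe":  (0.61, 5, 0), "MnS":  (0.93, 5, 0),
    "CrO2":  (1.54, 2, 1), "Cr2O3": (1.54, 3, 0), "CoO":  (1.40, 7, 0),
    "CoS2":  (0.43, 7, 1), "TiO":   (1.54, 2, 1), "TiO2": (1.54, 0, 0),
}

# Group compounds that share exactly the same feature vector.
groups = defaultdict(list)
for name, (delta_en, n_d, label) in data.items():
    groups[(delta_en, n_d)].append((name, label))

# A group is a conflict if it contains both classes.
conflicts = {k: v for k, v in groups.items() if len({lab for _, lab in v}) > 1}

# Best case per group: predict the majority label of that group.
best_correct = sum(
    max(sum(1 for _, lab in members if lab == c) for c in (0, 1))
    for members in groups.values()
)
upper_bound = best_correct / len(data)

print("conflicting groups:", conflicts)
print(f"accuracy upper bound for ANY classifier: {best_correct}/{len(data)}")
```

CrO₂ and TiO also share a feature pair, but both are metals, so they do not conflict. The bound holds for any model, linear or not; a linear boundary may fall short of it.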

7. A Worked Example — Reading the Auto-Fit Result

After gradient descent converges on our 12-compound dataset (η = 0.5, standardised features, 3000 iterations), the fitted weights in raw (un-standardised) feature space are approximately:

✅ Fitted coefficients (gradient descent)
θ₀ = +1.92 (bias)
θ₁ = −2.81 (ΔEN weight)
θ₂ = +0.47 (n_d weight)

P(Metal) = σ(1.92 − 2.81·ΔEN + 0.47·n_d)
Physical interpretation check

θ₁ = −2.81 is consistent with the expectation that more ionic bonds (larger ΔEN) strongly reduce the probability of metallic character, since ionic compounds are typically insulators. θ₂ = +0.47 means a higher d-electron count mildly increases the metal probability, consistent with broader d-bands in late transition-metal sulfides.
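
Another way to read these coefficients is through odds. Since z is the log-odds of Metal, changing a feature by Δ multiplies the odds P/(1 − P) by exp(θ · Δ). The short sketch below applies this to the two weights quoted above; the step sizes (0.1 in ΔEN, one d-electron) are arbitrary choices for illustration.

```python
import math

# Coefficients quoted above, in raw (un-standardised) feature space.
theta1, theta2 = -2.81, 0.47

# Multiplicative change in the odds of Metal for a given feature change.
odds_factor_den = math.exp(theta1 * 0.1)  # delta_EN increased by 0.1
odds_factor_nd = math.exp(theta2 * 1)     # one additional d-electron

print(f"dEN + 0.1 -> odds of Metal multiplied by {odds_factor_den:.3f}")
print(f"n_d + 1   -> odds of Metal multiplied by {odds_factor_nd:.3f}")
```

A factor below 1 shrinks the odds of Metal and a factor above 1 grows them, independent of where in feature space the compound sits.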

8. Evaluating a Classifier — Confusion Matrix and Metrics

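Classification models are not evaluated with regression metrics such as R² or MAE. Instead we use the confusion matrix, a 2×2 count of correct and incorrect predictions, and four derived metrics. With Metal as the positive class, TP and TN count correctly predicted metals and insulators, FP counts insulators predicted as Metal, and FN counts metals predicted as Insulator.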

| Metric | Formula | Ideal | What it tells you |
| --- | --- | --- | --- |
| Accuracy | (TP + TN) / N | 1.0 | Fraction of all compounds classified correctly. Misleading on imbalanced datasets. |
| Precision | TP / (TP + FP) | 1.0 | Of all compounds predicted Metal, what fraction truly are? Penalises false alarms. |
| Recall | TP / (TP + FN) | 1.0 | Of all true Metals, what fraction did we detect? Penalises missed metals. |
| F1 score | 2·Prec·Rec / (Prec + Rec) | 1.0 | Harmonic mean of precision and recall. A useful single metric for imbalanced classes. |

⚠️
Accuracy can lie

If 90% of your dataset is Insulator, a model that always predicts Insulator achieves 90% accuracy — but has zero ability to find metals. Always inspect both precision and recall, or use the F1 score.
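
The scenario above can be reproduced in a few lines. This sketch computes the four metrics from scratch for a ten-compound toy set (nine insulators, one metal) and a model that always predicts Insulator. The function name and the convention of returning 0 when a denominator is zero are assumptions of this example.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Convention used here: an undefined ratio (0/0) is reported as 0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

y_true = [1] + [0] * 9   # 1 = Metal, 0 = Insulator
y_pred = [0] * 10        # a model that always answers "Insulator"

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name:9s} = {value:.2f}")
```

Accuracy looks respectable while recall and F1 for the Metal class are zero, which is the failure the warning describes.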

9. Interactive Companion — App 6 Classification Explorer

The companion app lets you adjust the three parameters θ₀, θ₁, θ₂ manually and watch the decision boundary, sigmoid output, and confusion matrix update in real time. Click Auto-fit with gradient descent to run the full training loop. Notice that even after Auto-fit, accuracy never reaches 12/12: FeS and FeS₂ share identical (ΔEN, n_d) coordinates, so no boundary, linear or otherwise, can classify both correctly.

10. Python Implementation — scikit-learn Workflow

From scratch — gradient descent

import numpy as np

# Data: [ΔEN, n_d], labels (1=Metal, 0=Insulator)
X_raw = np.array([
    [0.43,6],[0.43,6],[1.40,8],[0.43,8],[0.61,5],[0.93,5],
    [1.54,2],[1.54,3],[1.40,7],[0.43,7],[1.54,2],[1.54,0]
])
y = np.array([1,0,0,1,0,0,1,0,0,1,1,0]) # 1=Metal

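# Standardise each feature (zero mean, unit variance) and prepend a bias column
mx, sx = X_raw.mean(axis=0), X_raw.std(axis=0)
X = np.hstack([np.ones((len(X_raw), 1)), (X_raw - mx) / sx])

def sigmoid(z):
    """Logistic function: maps any real score to a probability in (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Batch gradient descent on the binary cross-entropy loss
theta = np.zeros(X.shape[1])
eta, iters = 0.5, 3000
for _ in range(iters):
    yhat = sigmoid(X @ theta)           # P(Metal) for every compound
    grad = X.T @ (yhat - y) / len(y)    # gradient of J(θ)
    theta -= eta * grad

# Weights in standardised feature space
print(f"θ (standardised) = {theta.round(3)}")

# Convert to raw feature space, where z = θ₀ + θ₁·ΔEN + θ₂·n_d
w_raw = theta[1:] / sx
b_raw = theta[0] - np.sum(theta[1:] * mx / sx)
print(f"θ (raw) = {np.concatenate([[b_raw], w_raw]).round(3)}")

preds = (sigmoid(X @ theta) >= 0.5).astype(int)
print(f"Accuracy: {(preds == y).mean():.2f}")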

Using scikit-learn (recommended)

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

X = np.array([
    [0.43,6],[0.43,6],[1.40,8],[0.43,8],[0.61,5],[0.93,5],
    [1.54,2],[1.54,3],[1.40,7],[0.43,7],[1.54,2],[1.54,0]
])
y = np.array([1,0,0,1,0,0,1,0,0,1,1,0])

# Standardise features
scaler = StandardScaler()
X_sc = scaler.fit_transform(X)

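# Fit logistic regression. scikit-learn applies L2 regularisation by default
# (C=1.0), so the weights differ from the unregularised gradient-descent fit above.
clf = LogisticRegression(max_iter=1000, random_state=42)
clf.fit(X_sc, y)

# Evaluate on the training set itself
y_pred = clf.predict(X_sc)
print(confusion_matrix(y, y_pred))
print(classification_report(y, y_pred,
      target_names=['Insulator','Metal']))

# Decision boundary, expressed in standardised feature units
b, (w_en, w_nd) = clf.intercept_[0], clf.coef_[0]
print(f"Boundary (standardised units): n_d = {-b / w_nd:.2f}"
      f" + ({-w_en / w_nd:.2f})·ΔEN")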

11. Limitations — When Logistic Regression Is Not Enough

| Limitation | Why it matters in materials science | Solution |
| --- | --- | --- |
| Linear boundary only | Metal–insulator boundary in real materials is highly non-linear (Mott physics, topology) | Kernel SVM, decision trees, neural networks |
| Feature overlap | FeS and FeS₂ have identical ΔEN and n_d but different classes | Add structure-sensitive features: crystal field splitting, bond angle, coordination |
| Small dataset | 12 examples is far too few for reliable generalisation | Augment with ICSD/Materials Project data; use cross-validation |
| No uncertainty | Cannot report confidence interval on the probability output | Platt scaling, Bayesian logistic regression, conformal prediction |

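
With so few compounds, leave-one-out cross-validation is one simple way to get a less optimistic estimate than training accuracy. The sketch below reuses the dataset and the plain gradient-descent fit from this post; the helper `fit_logistic` and the choice to refit the standardisation on each training fold are assumptions of this example, not a description of the companion app.

```python
import numpy as np

X_raw = np.array([
    [0.43, 6], [0.43, 6], [1.40, 8], [0.43, 8], [0.61, 5], [0.93, 5],
    [1.54, 2], [1.54, 3], [1.40, 7], [0.43, 7], [1.54, 2], [1.54, 0],
], dtype=float)
y = np.array([1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0], dtype=float)  # 1 = Metal

def fit_logistic(X, y, eta=0.5, iters=3000):
    """Unregularised logistic regression by batch gradient descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))
        theta -= eta * X.T @ (p - y) / len(y)
    return theta

hits = []
for i in range(len(y)):
    train = np.arange(len(y)) != i  # hold out compound i
    # Fit the scaler on the training fold only, to avoid leaking the test point.
    mu, sd = X_raw[train].mean(axis=0), X_raw[train].std(axis=0)
    X_train = np.hstack([np.ones((train.sum(), 1)), (X_raw[train] - mu) / sd])
    x_test = np.concatenate([[1.0], (X_raw[i] - mu) / sd])
    theta = fit_logistic(X_train, y[train])
    pred = 1.0 if x_test @ theta >= 0 else 0.0  # z >= 0 means P(Metal) >= 0.5
    hits.append(pred == y[i])

loo_accuracy = float(np.mean(hits))
print(f"Leave-one-out accuracy: {loo_accuracy:.2f} ({sum(hits)}/{len(hits)})")
```

Each compound is predicted by a model that never saw it, so the score reflects generalisation rather than memorisation, although twelve folds still give a very noisy estimate.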

Quick Check

1. In logistic regression, what does σ(z) = 0.5 tell you about the model's prediction?

  • A. The compound is 50% metal by weight
  • B. The model is maximally uncertain — this is exactly the decision boundary
  • C. The loss function is minimised at this point
  • D. Half the features are positive and half are negative

2. FeS and FeS₂ can never both be classified correctly by our model. The most likely reason is:

  • A. The learning rate η is too large
  • B. Gradient descent did not converge
  • C. Both compounds have identical features in our two-dimensional feature space
  • D. The sigmoid function saturates for these inputs

3. You have 5 metals and 45 insulators in your dataset. Your model predicts all compounds as Insulator. What is its accuracy and F1 score for the Metal class?

  • A. Accuracy = 0.50, F1 = 0.50
  • B. Accuracy = 0.05, F1 = 0.00
  • C. Accuracy = 0.90, F1 = 0.00
  • D. Accuracy = 0.90, F1 = 0.50