Header Ads Widget

AI & Machine Learning for Materials Sciences

Last Posts

10/recent/ticker-posts

Post 7: Decision Trees and Random Forests — Handling Non-linear Boundaries

Go beyond straight lines: learn how decision trees split feature space into rectangles, why single trees overfit, and how random forests fix this through bagging and feature randomness — applied to transition-metal compounds.

Module 2
🌳
Algorithm

Decision Tree & Random Forest

🎯
Task

Classification & Regression

🔬
Materials Target

Metal / Semiconductor / Insulator

📊
Key Concept

Ensemble learning, bagging, feature importance

Logistic regression (Post 6) draws a single straight line to separate metals from insulators. But nature does not respect straight lines — the metal–insulator boundary in transition-metal compounds is shaped by competing mechanisms (bandwidth, Hubbard U, charge transfer, crystal field) that create highly non-linear, irregular boundaries in feature space. Decision trees solve this by asking a sequence of yes/no questions about the features, carving feature space into axis-aligned rectangles. Random forests then aggregate hundreds of these trees to achieve low variance without sacrificing flexibility.

🌳
What we will build

A decision tree and a random forest to classify 12 transition-metal compounds as Metal, Semiconductor, or Insulator, using ΔEN, nd, and Eg as features. We will visualise the tree structure, plot feature importances, and compare single-tree vs. ensemble accuracy.

1. Why Logistic Regression Is Not Enough

Consider FeS (metal) and NiO (Mott insulator): both have similar d-electron counts but radically different electronic characters driven by the on-site Coulomb repulsion U. A straight-line boundary in (ΔEN, nd) space cannot separate them correctly without also misclassifying other compounds. What we need is a model that can ask conditional questions: "If ΔEN > 1.2 AND nd ≤ 3, predict Metal." That is exactly what a decision tree does.

PropertyLogistic RegressionDecision TreeRandom Forest
Decision boundary Linear hyperplane only Axis-aligned rectangles Smooth ensemble of rectangles
Interpretability Coefficients (θ) Visual tree + rules Feature importances
Overfitting risk Low High (deep trees) Low (bagging)
Handles non-linearity No Yes Yes
Feature scaling needed Yes No No

2. How a Decision Tree Works — Recursive Partitioning

A decision tree is built top-down by recursive binary splitting. At each node, the algorithm finds the feature and threshold that best separates the classes. "Best" is measured by an impurity criterion:

🌿 Gini impurity (used by scikit-learn)
Gini(node) = 1 − Σk pk²

pk = fraction of samples in class k at this node
Gini = 0 → perfectly pure node (all one class)
Gini = 0.5 → maximally impure (50/50 split, binary)
🔀 Information gain (alternative: entropy)
Entropy(node) = −Σk pk · log₂(pk)

Information Gain = Entropy(parent) − weighted average Entropy(children)
Split chosen = argmax over all features and thresholds of Information Gain
  • Start at the root with all training compounds
  • For each feature (ΔEN, nd, Eg) and every possible threshold, compute Gini impurity of the two resulting child nodes
  • Choose the feature + threshold that minimises the weighted average Gini of the children
  • Split the data into left (≤ threshold) and right (> threshold) subsets
  • Repeat recursively on each child until a stopping criterion is met (max depth, min samples, or pure leaf)
  • Each leaf node predicts the majority class of its training compounds
🔬
Physical analogy

A decision tree is a formalised version of the crystal-chemist's rule book: "If ΔEN > 1.2, the bond is mostly ionic → likely insulator. Else if nd is 1–4, d-band is partially filled → likely metal." The tree learns these thresholds automatically from data instead of from textbooks.

3. The Overfitting Problem — Why Deep Trees Fail

A tree grown without any depth limit will memorise the training set perfectly — it creates one leaf per training sample and achieves 100% training accuracy. But this tree generalises poorly: it has learned noise, not patterns. The gap between training accuracy and test accuracy is called variance.

⚠️
Bias–variance trade-off for trees

Shallow tree (max_depth = 1–2): high bias, low variance — underfits, misses real patterns.
Deep tree (max_depth = None): low bias, high variance — overfits, memorises noise.
Optimal depth is found by cross-validation on held-out data.

4. Random Forests — Taming Variance with Ensembles

A random forest builds N independent decision trees, each trained on a different random subsample of the training data (bagging = bootstrap aggregating), and at each split considers only a random subset of √p features (feature randomness). The final prediction is the majority vote (classification) or mean (regression) across all trees.

🌲 Random forest prediction
ŷ = majority_vote{ T₁(x), T₂(x), …, TN(x) }

Ti = i-th decision tree, trained on bootstrap sample Bi
At each split, only m = √p features considered (p = total features)
Out-of-bag (OOB) samples ≈ 37% of data not in Bi → free validation set
💡
Why does this work?

Each tree is a high-variance model, but the trees are decorrelated by feature randomness — they make different errors. Averaging decorrelated high-variance models reduces variance without increasing bias, a result formalised by the bias–variance decomposition of ensemble methods.

5. Feature Importance — What the Forest Learned

Random forests provide a natural measure of feature importance: for each feature, sum the total Gini impurity decrease contributed by all splits on that feature, averaged over all trees. Features used earlier (closer to the root) and more often accumulate higher importance scores.

📊 Mean decrease in Gini impurity (MDI)
Importance(f) = (1/N) · Σtrees Σnodes using f Δ Gini(node)

Δ Gini(node) = Gini(parent) − [nL/n · Gini(left) + nR/n · Gini(right)]
Scores are normalised to sum to 1.
🧪
Physical interpretation for our dataset

We expect Eg to rank first (it directly encodes the class), followed by ΔEN (ionic character → insulating tendency), then nd (d-band filling). If ΔEN ranks above Eg, the model has found a useful composition-based predictor — important for predicting new compounds where Eg is not yet known.

6. Our Dataset — Three-Class Classification

We extend the Post 6 binary dataset to three classes: Metal (Eg = 0), Semiconductor (0 < Eg ≤ 2 eV), and Insulator (Eg > 2 eV).

Compound ΔEN nd Eg (eV) Class
FeS 0.4360.00Metal
FeS₂ 0.4360.95Semiconductor
NiO 1.4083.10Insulator
NiS 0.4380.00Metal
MnTe 0.6151.40Semiconductor
MnS 0.9352.30Insulator
CrO₂ 1.5420.00Metal
Cr₂O₃1.5432.80Insulator
CoO 1.4072.40Insulator
CoS₂ 0.4370.00Metal
TiO 1.5420.00Metal
TiO₂ 1.5403.00Insulator

7. Python Implementation

Decision Tree

# ── Post 7: Decision Tree ──────────────────────────────
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import classification_report
import numpy as np

# Features: [ΔEN, n_d, Eg] Labels: 0=Metal, 1=Semiconductor, 2=Insulator
X = np.array([
    [0.43,6,0.00], [0.43,6,0.95], [1.40,8,3.10], [0.43,8,0.00],
    [0.61,5,1.40], [0.93,5,2.30], [1.54,2,0.00], [1.54,3,2.80],
    [1.40,7,2.40], [0.43,7,0.00], [1.54,2,0.00], [1.54,0,3.00]
])
y = np.array([0,1,2,0,1,2,0,2,2,0,0,2])
names = ['Metal', 'Semiconductor', 'Insulator']

# Train with max_depth=3 to avoid overfitting
dt = DecisionTreeClassifier(max_depth=3, random_state=42)
dt.fit(X, y)

# Print human-readable tree
print(export_text(dt, feature_names=['ΔEN','n_d','Eg']))
print(classification_report(y, dt.predict(X), target_names=names))

Random Forest + Feature Importances

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

rf = RandomForestClassifier(
    n_estimators=500,
    max_features='sqrt',
    random_state=42,
    oob_score=True
)
rf.fit(X, y)

print(f"OOB accuracy : {rf.oob_score_:.2f}")

# Feature importances
for name, imp in zip(['ΔEN','n_d','Eg'], rf.feature_importances_):
    print(f" {name:<6} {imp:.3f} {'█' * int(imp*40)}")

# Leave-one-out cross-validation
loo_scores = cross_val_score(rf, X, y, cv=LeaveOneOut())
print(f"LOO accuracy : {loo_scores.mean():.2f}")

8. Reading the Fitted Tree — Physics in the Branches

A typical depth-3 tree fitted on our 12 compounds produces rules like:

🌳 Learned decision rules (depth = 3)
Root: Eg ≤ 0.475 eV ?
  ├─ YES → Metal (FeS, NiS, CrO₂, CoS₂, TiO)
  └─ NO → Eg ≤ 1.875 eV ?
      ├─ YES → Semiconductor (FeS₂, MnTe)
      └─ NO → ΔEN ≤ 1.2 ?
          ├─ YES → Insulator (MnS)
          └─ NO → Insulator (NiO, Cr₂O₃, CoO, TiO₂)
Physical check

The first split on Eg ≤ 0.475 eV is the cleanest possible separator — it directly reflects the definition of metallic behaviour. The second split at 1.875 eV separates small-gap semiconductors from wide-gap insulators. The final split on ΔEN distinguishes ionic from covalent insulators — physically meaningful.

9. Limitations and When to Use What

ConcernDecision TreeRandom Forest
Overfitting Severe without depth limit Controlled by bagging
Interpretability Full tree readable Feature importance only
Small datasets OK with pruning Works but OOB estimate unreliable
Extrapolation Cannot extrapolate beyond training range Same limitation
Feature scaling Not needed Not needed
⚠️
MDI feature importance can mislead

Mean Decrease in Impurity (MDI) tends to inflate the importance of high-cardinality or continuous features (like Eg) relative to binary or low-range features. For more reliable importances, use permutation importance (sklearn.inspection.permutation_importance) on a held-out set.

🌳
App 7 — Decision Tree & Random Forest Explorer
Grow a tree interactively, tune max_depth, watch the decision regions update, and see feature importances recalculate live on our 12-compound dataset.
Open App →

Quick Check

1. A decision tree with no depth limit achieves 100% training accuracy. What does this most likely indicate?

  • A. The model has learned the true physical rules perfectly
  • B. The tree has overfit — it memorised the training data including noise
  • C. The dataset has no class overlap
  • D. The Gini impurity is maximised at every node

2. What is the purpose of using only √p random features at each split in a random forest?

  • A. To reduce training time by avoiding all features
  • B. To decorrelate the individual trees so their errors do not all point in the same direction
  • C. To ensure the forest only uses the most important feature
  • D. To perform feature selection before training

3. In our dataset, Eg ranks as the top feature by MDI importance. Why might this be physically misleading for predicting new unknown compounds?

  • A. Eg is always zero for metals
  • B. Eg is the label itself (or directly encodes it) — using it as a feature leaks target information; for truly unknown compounds Eg is not available
  • C. The Gini criterion cannot handle continuous features
  • D. MDI always ranks the last feature highest
Core Algorithms Decision Tree Random Forest Gini Impurity Feature Importance Bagging Non-linear Classification