Go beyond straight lines: learn how decision trees split feature space into rectangles, why single trees overfit, and how random forests fix this through bagging and feature randomness — applied to transition-metal compounds.
Decision Tree & Random Forest
Classification & Regression
Metal / Semiconductor / Insulator
Ensemble learning, bagging, feature importance
Logistic regression (Post 6) draws a single straight line to separate metals from insulators. But nature does not respect straight lines — the metal–insulator boundary in transition-metal compounds is shaped by competing mechanisms (bandwidth, Hubbard U, charge transfer, crystal field) that create highly non-linear, irregular boundaries in feature space. Decision trees solve this by asking a sequence of yes/no questions about the features, carving feature space into axis-aligned rectangles. Random forests then aggregate hundreds of these trees to achieve low variance without sacrificing flexibility.
A decision tree and a random forest to classify 12 transition-metal compounds as Metal, Semiconductor, or Insulator, using ΔEN, nd, and Eg as features. We will visualise the tree structure, plot feature importances, and compare single-tree vs. ensemble accuracy.
1. Why Logistic Regression Is Not Enough
Consider FeS (metal) and NiO (Mott insulator): both have similar d-electron counts but radically different electronic characters driven by the on-site Coulomb repulsion U. A straight-line boundary in (ΔEN, nd) space cannot separate them correctly without also misclassifying other compounds. What we need is a model that can ask conditional questions: "If ΔEN > 1.2 AND nd ≤ 3, predict Metal." That is exactly what a decision tree does.
| Property | Logistic Regression | Decision Tree | Random Forest |
|---|---|---|---|
| Decision boundary | Linear hyperplane only | Axis-aligned rectangles | Smooth ensemble of rectangles |
| Interpretability | Coefficients (θ) | Visual tree + rules | Feature importances |
| Overfitting risk | Low | High (deep trees) | Low (bagging) |
| Handles non-linearity | No | Yes | Yes |
| Feature scaling needed | Yes | No | No |
2. How a Decision Tree Works — Recursive Partitioning
A decision tree is built top-down by recursive binary splitting. At each node, the algorithm finds the feature and threshold that best separates the classes. "Best" is measured by an impurity criterion:
pk = fraction of samples in class k at this node
Gini = 0 → perfectly pure node (all one class)
Gini = 0.5 → maximally impure (50/50 split, binary)
Information Gain = Entropy(parent) − weighted average Entropy(children)
Split chosen = argmax over all features and thresholds of Information Gain
- Start at the root with all training compounds
- For each feature (ΔEN, nd, Eg) and every possible threshold, compute Gini impurity of the two resulting child nodes
- Choose the feature + threshold that minimises the weighted average Gini of the children
- Split the data into left (≤ threshold) and right (> threshold) subsets
- Repeat recursively on each child until a stopping criterion is met (max depth, min samples, or pure leaf)
- Each leaf node predicts the majority class of its training compounds
A decision tree is a formalised version of the crystal-chemist's rule book: "If ΔEN > 1.2, the bond is mostly ionic → likely insulator. Else if nd is 1–4, d-band is partially filled → likely metal." The tree learns these thresholds automatically from data instead of from textbooks.
3. The Overfitting Problem — Why Deep Trees Fail
A tree grown without any depth limit will memorise the training set perfectly — it creates one leaf per training sample and achieves 100% training accuracy. But this tree generalises poorly: it has learned noise, not patterns. The gap between training accuracy and test accuracy is called variance.
Shallow tree (max_depth = 1–2): high bias, low variance — underfits, misses real patterns.
Deep tree (max_depth = None): low bias, high variance — overfits, memorises noise.
Optimal depth is found by cross-validation on held-out data.
4. Random Forests — Taming Variance with Ensembles
A random forest builds N independent decision trees, each trained on a different random subsample of the training data (bagging = bootstrap aggregating), and at each split considers only a random subset of √p features (feature randomness). The final prediction is the majority vote (classification) or mean (regression) across all trees.
Ti = i-th decision tree, trained on bootstrap sample Bi
At each split, only m = √p features considered (p = total features)
Out-of-bag (OOB) samples ≈ 37% of data not in Bi → free validation set
Each tree is a high-variance model, but the trees are decorrelated by feature randomness — they make different errors. Averaging decorrelated high-variance models reduces variance without increasing bias, a result formalised by the bias–variance decomposition of ensemble methods.
5. Feature Importance — What the Forest Learned
Random forests provide a natural measure of feature importance: for each feature, sum the total Gini impurity decrease contributed by all splits on that feature, averaged over all trees. Features used earlier (closer to the root) and more often accumulate higher importance scores.
Δ Gini(node) = Gini(parent) − [nL/n · Gini(left) + nR/n · Gini(right)]
Scores are normalised to sum to 1.
We expect Eg to rank first (it directly encodes the class), followed by ΔEN (ionic character → insulating tendency), then nd (d-band filling). If ΔEN ranks above Eg, the model has found a useful composition-based predictor — important for predicting new compounds where Eg is not yet known.
6. Our Dataset — Three-Class Classification
We extend the Post 6 binary dataset to three classes: Metal (Eg = 0), Semiconductor (0 < Eg ≤ 2 eV), and Insulator (Eg > 2 eV).
| Compound | ΔEN | nd | Eg (eV) | Class |
|---|---|---|---|---|
| FeS | 0.43 | 6 | 0.00 | Metal |
| FeS₂ | 0.43 | 6 | 0.95 | Semiconductor |
| NiO | 1.40 | 8 | 3.10 | Insulator |
| NiS | 0.43 | 8 | 0.00 | Metal |
| MnTe | 0.61 | 5 | 1.40 | Semiconductor |
| MnS | 0.93 | 5 | 2.30 | Insulator |
| CrO₂ | 1.54 | 2 | 0.00 | Metal |
| Cr₂O₃ | 1.54 | 3 | 2.80 | Insulator |
| CoO | 1.40 | 7 | 2.40 | Insulator |
| CoS₂ | 0.43 | 7 | 0.00 | Metal |
| TiO | 1.54 | 2 | 0.00 | Metal |
| TiO₂ | 1.54 | 0 | 3.00 | Insulator |
7. Python Implementation
Decision Tree
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import classification_report
import numpy as np
# Features: [ΔEN, n_d, Eg] Labels: 0=Metal, 1=Semiconductor, 2=Insulator
X = np.array([
[0.43,6,0.00], [0.43,6,0.95], [1.40,8,3.10], [0.43,8,0.00],
[0.61,5,1.40], [0.93,5,2.30], [1.54,2,0.00], [1.54,3,2.80],
[1.40,7,2.40], [0.43,7,0.00], [1.54,2,0.00], [1.54,0,3.00]
])
y = np.array([0,1,2,0,1,2,0,2,2,0,0,2])
names = ['Metal', 'Semiconductor', 'Insulator']
# Train with max_depth=3 to avoid overfitting
dt = DecisionTreeClassifier(max_depth=3, random_state=42)
dt.fit(X, y)
# Print human-readable tree
print(export_text(dt, feature_names=['ΔEN','n_d','Eg']))
print(classification_report(y, dt.predict(X), target_names=names))
Random Forest + Feature Importances
from sklearn.model_selection import LeaveOneOut, cross_val_score
rf = RandomForestClassifier(
n_estimators=500,
max_features='sqrt',
random_state=42,
oob_score=True
)
rf.fit(X, y)
print(f"OOB accuracy : {rf.oob_score_:.2f}")
# Feature importances
for name, imp in zip(['ΔEN','n_d','Eg'], rf.feature_importances_):
print(f" {name:<6} {imp:.3f} {'█' * int(imp*40)}")
# Leave-one-out cross-validation
loo_scores = cross_val_score(rf, X, y, cv=LeaveOneOut())
print(f"LOO accuracy : {loo_scores.mean():.2f}")
8. Reading the Fitted Tree — Physics in the Branches
A typical depth-3 tree fitted on our 12 compounds produces rules like:
├─ YES → Metal (FeS, NiS, CrO₂, CoS₂, TiO)
└─ NO → Eg ≤ 1.875 eV ?
├─ YES → Semiconductor (FeS₂, MnTe)
└─ NO → ΔEN ≤ 1.2 ?
├─ YES → Insulator (MnS)
└─ NO → Insulator (NiO, Cr₂O₃, CoO, TiO₂)
The first split on Eg ≤ 0.475 eV is the cleanest possible separator — it directly reflects the definition of metallic behaviour. The second split at 1.875 eV separates small-gap semiconductors from wide-gap insulators. The final split on ΔEN distinguishes ionic from covalent insulators — physically meaningful.
9. Limitations and When to Use What
| Concern | Decision Tree | Random Forest |
|---|---|---|
| Overfitting | Severe without depth limit | Controlled by bagging |
| Interpretability | Full tree readable | Feature importance only |
| Small datasets | OK with pruning | Works but OOB estimate unreliable |
| Extrapolation | Cannot extrapolate beyond training range | Same limitation |
| Feature scaling | Not needed | Not needed |
Mean Decrease in Impurity (MDI) tends to inflate the importance of high-cardinality
or continuous features (like Eg) relative to binary or low-range features.
For more reliable importances, use permutation importance
(sklearn.inspection.permutation_importance) on a held-out set.
Quick Check
1. A decision tree with no depth limit achieves 100% training accuracy. What does this most likely indicate?
2. What is the purpose of using only √p random features at each split in a random forest?
3. In our dataset, Eg ranks as the top feature by MDI importance. Why might this be physically misleading for predicting new unknown compounds?