Post 7: Decision Trees and Random Forests — Handling Non-linear Boundaries

Module 2 · Core Algorithms

Go beyond straight lines: learn how decision trees split feature space into rectangles, why single trees overfit, and how random forests fix this through bagging and feature randomness — applied to transition-metal compounds.

Abderrahmane REGGAD · June 15, 2026

Module 2

🌳

Algorithm

Decision Tree & Random Forest

🎯

Task

Classification & Regression

🔬

Materials Target

Metal / Semiconductor / Insulator

📊

Key Concept

Ensemble learning, bagging, feature importance

Logistic regression (Post 6) draws a single straight line to separate metals from insulators. But nature does not respect straight lines — the metal–insulator boundary in transition-metal compounds is shaped by competing mechanisms (bandwidth, Hubbard U, charge transfer, crystal field) that create highly non-linear, irregular boundaries in feature space. Decision trees solve this by asking a sequence of yes/no questions about the features, carving feature space into axis-aligned rectangles. Random forests then aggregate hundreds of these trees to achieve low variance without sacrificing flexibility.

🌳

What we will build

A decision tree and a random forest to classify 12 transition-metal compounds as Metal, Semiconductor, or Insulator, using ΔEN, n_d, and E_g as features. We will visualise the tree structure, plot feature importances, and compare single-tree vs. ensemble accuracy.

1. Why Logistic Regression Is Not Enough

Consider FeS (metal) and NiO (Mott insulator): both have similar d-electron counts but radically different electronic characters driven by the on-site Coulomb repulsion U. A straight-line boundary in (ΔEN, n_d) space cannot separate them correctly without also misclassifying other compounds. What we need is a model that can ask conditional questions: "If ΔEN > 1.2 AND n_d ≤ 3, predict Metal." That is exactly what a decision tree does.

Property	Logistic Regression	Decision Tree	Random Forest
Decision boundary	Linear hyperplane only	Axis-aligned rectangles	Smooth ensemble of rectangles
Interpretability	Coefficients (θ)	Visual tree + rules	Feature importances
Overfitting risk	Low	High (deep trees)	Low (bagging)
Handles non-linearity	No	Yes	Yes
Feature scaling needed	Yes	No	No

2. How a Decision Tree Works — Recursive Partitioning

A decision tree is built top-down by recursive binary splitting. At each node, the algorithm finds the feature and threshold that best separates the classes. "Best" is measured by an impurity criterion:

🌿 Gini impurity (used by scikit-learn)

Gini(node) = 1 − Σ_k p_k²

p_k = fraction of samples in class k at this node
Gini = 0 → perfectly pure node (all one class)
Gini = 0.5 → maximally impure (50/50 split, binary)

🔀 Information gain (alternative: entropy)

Entropy(node) = −Σ_k p_k · log₂(p_k)

Information Gain = Entropy(parent) − weighted average Entropy(children)
Split chosen = argmax over all features and thresholds of Information Gain

Start at the root with all training compounds
For each feature (ΔEN, n_d, E_g) and every possible threshold, compute Gini impurity of the two resulting child nodes
Choose the feature + threshold that minimises the weighted average Gini of the children
Split the data into left (≤ threshold) and right (> threshold) subsets
Repeat recursively on each child until a stopping criterion is met (max depth, min samples, or pure leaf)
Each leaf node predicts the majority class of its training compounds

🔬

Physical analogy

A decision tree is a formalised version of the crystal-chemist's rule book: "If ΔEN > 1.2, the bond is mostly ionic → likely insulator. Else if n_d is 1–4, d-band is partially filled → likely metal." The tree learns these thresholds automatically from data instead of from textbooks.

3. The Overfitting Problem — Why Deep Trees Fail

A tree grown without any depth limit will memorise the training set perfectly — it creates one leaf per training sample and achieves 100% training accuracy. But this tree generalises poorly: it has learned noise, not patterns. The gap between training accuracy and test accuracy is called variance.

⚠️

Bias–variance trade-off for trees

Shallow tree (max_depth = 1–2): high bias, low variance — underfits, misses real patterns.
Deep tree (max_depth = None): low bias, high variance — overfits, memorises noise.
Optimal depth is found by cross-validation on held-out data.

4. Random Forests — Taming Variance with Ensembles

A random forest builds N independent decision trees, each trained on a different random subsample of the training data (bagging = bootstrap aggregating), and at each split considers only a random subset of √p features (feature randomness). The final prediction is the majority vote (classification) or mean (regression) across all trees.

🌲 Random forest prediction

ŷ = majority_vote{ T₁(x), T₂(x), …, T_N(x) }

T_i = i-th decision tree, trained on bootstrap sample B_i
At each split, only m = √p features considered (p = total features)
Out-of-bag (OOB) samples ≈ 37% of data not in B_i → free validation set

💡

Why does this work?

Each tree is a high-variance model, but the trees are decorrelated by feature randomness — they make different errors. Averaging decorrelated high-variance models reduces variance without increasing bias, a result formalised by the bias–variance decomposition of ensemble methods.

5. Feature Importance — What the Forest Learned

Random forests provide a natural measure of feature importance: for each feature, sum the total Gini impurity decrease contributed by all splits on that feature, averaged over all trees. Features used earlier (closer to the root) and more often accumulate higher importance scores.

📊 Mean decrease in Gini impurity (MDI)

Importance(f) = (1/N) · Σ_trees Σ_{nodes using f} Δ Gini(node)

Δ Gini(node) = Gini(parent) − [n_L/n · Gini(left) + n_R/n · Gini(right)]
Scores are normalised to sum to 1.

🧪

Physical interpretation for our dataset

We expect E_g to rank first (it directly encodes the class), followed by ΔEN (ionic character → insulating tendency), then n_d (d-band filling). If ΔEN ranks above E_g, the model has found a useful composition-based predictor — important for predicting new compounds where E_g is not yet known.

6. Our Dataset — Three-Class Classification

We extend the Post 6 binary dataset to three classes: Metal (E_g = 0), Semiconductor (0 < E_g ≤ 2 eV), and Insulator (E_g > 2 eV).

Compound	ΔEN	n_d	E_g (eV)	Class
FeS	0.43	6	0.00	Metal
FeS₂	0.43	6	0.95	Semiconductor
NiO	1.40	8	3.10	Insulator
NiS	0.43	8	0.00	Metal
MnTe	0.61	5	1.40	Semiconductor
MnS	0.93	5	2.30	Insulator
CrO₂	1.54	2	0.00	Metal
Cr₂O₃	1.54	3	2.80	Insulator
CoO	1.40	7	2.40	Insulator
CoS₂	0.43	7	0.00	Metal
TiO	1.54	2	0.00	Metal
TiO₂	1.54	0	3.00	Insulator

7. Python Implementation

Decision Tree

# ── Post 7: Decision Tree ──────────────────────────────

from sklearn.tree import DecisionTreeClassifier, export_text

from sklearn.metrics import classification_report

import numpy as np

# Features: [ΔEN, n_d, Eg]   Labels: 0=Metal, 1=Semiconductor, 2=Insulator

X = np.array([

    [0.43,6,0.00], [0.43,6,0.95], [1.40,8,3.10], [0.43,8,0.00],

    [0.61,5,1.40], [0.93,5,2.30], [1.54,2,0.00], [1.54,3,2.80],

    [1.40,7,2.40], [0.43,7,0.00], [1.54,2,0.00], [1.54,0,3.00]

])

y = np.array([0,1,2,0,1,2,0,2,2,0,0,2])

names = ['Metal', 'Semiconductor', 'Insulator']

# Train with max_depth=3 to avoid overfitting

dt = DecisionTreeClassifier(max_depth=3, random_state=42)

dt.fit(X, y)

# Print human-readable tree

print(export_text(dt, feature_names=['ΔEN','n_d','Eg']))

print(classification_report(y, dt.predict(X), target_names=names))

Random Forest + Feature Importances

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import LeaveOneOut, cross_val_score

rf = RandomForestClassifier(

    n_estimators=500,

    max_features='sqrt',

    random_state=42,

    oob_score=True

)

rf.fit(X, y)

print(f"OOB accuracy : {rf.oob_score_:.2f}")

# Feature importances

for name, imp in zip(['ΔEN','n_d','Eg'], rf.feature_importances_):

    print(f"  {name:<6} {imp:.3f}  {'█' * int(imp*40)}")

# Leave-one-out cross-validation

loo_scores = cross_val_score(rf, X, y, cv=LeaveOneOut())

print(f"LOO accuracy : {loo_scores.mean():.2f}")

8. Reading the Fitted Tree — Physics in the Branches

A typical depth-3 tree fitted on our 12 compounds produces rules like:

🌳 Learned decision rules (depth = 3)

Root: E_g ≤ 0.475 eV ?
  ├─ YES → Metal (FeS, NiS, CrO₂, CoS₂, TiO)
  └─ NO → E_g ≤ 1.875 eV ?
      ├─ YES → Semiconductor (FeS₂, MnTe)
      └─ NO → ΔEN ≤ 1.2 ?
          ├─ YES → Insulator (MnS)
          └─ NO → Insulator (NiO, Cr₂O₃, CoO, TiO₂)

✅

Physical check

The first split on E_g ≤ 0.475 eV is the cleanest possible separator — it directly reflects the definition of metallic behaviour. The second split at 1.875 eV separates small-gap semiconductors from wide-gap insulators. The final split on ΔEN distinguishes ionic from covalent insulators — physically meaningful.

9. Limitations and When to Use What

Concern	Decision Tree	Random Forest
Overfitting	Severe without depth limit	Controlled by bagging
Interpretability	Full tree readable	Feature importance only
Small datasets	OK with pruning	Works but OOB estimate unreliable
Extrapolation	Cannot extrapolate beyond training range	Same limitation
Feature scaling	Not needed	Not needed

⚠️

MDI feature importance can mislead

Mean Decrease in Impurity (MDI) tends to inflate the importance of high-cardinality or continuous features (like E_g) relative to binary or low-range features. For more reliable importances, use permutation importance (sklearn.inspection.permutation_importance) on a held-out set.

Quick Check

1. A decision tree with no depth limit achieves 100% training accuracy. What does this most likely indicate?

A. The model has learned the true physical rules perfectly
B. The tree has overfit — it memorised the training data including noise
C. The dataset has no class overlap
D. The Gini impurity is maximised at every node

2. What is the purpose of using only √p random features at each split in a random forest?

A. To reduce training time by avoiding all features
B. To decorrelate the individual trees so their errors do not all point in the same direction
C. To ensure the forest only uses the most important feature
D. To perform feature selection before training

3. In our dataset, E_g ranks as the top feature by MDI importance. Why might this be physically misleading for predicting new unknown compounds?

A. E_g is always zero for metals
B. E_g is the label itself (or directly encodes it) — using it as a feature leaks target information; for truly unknown compounds E_g is not available
C. The Gini criterion cannot handle continuous features
D. MDI always ranks the last feature highest

Core Algorithms Decision Tree Random Forest Gini Impurity Feature Importance Bagging Non-linear Classification

Header Ads Widget

Last Posts

Post 7: Decision Trees and Random Forests — Handling Non-linear Boundaries

1. Why Logistic Regression Is Not Enough

2. How a Decision Tree Works — Recursive Partitioning

3. The Overfitting Problem — Why Deep Trees Fail

4. Random Forests — Taming Variance with Ensembles

5. Feature Importance — What the Forest Learned

6. Our Dataset — Three-Class Classification

7. Python Implementation

Decision Tree

Random Forest + Feature Importances

8. Reading the Fitted Tree — Physics in the Branches

9. Limitations and When to Use What

Quick Check

About me

My page

Popular Posts

Post 1: What is Artificial Intelligence? A researcher's first look

Post 2: ML vs Traditional Simulation — Where does DFT end and ML begin?

Post 4: Types of ML — Supervised, Unsupervised & Reinforcement Learning

Post 7: Decision Trees and Random Forests — Handling Non-linear Boundaries

Post 3: Key Mathematical Tools — Vectors, Matrices & Probability

Post 6: Classification — is this material metallic or insulating?

Post 8: Support Vector Machines (SVMs) — with a crystal property example

Post 5: Linear Regression — Predicting Band Gap from Simple Features.

Categories

Pageviews past week

You may contact me here

Foundations

Core Algorithms

Magnetic Calculations

Menu Footer Widget