The traditional simulation workflow
You already know this workflow deeply. It starts with a crystal structure and ends with a physical property, and every step in between is grounded in quantum mechanics.
This is a physics-first approach. The rules (Schrödinger equation, exchange-correlation functional) are known in advance. No data is needed — only the structure. The calculation is deterministic: the same input always produces the same output.
DFT solves the problem from first principles — no experimental data, no training. It can predict properties of a material that has never been synthesised. This is its greatest strength.
The machine learning workflow
The ML workflow looks very different. Instead of solving equations, it learns patterns from existing data.
The ML model does not solve the Schrödinger equation. It does not know quantum mechanics. It has simply seen thousands of (structure → property) pairs and learned to interpolate. Once trained, prediction takes milliseconds instead of hours.
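An ML model can only predict reliably for materials similar to those in its training set. It cannot extrapolate safely beyond its training distribution. DFT has no training set and therefore no such limitation: it can be applied to any material, with accuracy limited by the chosen exchange-correlation functional.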
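The difference between interpolation and extrapolation can be seen in a deliberately simple sketch. Everything here is an illustrative assumption: `true_property` is a made-up function standing in for a DFT-computed property, the single number `x` stands in for a structural descriptor, and the "model" is a least-squares straight line rather than a real materials ML model.

```python
# Toy illustration: a model fitted on a narrow range interpolates well
# but fails far outside it. The "property" is a made-up function, not DFT data.

def true_property(x):
    """Hypothetical ground truth standing in for a DFT-computed property."""
    return x ** 2

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

# Training set: descriptor values between 0 and 1 only.
train_x = [i / 10 for i in range(11)]
train_y = [true_property(x) for x in train_x]
a, b = fit_line(train_x, train_y)

def predict(x):
    """Prediction of the fitted line at descriptor value x."""
    return a * x + b

error_inside = abs(predict(0.5) - true_property(0.5))    # inside the training range
error_outside = abs(predict(3.0) - true_property(3.0))   # far outside it

print(f"error inside training range:  {error_inside:.3f}")
print(f"error outside training range: {error_outside:.3f}")
```

The fitted line tracks the training data closely but has no way of knowing how the property behaves beyond the range it has seen, which is the same failure mode a materials model shows on an unfamiliar compound class.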
Side-by-side comparison
| Property | DFT / Ab initio | Machine Learning |
|---|---|---|
| Physical basis | ✓ Quantum mechanics | ✗ Statistical patterns |
| Requires training data | ✓ No — only structure | ✗ Yes — thousands of examples |
| Prediction speed | Slow — hours to days | Fast — milliseconds |
| Accuracy | High (within XC error) | Variable — depends on training |
| Extrapolation | ✓ Any material | ✗ Similar to training data only |
| Interpretability | High — physics-based | Low — often a black box |
| Screening 10,000 compounds | ✗ Tens of thousands of compute hours | ✓ Done in seconds |
| New compound (never seen) | ✓ No problem | ⚠️ Risky — uncertain reliability |
The real workflow: DFT feeds ML
In modern computational materials science, DFT and ML are not competitors — they are partners in the same pipeline. DFT generates the reliable, physics-based data that ML needs to train on.
Step 1 — Generate data: Run DFT on hundreds or thousands of known structures. Calculate band gaps, formation energies, elastic constants.
Step 2 — Train ML model: Use the DFT results as labels. Train a model (e.g. ALIGNN, CGCNN) to predict these properties from structure.
Step 3 — Screen: Use the ML model to rapidly screen 100,000 candidate compounds. Cost: seconds.
Step 4 — Validate: Take the top 10–20 candidates identified by ML. Run full DFT calculations to verify. Cost: manageable.
This four-step loop has already led to the discovery of new battery materials, catalysts, and topological insulators — at a speed impossible with DFT alone.
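The four steps above can be sketched end to end with toy stand-ins. This is only an outline of the control flow under loud assumptions: `run_dft` is a hypothetical placeholder that returns a fake band gap from a single-number descriptor, the surrogate is a 1-nearest-neighbour lookup rather than ALIGNN or CGCNN, and the 1.0 to 1.5 eV target window and all counts are chosen purely for illustration.

```python
import random

def run_dft(descriptor):
    """Hypothetical stand-in for an expensive DFT run: returns a fake band gap in eV."""
    return 0.5 + 2.0 * descriptor

def train_model(labelled):
    """Toy surrogate: 1-nearest-neighbour lookup, not a graph neural network."""
    def predict(descriptor):
        nearest = min(labelled, key=lambda pair: abs(pair[0] - descriptor))
        return nearest[1]
    return predict

rng = random.Random(0)
candidates = [rng.random() for _ in range(5_000)]  # made-up descriptors in [0, 1)

# Step 1: generate data with the expensive method on a small subset.
training_set = [(x, run_dft(x)) for x in candidates[:200]]

# Step 2: train the surrogate on the DFT labels.
model = train_model(training_set)

# Step 3: screen every candidate with the cheap surrogate.
target = 1.25  # eV, centre of the illustrative target window
ranked = sorted(candidates, key=lambda x: abs(model(x) - target))

# Step 4: validate only the top candidates with the expensive method.
shortlist = ranked[:20]
validated = []
for x in shortlist:
    gap = run_dft(x)
    if 1.0 <= gap <= 1.5:
        validated.append((x, gap))

print(f"{len(validated)} of {len(shortlist)} shortlisted candidates confirmed")
```

In this sketch the expensive function is called 220 times instead of 5,000, which is the whole point of the pipeline: the surrogate decides where the expensive calculations are spent.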
A concrete example: band gap prediction
Suppose you want to find a semiconductor with a band gap between 1.0 and 1.5 eV (optimal for solar cells) from a database of 100,000 candidate structures.
DFT only: each calculation takes ~4 hours on a 16-core node. For 100,000 structures that is 400,000 node-hours (6.4 million core-hours), equivalent to about 45 years on a single node. Clearly impractical.
DFT plus ML: run DFT on 10,000 structures (40,000 node-hours, feasible on a cluster). Train an ML model. Screen all 100,000 structures with ML in minutes. Run DFT only on the top 50 candidates (200 node-hours). Total cost: ~40,200 node-hours. Speedup: ~10×.
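The cost comparison can be checked with a few lines of arithmetic. The sketch assumes that the ~4 hours quoted above is wall-clock time on one 16-core node, so costs are counted in node-hours (multiply by 16 for core-hours), and that ML training and inference time is negligible next to the DFT runs. The constant names are illustrative.

```python
HOURS_PER_DFT_RUN = 4      # wall-clock hours per structure on one node (assumed)
CORES_PER_NODE = 16
N_CANDIDATES = 100_000
N_TRAINING = 10_000        # structures labelled with DFT for training
N_VALIDATION = 50          # top candidates re-checked with DFT
HOURS_PER_YEAR = 24 * 365

# Brute force: one DFT run per candidate.
dft_only = N_CANDIDATES * HOURS_PER_DFT_RUN

# Hybrid: DFT for training labels and final validation only.
dft_plus_ml = (N_TRAINING + N_VALIDATION) * HOURS_PER_DFT_RUN

print(f"DFT only:    {dft_only:,} node-hours "
      f"({dft_only * CORES_PER_NODE:,} core-hours)")
print(f"             about {dft_only / HOURS_PER_YEAR:.1f} years on a single node")
print(f"DFT plus ML: {dft_plus_ml:,} node-hours")
print(f"Speedup:     {dft_only / dft_plus_ml:.1f}x")
```

Note that the speedup is set almost entirely by the ratio of candidates to training structures: the 50 validation runs add only 200 node-hours to the total.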
When to use each approach
| Situation | Best approach |
|---|---|
| Studying one new material in depth | DFT — maximum accuracy, full physics |
| Screening thousands of candidates | ML — after training on DFT data |
| Unknown compound class (no training data) | DFT — ML cannot extrapolate safely |
| High-throughput property prediction | ML — orders of magnitude faster |
| Validating ML predictions | DFT — always use DFT for final validation |
| Understanding electronic structure | DFT — DOS, band structure, Fermi surface |
Think of ML as a fast surrogate model trained on DFT. It approximates DFT at a fraction of the cost — but DFT remains the ground truth. Never trust an ML prediction without DFT validation for critical decisions.
What makes a good ML model for materials?
Not all ML models are equal for materials science. A good model must respect the fundamental symmetries of physics:
| Physical symmetry | What it means | Example |
|---|---|---|
| Rotational invariance | Rotating a crystal changes nothing physical | Formation energy of Fe₃O₄ is the same in any orientation |
| Translational invariance | Shifting the origin changes nothing physical | Properties don't depend on where you place the unit cell |
| Permutation invariance | Order of atoms in the list doesn't matter | Swapping atom 1 and atom 2 in the input gives the same property |
Graph Neural Networks like CGCNN and ALIGNN — which you may have encountered — are designed specifically to respect these symmetries. This is why they outperform simple neural networks for materials property prediction. We will study them in detail in Module 3.
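The three invariances in the table can be tested numerically on a toy input representation. The sketch below is not how CGCNN or ALIGNN encode structures; it assumes a small non-periodic cluster of identical atoms and uses the sorted list of interatomic distances as a hypothetical descriptor, simply because that quantity is unchanged by rotation, translation, and reordering.

```python
import math
from itertools import combinations

def descriptor(positions):
    """Sorted list of all interatomic distances (toy, non-periodic descriptor)."""
    return sorted(math.dist(p, q) for p, q in combinations(positions, 2))

def rotate_z(p, angle):
    """Rotate a point about the z axis by the given angle in radians."""
    x, y, z = p
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def translate(p, shift):
    """Shift a point by a constant vector."""
    return tuple(a + b for a, b in zip(p, shift))

def same(d1, d2, tol=1e-9):
    """Compare two descriptors element by element within a tolerance."""
    return len(d1) == len(d2) and all(abs(a - b) < tol for a, b in zip(d1, d2))

# Made-up coordinates for four identical atoms.
atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.5, 0.5, 1.5)]
reference = descriptor(atoms)

rotated = [rotate_z(p, 0.7) for p in atoms]
shifted = [translate(p, (3.0, -1.0, 2.5)) for p in atoms]
permuted = [atoms[2], atoms[0], atoms[3], atoms[1]]

print("invariant under rotation:   ", same(reference, descriptor(rotated)))
print("invariant under translation:", same(reference, descriptor(shifted)))
print("invariant under permutation:", same(reference, descriptor(permuted)))

# A flat list of raw coordinates, by contrast, changes when atoms are reordered.
raw = [c for p in atoms for c in p]
raw_permuted = [c for p in permuted for c in p]
print("raw coordinates unchanged by permutation:", raw == raw_permuted)
```

A model fed raw coordinates would have to learn from data that all these transformed inputs describe the same material, whereas a symmetry-respecting representation gives it that knowledge for free.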