The Algorithm That Split the World
From Branching Decisions to Gradient-Boosted Empires—How a Simple Idea Transformed AI
In the fall of 1963, Ross Quinlan was not yet famous. The Australian computer scientist wasn’t even focused on artificial intelligence. His early work in information retrieval would only later give rise to a deceptively simple structure that would become one of the most powerful tools in machine learning: the decision tree.
It began innocuously enough: how can one make decisions based on a sequence of yes/no questions? The kind of logic a child might use to guess an animal ("Is it bigger than a breadbox?"). But embedded in that logic was a deeper structure—one that would eventually help banks underwrite billions of dollars in loans, power fraud detection systems, and propel Kaggle competitors to the top of the leaderboard.
This is the story of the decision tree—from its dusty academic roots to its modern-day renaissance in gradient boosting frameworks like XGBoost and LightGBM. It is a story not just of algorithmic evolution, but of how ideas, when shaped by real-world complexity, splinter, deepen, and then… bloom.
The Splitting Principle: A Simple Math Engine
At its core, a decision tree asks: Which variable do I split on to best separate my data into clean classes?
Suppose you’re building a tree to predict whether a startup will succeed. You have features: funding size, team size, founder experience. For each feature, the tree considers all possible split points and evaluates how well the split purifies the outcomes.
This purification is measured via impurity metrics such as:
Gini Impurity:
$$G = 1 - \sum_i p_i^2$$
where $p_i$ is the proportion of examples in class $i$ at a given node.
Entropy (Information Gain):
$$H = -\sum_i p_i \log_2 p_i$$
Each candidate split is scored by how much it reduces impurity; when the impurity measure is entropy, this reduction is called Information Gain. The process repeats recursively: pick the best split, create child nodes, and keep going until a stopping criterion is met (maximum depth, minimum samples per node, and so on).
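To make the mechanics concrete, here is a minimal NumPy sketch of a single split search. The helper names and the toy startup data are illustrative, not from any particular library:

```python
import numpy as np

def gini(y):
    """Gini impurity: G = 1 - sum_i p_i^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(y):
    """Entropy: H = -sum_i p_i * log2(p_i)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(x, y, impurity=gini):
    """Scan all thresholds of a single feature and return the one that
    most reduces weighted child impurity (information gain, for entropy)."""
    parent = impurity(y)
    best_gain, best_threshold = 0.0, None
    for threshold in np.unique(x):
        left, right = y[x <= threshold], y[x > threshold]
        if len(left) == 0 or len(right) == 0:
            continue
        weighted = (len(left) * impurity(left) + len(right) * impurity(right)) / len(y)
        if parent - weighted > best_gain:
            best_gain, best_threshold = parent - weighted, threshold
    return best_threshold, best_gain

# Hypothetical feature: funding size in $M vs. startup success (0/1).
funding = np.array([0.5, 1.0, 2.0, 5.0, 10.0, 20.0])
success = np.array([0, 0, 0, 1, 1, 1])
print(best_split(funding, success))  # (2.0, 0.5): a perfect split
```

A real tree builder runs this search over every feature at every node; the principle stays the same.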
While conceptually elegant, the approach overfits easily. Enter pruning. A savvy algorithm trims back the overzealous branches, using techniques like cost-complexity pruning, to generalize better on unseen data.
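In scikit-learn, cost-complexity pruning is exposed through the ccp_alpha parameter. A minimal sketch, on synthetic data for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Enumerate the effective alphas at which subtrees get pruned away.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Refit at each alpha and keep the tree that generalizes best.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_test, y_test),
)
print(best.get_depth(), best.score(X_test, y_test))
```

In practice the alpha would be chosen by cross-validation rather than on a single held-out split.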
A Framework for Thinking About Evolution
To navigate the historical development, let’s borrow from strategy consultants and organize the decision tree’s evolution into three eras:
Foundational Era (1963–1986) – Decision trees as tools for classification.
Ensemble Era (1990–2010) – Boosting and bagging to combat overfitting and variance.
Optimization Era (2014–present) – Speed and scale through gradient optimization.
This framework helps us understand not just what changed, but why. Each step reflects a tension in AI’s grand arc: simplicity vs. accuracy, generality vs. performance, interpretability vs. power.
I. Foundations: ID3, CART, and the Birth of Branching Logic
In 1986, Ross Quinlan formally introduced ID3 (Iterative Dichotomiser 3). It used information gain and built trees in a top-down, greedy fashion. The approach, however, was limited to categorical features and couldn't handle missing data gracefully.
Around the same time, CART (Classification and Regression Trees) was born, thanks to Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone. Unlike ID3, CART could handle both classification and regression, and it introduced Gini Impurity and binary splits.
Use Case Snapshot:
A 1990s telecom firm used CART to predict customer churn by learning decision boundaries from call durations, complaints, and service usage. The simplicity allowed domain experts to validate decisions—an early nod to explainability.
But even with these innovations, trees were brittle. Enter the next phase.
II. Ensemble Renaissance: The Wisdom of Crowds
No single tree was perfect. But what if you grew many trees, each a little different, and aggregated their predictions?
This idea matured in the 1990s:
Bagging (Bootstrap Aggregation): Proposed by Breiman in 1996, bagging trains multiple trees on bootstrapped samples of the data and averages their outputs to reduce variance.
Random Forests (2001): Breiman struck again. By adding randomness in feature selection during splits, random forests further decorrelated trees, yielding remarkable robustness.
Meanwhile, another revolution brewed in the realm of boosting:
AdaBoost (1995): Freund and Schapire’s invention sequentially added trees, each focusing on examples misclassified by the previous. Trees were weak, but the ensemble was strong.
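All three ideas sit a few lines apart in scikit-learn. A sketch, with illustrative hyperparameters:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_informative=10, random_state=0)

models = {
    # Bagging: many trees on bootstrap resamples, predictions averaged.
    "bagging": BaggingClassifier(n_estimators=100, random_state=0),
    # Random forest: bagging plus random feature subsets at each split.
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # AdaBoost: weak learners fit sequentially, reweighting past mistakes.
    "adaboost": AdaBoostClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())
```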
Use Case Snapshot:
In credit scoring, AdaBoost models outperformed logistic regression by learning subtle feature interactions—how, say, recent job changes and mid-level loan amounts could signal risk in tandem.
Still, these models struggled with large data. Then came the modern era.
III. Optimization Era: The Age of Gradient Boosting
In 2014, Tianqi Chen, then a PhD student at the University of Washington, released XGBoost (Extreme Gradient Boosting)—an optimized, scalable implementation of gradient boosted decision trees.
Its advantages were staggering:
Parallelized tree construction
Regularization to avoid overfitting
Support for sparse data
Built-in cross-validation
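A sketch of those knobs from the Python API; parameter values are illustrative, and this assumes the xgboost package is installed:

```python
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, random_state=0)

model = xgb.XGBClassifier(
    n_estimators=300,
    max_depth=4,
    learning_rate=0.1,
    reg_lambda=1.0,  # L2 penalty on leaf weights: the built-in regularization
    reg_alpha=0.0,   # optional L1 penalty
    n_jobs=-1,       # parallelized tree construction
)
model.fit(X, y)

# Built-in cross-validation over the same objective.
results = xgb.cv(
    params={"max_depth": 4, "eta": 0.1, "objective": "binary:logistic"},
    dtrain=xgb.DMatrix(X, label=y),
    num_boost_round=300,
    nfold=5,
)
print(results["test-logloss-mean"].iloc[-1])
```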
Gradient Boosting refines boosting by treating it as a gradient descent problem in function space:
Let the goal be to minimize a loss function $L(y, F(x))$ over predictions $F(x)$. Each new tree fits the negative gradient of the loss with respect to $F(x)$, iteratively improving the predictions.
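For squared-error loss the negative gradient is just the residual $y - F(x)$, so the whole loop fits in a few lines. A from-scratch sketch, using scikit-learn regression trees as the weak learners:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate, trees = 0.1, []
F = np.full_like(y, y.mean())  # F_0: start from a constant prediction
for _ in range(100):
    residuals = y - F  # negative gradient of 0.5 * (y - F)^2
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    F += learning_rate * tree.predict(X)  # one step in function space
    trees.append(tree)

# Predicting new data: y.mean() + learning_rate * sum of all tree outputs.
print(np.mean((y - F) ** 2))  # training MSE shrinks as trees accumulate
```

Swap in a different loss and its gradient (log-loss for classification, say) and the same loop yields the general algorithm.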

Since then, even more efficient variants have emerged:
LightGBM: Grows trees leaf-wise rather than level-wise, with histogram-based binning.
CatBoost: Handles categorical variables more gracefully and avoids target leakage.
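Both expose scikit-learn-style interfaces. A minimal sketch, assuming the lightgbm and catboost packages are installed (this synthetic data has no categorical columns, so CatBoost's cat_features is noted only in a comment):

```python
import lightgbm as lgb
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, random_state=0)

# LightGBM: leaf-wise growth, capped by num_leaves rather than depth.
lgbm = lgb.LGBMClassifier(num_leaves=31, n_estimators=200).fit(X, y)

# CatBoost: pass categorical column indices via cat_features=[...] to get
# its ordered target-statistics encoding, which avoids target leakage.
cat = CatBoostClassifier(iterations=200, verbose=False).fit(X, y)

print(lgbm.score(X, y), cat.score(X, y))
```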
Use Case Snapshot:
At Airbnb, gradient boosted trees helped optimize pricing by capturing non-linear patterns across user behavior, seasonality, and listing quality—patterns linear models simply missed.
Decision Trees Timeline & Key Papers
| Year | Model | Contribution | Key Paper |
| --- | --- | --- | --- |
| 1963 | Decision Tables | Early rule-based learning | |
| 1984 | CART | Regression support, Gini impurity | Breiman, Friedman, Olshen & Stone, *Classification and Regression Trees* |
| 1986 | ID3 | Information-gain-based decision trees | Quinlan, "Induction of Decision Trees" |
| 1995 | AdaBoost | Sequential boosting of weak learners | Freund & Schapire, "A Decision-Theoretic Generalization of On-Line Learning" |
| 1996 | Bagging | Bootstrap aggregation | Breiman, "Bagging Predictors" |
| 2001 | Random Forests | Ensemble of decorrelated random trees | Breiman, "Random Forests" |
| 2014 | XGBoost | Fast, regularized gradient boosting | Chen & Guestrin, "XGBoost: A Scalable Tree Boosting System" |
| 2017 | LightGBM | Leaf-wise growth, histogram binning | Ke et al., "LightGBM: A Highly Efficient Gradient Boosting Decision Tree" |
| 2018 | CatBoost | Native handling of categorical features | Prokhorenkova et al., "CatBoost: Unbiased Boosting with Categorical Features" |
The Future: Trees in a Neural World
In a deep learning world dominated by transformer architectures, decision trees persist. Why? Because they still routinely match or beat deep nets on small-to-medium structured (tabular) data. And because they are, at heart, transparent.
In healthcare, fintech, and regulation-heavy industries, decision trees remain the interpreter’s best friend.
The future may not be a single tree or forest, but a hybrid jungle—neural-symbolic blends where trees explain, while nets intuit. After all, the startup executive looking to harness AI doesn’t just want power. They want control.
And it all starts with a single split.