Fork in the Code
Understanding the Core Mathematics Behind Decision Trees
Somewhere between the cool corridors of Stanford’s computer science labs and the sunlit co-working spaces of SoMa, a curious evolution has been quietly unfolding. A new generation of startup executives is no longer satisfied with simply invoking the magic of AI—they want to understand it. Not just the elevator pitch or the product roadmap, but the beating mathematical heart beneath it all.
And perhaps nowhere is this tension—between abstraction and intuition, between code and conviction—more visible than in the decision tree.
It’s the kind of algorithm that feels disarmingly human. Like a manager with a clipboard making logical decisions. “Is the customer under 25? No? Are they a repeat buyer? Yes? Offer the premium bundle.” It's logic wrapped in the velvet glove of probability—and beneath that glove lies a clean mathematical structure that, if you squint just right, looks a lot like common sense.
But this common sense, it turns out, is the result of a century of intellectual ambition.
A Brief History: From Mahalanobis to McKinsey
To appreciate the decision tree’s place in the AI pantheon, we must first take a detour into the statistical salons of the 20th century. In 1936, the Indian polymath P.C. Mahalanobis introduced the generalized distance that now bears his name, a way of classifying populations by how far an observation sits from a group’s statistical center, and an early whiff of what would become modern machine learning. Fast-forward to the 1960s, when researchers at IBM and Stanford began to formalize classification rules with binary splits.
By the 1980s, with computing power catching up to ambition, algorithms like ID3 and its successor C4.5 (both developed by Ross Quinlan) brought decision trees into the mainstream. Their allure? No black box. These were models you could explain at a board meeting.
In the McKinsey world, where storytelling is as important as analysis, this was gold. Consultants could map decisions as branching strategies—precisely what decision trees do. In fact, many digital transformation frameworks, like the MECE principle (Mutually Exclusive, Collectively Exhaustive), bear an eerie resemblance to the clean splits of a decision tree.
The Executive’s Framework: Three Pillars of Decision Trees
Let’s put this in a framework—the language of strategy. Decision trees rest on three mathematical pillars:
1. Splitting Criteria: How to Make the Cut
At the heart of every decision tree is a question: where do we split the data?
The most common answers are Gini impurity and information gain. Imagine you’re segmenting customers into those who churn and those who don’t. The goal is to find the feature (say, “days since last login”) that best separates the two groups.
Gini Impurity: Measures how mixed a node is. Formally, G = 1 − Σᵢ pᵢ², where pᵢ is the share of the node belonging to class i. If a node is 50% churners and 50% loyalists, Gini sits at its binary maximum of 0.5 (bad). If it’s 100% one class, Gini is zero (ideal).
Information Gain: Based on entropy (yes, that same entropy from thermodynamics), H = −Σᵢ pᵢ log₂ pᵢ. A split’s information gain is the parent node’s entropy minus the size-weighted entropy of its children: how much uncertainty the split removes.
In boardroom terms: you're looking for the cleanest break. The one that sharpens your strategic segmentation.
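To make those criteria concrete, here is a minimal sketch in plain Python. The function names and toy labels are mine for illustration, not from any particular library:

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2). 0.0 is a pure node; 0.5 is a 50/50 binary node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p_i * log2(p_i))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# A 50/50 node is maximally impure; a clean split removes all uncertainty.
parent = ["churn"] * 5 + ["loyal"] * 5
print(gini(parent))                                             # 0.5
print(information_gain(parent, ["churn"] * 5, ["loyal"] * 5))   # 1.0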
2. Recursive Partitioning: Divide and Conquer
Once you've found a good split, you repeat the process. This is recursion—the algorithmic equivalent of McKinsey’s 80/20 rule applied again and again.
Mathematically, the algorithm grows a binary tree:
At each internal node, it evaluates every candidate split (each feature paired with each threshold) and keeps the one that most reduces impurity.
At each leaf, it assigns a class label (or, for a regression tree, a numeric value).
The tree stops growing when:
All data in a node belongs to one class.
The tree reaches a max depth.
Too few samples remain in a node to justify another split.
Think of this like scenario planning: starting from a strategic decision (“enter new market?”) and branching out into a series of contingent choices, risks, and responses.
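To see the recursion in miniature, here is a toy partitioner in plain Python. Every name in it (best_split, grow, the (features, label) row format) is invented for this sketch:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, features):
    """Scan every feature/threshold pair; return the split with the lowest weighted Gini."""
    best = None
    for f in features:
        for t in sorted({row[0][f] for row in rows}):
            left = [r for r in rows if r[0][f] < t]
            right = [r for r in rows if r[0][f] >= t]
            if not left or not right:
                continue  # a real split must send data both ways
            score = (len(left) * gini([r[1] for r in left])
                     + len(right) * gini([r[1] for r in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    return best

def grow(rows, features, depth=0, max_depth=3, min_samples=2):
    """Recursively partition rows, stopping on purity, depth, or sample count."""
    labels = [r[1] for r in rows]
    if len(set(labels)) == 1 or depth >= max_depth or len(rows) < min_samples:
        return max(set(labels), key=labels.count)  # leaf: majority class
    split = best_split(rows, features)
    if split is None:
        return max(set(labels), key=labels.count)
    _, f, t, left, right = split
    return {"if": f"{f} < {t}",
            "then": grow(left, features, depth + 1, max_depth, min_samples),
            "else": grow(right, features, depth + 1, max_depth, min_samples)}

# Usage: rows are (feature_dict, label) pairs, a format made up for this sketch.
rows = [({"logins": 1, "tickets": 3}, "churn"), ({"logins": 8, "tickets": 0}, "loyal"),
        ({"logins": 2, "tickets": 4}, "churn"), ({"logins": 6, "tickets": 1}, "loyal")]
print(grow(rows, ["logins", "tickets"]))  # {'if': 'logins < 6', 'then': 'churn', 'else': 'loyal'}
```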
3. Pruning: Avoiding the Overfit Trap
Like any ambitious executive, a decision tree left unchecked will overdo it.
It will keep splitting until it fits every last quirk of the training data, like a consultant who tailors a deck so tightly to the client’s biases that it loses generality. This is called overfitting.
The mathematical fix? Pruning. You trim back the tree—remove nodes that don't add real predictive value. Statistically, this involves:
Cost complexity pruning (also known as weakest link pruning), which minimizes the penalized error Rα(T) = R(T) + α·|T|, where R(T) is the tree’s error, |T| its number of leaves, and α the price paid for each extra leaf. Raising α trades complexity away for generality.
Cross-validation, testing tree performance on unseen data.
McKinsey would call this “pressure-testing the hypothesis.”
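In scikit-learn, both ideas take only a few lines. This is a hedged sketch: the dataset is synthetic and the variable names are mine, but cost_complexity_pruning_path, ccp_alpha, and cross_val_score are the library’s real handles for weakest-link pruning and cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)  # stand-in data

# Each ccp_alpha on this path corresponds to pruning away one "weakest link".
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Pressure-test each candidate tree on unseen folds; keep the alpha that holds up.
scores = [
    (cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y, cv=5).mean(), a)
    for a in path.ccp_alphas
]
best_score, best_alpha = max(scores)
print(f"best alpha={best_alpha:.4f}, cv accuracy={best_score:.3f}")
```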
Real Example: From Churn Prediction to Credit Risk
Let’s say you're building a SaaS startup and you want to predict which customers will churn.
You start with features like login frequency, support tickets opened, and product usage.
The decision tree might first split on login frequency ("< 3 times/week?").
Then, within that group, it might split on support ticket sentiment ("Negative sentiment?").
Eventually, the tree might say: “If a customer logs in less than 3 times per week and has opened more than 2 negative support tickets in the past month, they have an 85% chance of churning.”
The power here isn’t just prediction—it’s narrative. You can walk your product team through each decision path and say, “These are the warning signs.”
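Here is what that walkthrough can look like in code. Another hedged sketch: the data below is synthetic and wired so that the story’s rule holds, but export_text is scikit-learn’s real tool for printing a fitted tree as readable if/then rules:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 400
logins_per_week = rng.integers(0, 10, n)
negative_tickets = rng.integers(0, 5, n)
# Toy ground truth: infrequent users with several angry tickets tend to churn.
churned = ((logins_per_week < 3) & (negative_tickets > 2)).astype(int)

X = np.column_stack([logins_per_week, negative_tickets])
tree = DecisionTreeClassifier(max_depth=3).fit(X, churned)

# The whole point: the model prints as the decision paths you walk the team through.
print(export_text(tree, feature_names=["logins_per_week", "negative_tickets"]))
```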
Banks use the same principle for credit scoring. Decision trees have powered models that determine loan approval by walking through income brackets, employment history, and debt-to-income ratios. Again: explainability is key.
Where We Go From Here
The modern decision tree has evolved into ensembles—random forests and gradient boosted trees—that use the same principles but layer them into collective judgments. It’s no longer one voice at the table; it’s a panel of advisors voting on each decision.
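A sketch of that panel of advisors, again on synthetic data; the forest exposes its individual trees, each of which votes on every prediction:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(len(forest.estimators_), "trees vote on each decision")
```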
But the mathematics remain rooted in something beautifully simple: asking the right questions at the right branch.
For the AI executive racing from product reviews to investor calls, this clarity matters. Because behind every “intelligent” model is a scaffolding of logic, probability, and recursion. Knowing that lets you lead with more than instinct—it lets you lead with understanding.