Machine Learning - The Mathematics of Decision Trees and Random Forests - Part 1
Regression Trees
Views and opinions expressed are solely my own.
Introduction
The purpose of this post is to discuss the mathematics of decision trees and random forests. We begin by discussing regression trees.
Regression Trees
Suppose we have training data \((\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)\) with \(\mathbf{x}_i \in R \subset \mathbb{R}^p\) and \(y_i \in \mathbb{R}\). We call \(R\) the predictor space. We partition \(R\) into non-overlapping regions \(R_1, \dots, R_J\) in \(\mathbb{R}^p\). We estimate the response using \[\begin{equation*} \hat{f}(\mathbf{x}) = \sum_{j=1}^{J}c_j \cdot \mathbf{I}(\mathbf{x} \in R_j) \end{equation*}\] where \(c_1, \dots, c_J\) are constants, and \(\mathbf{I}\) is the indicator function. That is, in each region, we assign a single predicted value. The loss function we aim to minimize is \[\begin{equation*} L(\mathbf{c}) = \sum_{i=1}^{n}\left[y_i - \hat{f}(\mathbf{x}_i) \right]^2 = \sum_{i=1}^{n}\left[y_i - \sum_{j=1}^{J}c_j \cdot \mathbf{I}(\mathbf{x}_i \in R_j) \right]^2 \text{.} \end{equation*}\] The partial derivative of \(L\) with respect to \(c_k\) is given by \[\begin{align*} \dfrac{\partial L}{\partial c_k} &= -2\sum_{i=1}^{n}\left[y_i - \sum_{j=1}^{J}c_j \cdot \mathbf{I}(\mathbf{x}_i \in R_j) \right]\mathbf{I}(\mathbf{x}_i \in R_k) \\ &= -2\left[\sum_{i=1}^{n}y_i \cdot \mathbf{I}(\mathbf{x}_i \in R_k) - \sum_{i=1}^{n}\sum_{j=1}^{J}c_j \cdot \mathbf{I}(\mathbf{x}_i \in R_j)\mathbf{I}(\mathbf{x}_i \in R_k)\right] \\ &= -2\left[\sum_{i=1}^{n}y_i \cdot \mathbf{I}(\mathbf{x}_i \in R_k) - \sum_{i=1}^{n}\sum_{j=1}^{J}c_j \cdot \mathbf{I}(\mathbf{x}_i \in R_j \cap R_k)\right]\text{.} \end{align*}\] Now consider the sum \[\begin{equation*} \sum_{j=1}^{J}c_j \cdot \mathbf{I}(\mathbf{x}_i \in R_j \cap R_k)\text{.} \end{equation*}\] Since the regions \(R_1, \dots, R_J\) are non-overlapping, \(\mathbf{I}(\mathbf{x}_i \in R_j \cap R_k) = 0\) when \(j \neq k\). Thus \[\begin{equation*} \sum_{j=1}^{J}c_j \cdot \mathbf{I}(\mathbf{x}_i \in R_j \cap R_k) = c_k \cdot \mathbf{I}(\mathbf{x}_i \in R_k \cap R_k) = c_k \cdot \mathbf{I}(\mathbf{x}_i \in R_k)\text{.} \end{equation*}\] Lastly, the sum \[\begin{equation*} \sum_{i=1}^{n}c_k \cdot \mathbf{I}(\mathbf{x}_i \in R_k) = c_k \sum_{i=1}^{n}\mathbf{I}(\mathbf{x}_i \in R_k) = c_kn_k \end{equation*}\] where \(n_k\) is the number of \(\mathbf{x}_i\) in region \(R_k\). Hence we have \[\begin{equation*} \dfrac{\partial L}{\partial c_k} = -2\left[\sum_{i=1}^{n}y_i \cdot \mathbf{I}(\mathbf{x}_i \in R_k) - c_kn_k\right]\text{.} \end{equation*}\] Setting the above equal to \(0\), we find that the optimal solution for \(c_k\) is \[\begin{equation*} \hat{c}_k = \dfrac{1}{n_k}\sum_{i=1}^{n}y_i \cdot \mathbf{I}(\mathbf{x}_i \in R_k) = \dfrac{1}{n_k}\sum_{\{i: \mathbf{x}_i \in R_k\}}y_i\text{.} \end{equation*}\] In short: the best prediction for the observations whose \(\mathbf{x}_i\) values lie in the same region is the average of their \(y_i\) values.
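To make this concrete, here is a minimal numerical sketch in Python with NumPy (a choice of mine; the post itself involves no code). The two regions \(R_1 = \{x < 0.5\}\) and \(R_2 = \{x \geq 0.5\}\) and the toy data are invented purely for illustration: the code computes each \(\hat{c}_k\) as the within-region average and evaluates the piecewise-constant \(\hat{f}\).

```python
import numpy as np

# Toy data: one predictor, with two fixed regions R_1 = {x < 0.5}, R_2 = {x >= 0.5}.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=20)
y = np.where(x < 0.5, 1.0, 3.0) + rng.normal(scale=0.1, size=20)

# Region membership indicators I(x_i in R_k).
regions = [x < 0.5, x >= 0.5]

# Optimal constants: c_hat_k = (1 / n_k) * sum of the y_i with x_i in R_k,
# i.e., the average response within each region.
c_hat = np.array([y[mask].mean() for mask in regions])

# Piecewise-constant estimate f_hat(x) = sum_k c_hat_k * I(x in R_k).
def f_hat(x_new):
    masks = [x_new < 0.5, x_new >= 0.5]
    return sum(c * m for c, m in zip(c_hat, masks))

print(c_hat)                          # region averages, roughly (1, 3)
print(f_hat(np.array([0.2, 0.9])))    # predictions for two new points
```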
Construction of the Regions \(R_1, \dots, R_J\)
It is computationally infeasible to allow the regions \(\{R_j\}_{j=1}^{J}\) to be arbitrary subsets of \(R\). Thus, we construct these regions using an algorithm known as recursive binary splitting, which is a top-down, greedy algorithm; i.e.,
- it is top-down since it begins with the entirety of the predictor space \(R\) and then successively splits it, and
- it is greedy because, at each step at which a region is split, the best split is made at that particular step, rather than a split that might lead to a better tree at some later step.1
How the algorithm works is as follows: write \(R\) as the Cartesian product \[\begin{equation*} R = A_1 \times A_2 \times \cdots \times A_p\text{,} \end{equation*}\] where each \(A_j \subset \mathbb{R}\). For each predictor index \(j \in \{1, \dots, p\}\) and cutpoint \(s \in \mathbb{R}\), define the pair of half-spaces \[\begin{align*} &R_{1}(j, s) = \{\mathbf{x} \in R \mid x_j < s\} \\ &R_{2}(j, s) = \{\mathbf{x} \in R \mid x_j \geq s\} \end{align*}\] where \(x_j\) denotes the \(j\)th component of \(\mathbf{x}\). We then seek the values of \(j\) and \(s\) that solve the problem \[\begin{equation*} \min_{j, s}\left[\min_{c_1}\sum_{\{i: \mathbf{x}_i \in R_1(j, s)\}}(y_i - c_1)^2 + \min_{c_2}\sum_{\{i: \mathbf{x}_i \in R_2(j, s)\}}(y_i - c_2)^2 \right]\text{.} \end{equation*}\] As shown in the previous section, the optimal value of \(c_1\) is the average of the \(y_i\) whose \(\mathbf{x}_i\) are in \(R_1(j, s)\), and the optimal value of \(c_2\) is the average of the \(y_i\) whose \(\mathbf{x}_i\) are in \(R_2(j, s)\). Provided \(p\) is not too large, the optimal \(j\) and \(s\) can be found by scanning every predictor and every candidate cutpoint, although this search can be time-consuming.
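As a sketch of how this search could be carried out (again assuming only NumPy; the function name `best_split`, the choice of midpoints as candidate cutpoints, and the toy data are illustrative, not prescribed by anything above), one can scan every predictor \(j\) and every candidate cutpoint \(s\), using the within-half averages as \(c_1\) and \(c_2\):

```python
import numpy as np

def best_split(X, y):
    """Exhaustively search for the predictor j and cutpoint s minimizing
    RSS(R_1) + RSS(R_2), where each half uses its mean response as c_1, c_2."""
    n, p = X.shape
    best = (None, None, np.inf)  # (j, s, total RSS)
    for j in range(p):
        # Candidate cutpoints: midpoints between consecutive sorted values of X[:, j].
        values = np.sort(np.unique(X[:, j]))
        for s in (values[:-1] + values[1:]) / 2:
            left = X[:, j] < s
            right = ~left
            rss = ((y[left] - y[left].mean()) ** 2).sum() + \
                  ((y[right] - y[right].mean()) ** 2).sum()
            if rss < best[2]:
                best = (j, s, rss)
    return best

# Toy example with p = 2 predictors and a step in the first one.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(50, 2))
y = np.where(X[:, 0] < 0.6, 0.0, 2.0) + rng.normal(scale=0.1, size=50)
print(best_split(X, y))  # expected: j = 0, s near 0.6
```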
After identifying the two regions, we choose one of them to split further, again selecting a predictor and a cutpoint, and we repeat this process until a stopping criterion is reached.
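A compact sketch of the full recursive procedure is below, again with illustrative choices on my part: the dictionary representation of a node, the `min_leaf` stopping rule (stop once a region holds at most five observations), and the requirement that a split reduce the residual sum of squares are not prescribed by the text.

```python
import numpy as np

def grow_tree(X, y, min_leaf=5):
    """Recursively apply binary splitting until a stopping criterion is met
    (here: a region with at most `min_leaf` observations, or no RSS-reducing split).
    A node is either a leaf {'value': mean of y} or an internal node
    {'j': predictor index, 's': cutpoint, 'left': subtree, 'right': subtree}."""
    if len(y) <= min_leaf:
        return {"value": y.mean()}
    best_j, best_s, best_rss = None, None, ((y - y.mean()) ** 2).sum()
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j])[1:]:          # candidate cutpoints
            left = X[:, j] < s
            rss = ((y[left] - y[left].mean()) ** 2).sum() + \
                  ((y[~left] - y[~left].mean()) ** 2).sum()
            if rss < best_rss:
                best_j, best_s, best_rss = j, s, rss
    if best_j is None:                            # no split improves the fit
        return {"value": y.mean()}
    left = X[:, best_j] < best_s
    return {"j": best_j, "s": best_s,
            "left": grow_tree(X[left], y[left], min_leaf),
            "right": grow_tree(X[~left], y[~left], min_leaf)}

def predict_one(tree, x):
    """Follow the splits down to a leaf and return its constant prediction."""
    while "value" not in tree:
        tree = tree["left"] if x[tree["j"]] < tree["s"] else tree["right"]
    return tree["value"]

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, size=(200, 2))
y = np.where(X[:, 0] < 0.3, 0.0, np.where(X[:, 1] < 0.6, 1.0, 2.0))
tree = grow_tree(X, y)
print(predict_one(tree, np.array([0.1, 0.9])))   # should be close to 0.0
```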
We refer to this procedure as a decision tree because the sequence of splits forms a tree of regions: we start with the predictor space \(R\) and branch into progressively smaller subsets of \(R\) at each step.
Pruning: dealing with overfitting
The procedure described above is likely to overfit: the fitted tree may be very complex, so its predictions generalize poorly to new (test) data. One remedy is to prune the tree. Choose some value \(\alpha \geq 0\), and suppose we have a large tree \(T_0\) that we wish to prune (i.e., collapse into fewer regions) to obtain a subtree \(T \subset T_0\). Let \(|T|\) denote the number of terminal nodes (leaves) of \(T\), let \(R_m\) be the region corresponding to the \(m\)th terminal node, and let \(\hat{c}_m\) be the average of the \(y_i\) with \(\mathbf{x}_i \in R_m\). We then find the subtree \(T\) that minimizes \[\begin{equation*} \sum_{m=1}^{|T|}\sum_{\{i: \mathbf{x}_i \in R_m\}}(y_i - \hat{c}_m)^2 + \alpha|T|\text{,} \end{equation*}\] which is a loss function with a regularization (complexity) penalty: \(\alpha\) controls the trade-off between how well the subtree fits the training data and how many terminal nodes it has. If \(\alpha = 0\), we simply recover the original tree \(T_0\); in practice, \(\alpha\) is chosen by cross-validation.
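The simplified sketch below only evaluates this penalized criterion for a few hand-picked candidate "subtrees," each represented in one dimension by its set of cutpoints (so \(|T|\) is the number of resulting regions). Actual pruning operates on the fitted tree \(T_0\) itself (e.g., via weakest-link pruning); the data, candidate cutpoints, and value of \(\alpha\) are invented for the example.

```python
import numpy as np

def penalized_rss(y, x, cutpoints, alpha):
    """Sum of within-region RSS plus alpha * |T|, where the candidate subtree T
    is represented (in 1-D, for illustration) by its sorted cutpoints and
    |T| = number of terminal nodes = number of regions."""
    edges = [-np.inf] + sorted(cutpoints) + [np.inf]
    rss = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (x >= lo) & (x < hi)
        if mask.any():
            rss += ((y[mask] - y[mask].mean()) ** 2).sum()
    return rss + alpha * (len(cutpoints) + 1)

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, size=100)
y = np.where(x < 0.5, 0.0, 2.0) + rng.normal(scale=0.3, size=100)

# Candidate subtrees of a hypothetical large tree T_0, from coarse to fine.
candidates = [[], [0.5], [0.25, 0.5], [0.25, 0.5, 0.75]]
alpha = 5.0
scores = [penalized_rss(y, x, c, alpha) for c in candidates]
print(min(zip(scores, candidates)))  # candidate with the smallest penalized loss
```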
Random Forests (from regression trees)
Rather than fitting a single tree to the training data of size \(n\), one can draw \(B\) bootstrap samples of size \(n\) from the training data, grow a regression tree on each sample, and output the ensemble of trees \(\{T_b\}_{b=1}^{B}\). When, in addition, each split within each tree is restricted to a randomly chosen subset of the \(p\) predictors, the resulting ensemble is called a random forest; the individual trees are typically grown deep and left unpruned. Averaging the predictions of the trees yields the random forest predictor for an input \(\mathbf{x} \in \mathbb{R}^p\). Details are laid out in Sections 15.2-15.3 of Hastie, Tibshirani, and Friedman (2009).
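A minimal sketch of the ensemble idea is below. It uses scikit-learn's `DecisionTreeRegressor` as the base learner (my assumption; the post does not reference any particular library), with `max_features` restricting each split to a random subset of predictors and a bootstrap sample of size \(n\) drawn for each of the \(B\) trees; the data and parameter values are illustrative only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def random_forest_predict(X_train, y_train, X_new, B=100, max_features="sqrt", seed=0):
    """Grow B trees on bootstrap samples and average their predictions.
    max_features restricts each split to a random subset of predictors,
    which is what distinguishes a random forest from plain bagging."""
    rng = np.random.default_rng(seed)
    n = X_train.shape[0]
    preds = np.zeros((B, X_new.shape[0]))
    for b in range(B):
        idx = rng.integers(0, n, size=n)          # bootstrap sample of size n
        tree = DecisionTreeRegressor(max_features=max_features,
                                     random_state=int(rng.integers(1 << 31)))
        tree.fit(X_train[idx], y_train[idx])
        preds[b] = tree.predict(X_new)
    return preds.mean(axis=0)                     # average over the ensemble

# Illustrative data: response depends smoothly on the first of five predictors.
rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(200, 5))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.2, size=200)
X_new = rng.uniform(0, 1, size=(5, 5))
print(random_forest_predict(X, y, X_new))
```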
References
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). New York, NY: Springer Science+Business Media, LLC.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. New York, NY: Springer Science+Business Media.
See also Wikipedia’s explanation of greedy algorithms.↩︎