In this lecture we look at the nonparametric regression problem with squared loss \(L(y_1,y_2)=(y_1-y_2)^2\) and at the classification problem with the binary (0-1) loss \(L(y_1,y_2)=\mathbb 1(y_1\neq y_2)\) or squared loss.
We already know that the Bayes rule is the conditional mean \(m(x)=\mathbb E[Y\mid X=x]\) under squared loss, and the most probable class \(m(x)=\arg\max_{y\in\mathcal Y} P(Y=y\mid X=x)\) under binary loss.
The tree estimate is a piecewise constant function.
The constant areas of the tree estimate are called leaves or terminal nodes.
The leaves partition the feature space \(\mathcal X:\)
The value within a leaf is given by averaging the responses (regression with \(L_2\) loss) or by majority vote (classification with binary loss).
Decision trees are popular because the decision making (estimate) is nicely visualizable/interpretable if not too deep:
The left and right pictures below are two different representations of the same tree estimate.
The interior vertices of the right plot are called nodes; each node can also be interpreted as a subset of \(\mathcal X\).
Definition (Tree)
Let \((\mathcal X, \mathcal F)\) and \((\mathcal Y, \mathcal E)\) be measurable spaces and consider the supervised learning problem corresponding to the product space \((\mathcal X \times \mathcal Y, \mathcal F \otimes \mathcal E, P)\) where \(P\) is some probability measure.
A tree model \(T\) consists of some finite system \(\mathcal B \subseteq \mathcal F\) of \(\mathcal F\)-measurable sets and a function \(f: \mathcal B \smallsetminus \{\mathcal X\} \rightarrow \mathcal B\) that fulfill the following conditions:
Definition (Leaves)
Given a tree model \(T = (\mathcal B, f)\), we say that \[ \mathcal A = \mathcal A(T) = \{A \in \mathcal B \mid f^{-1}(\{A\}) = \emptyset\} \] is the set of leaves (or terminal nodes) for the tree.
Definition (Tree estimator)
In the case of regression, the empirical squared-loss-minimizing decision function produced by the tree model is \[ \hat{m}_n(x) = \sum_{A \in \mathcal A} \mathbb 1\{x \in A\}\left(\frac{\sum_{i=1}^n \mathbb 1\{X_i \in A\}Y_i}{\sum_{i=1}^n \mathbb 1\{X_i \in A\}}\right). \] In the case of classification, the empirical classification-error-minimizing decision function is \[ \hat{m}_n(x) = \sum_{A \in \mathcal A} \mathbb 1\{x \in A\}\cdot \textrm{argmax}_{y \in \mathcal Y}\left(\sum_{i=1}^n \mathbb 1\{X_i \in A\}\mathbb 1\{Y_i = y\}\right). \]
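As a toy illustration of the two formulas above (everything here is hypothetical: interval leaves partitioning \(\mathcal X=[0,1)\) and a small made-up training set), the tree estimator can be evaluated as:

```python
from collections import Counter

leaves = [(0.0, 0.5), (0.5, 1.0)]   # the partition A(T), as half-open intervals
X = [0.1, 0.2, 0.6, 0.7, 0.9]       # training features
Y = [1.0, 3.0, 4.0, 6.0, 5.0]       # training responses

def m_hat(x):
    # regression: average the responses of training points in x's leaf
    for a, b in leaves:
        if a <= x < b:
            ys = [y for xi, y in zip(X, Y) if a <= xi < b]
            return sum(ys) / len(ys)

def m_hat_class(x, labels):
    # classification: majority vote among training labels in x's leaf
    for a, b in leaves:
        if a <= x < b:
            votes = [l for xi, l in zip(X, labels) if a <= xi < b]
            return Counter(votes).most_common(1)[0][0]
```

For instance, `m_hat(0.3)` averages the responses in the leaf \([0,0.5)\), and the classification variant returns the most frequent label in that leaf.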
Remark
Note that a decision function stemming from a decision tree is uniquely defined by the leaves of the tree; in this sense, the tree estimate is determined by its leaves alone.
The naive approach of directly searching for the loss-minimizing set of leaves (ignoring the tree structure) is computationally infeasible.
Instead: top-down greedy approach, called CART (classification and regression trees).
We now describe how to construct nodes via the CART algorithm.
CART (heuristic)
CART algorithm (start)
Start with \(\mathcal X\) as initial node for splitting.
CART algorithm (splitting)
CART Algorithm: the loss \(Q\)
For regression the most common loss function is the squared loss:
In the classification case, define \(\hat p_{lk}= \frac 1 {|B_l|} \sum_{i: X_i \in B_l} \mathbb 1(Y_i=k)\) (= the proportion of class-\(k\) observations in \(B_l\)).
Common loss functions are:
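(The standard choices are the Gini index \(\sum_k \hat p_{lk}(1-\hat p_{lk})\), the cross-entropy \(-\sum_k \hat p_{lk}\log \hat p_{lk}\), and the misclassification error \(1-\max_k \hat p_{lk}\).) A small sketch computing these from the empirical proportions:

```python
import math
from collections import Counter

def p_hat(labels):
    # empirical class proportions (the \hat p_{lk} defined above) within a node
    n = len(labels)
    return {k: c / n for k, c in Counter(labels).items()}

def gini(labels):
    return sum(p * (1 - p) for p in p_hat(labels).values())

def cross_entropy(labels):
    return -sum(p * math.log(p) for p in p_hat(labels).values())

def misclassification(labels):
    return 1.0 - max(p_hat(labels).values())
```

All three vanish on a pure node and are maximal when the classes are evenly mixed; Gini and cross-entropy are the differentiable choices usually preferred for growing the tree.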
CART algorithm (stopping criterion)
The most common stopping criterion is specifying a min-node size, i.e. stop if \(|B_j|<c\). A common choice is \(c=5\).
An alternative stopping criterion is the depth of \(B_j\), i.e., the number of ancestor nodes of \(B_j\).
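A minimal sketch of the greedy CART recursion for one-dimensional regression with squared loss and min-node-size stopping (toy code, not a production implementation):

```python
def sse(ys):
    # squared loss Q_n of a node: sum of squared deviations from the node mean
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(data):
    # greedy step: scan all midpoints between distinct x-values and pick
    # the split minimizing the summed squared loss of the two children
    xs = sorted(set(x for x, _ in data))
    best = None
    for a, b in zip(xs, xs[1:]):
        s = (a + b) / 2
        q = sse([y for x, y in data if x <= s]) + sse([y for x, y in data if x > s])
        if best is None or q < best[0]:
            best = (q, s)
    return best

def grow(data, c=5):
    # stop if the node is smaller than the min-node size c (or cannot be split)
    ys = [y for _, y in data]
    if len(data) < c or len(set(x for x, _ in data)) == 1:
        return ("leaf", sum(ys) / len(ys))
    _, s = best_split(data)
    return ("split", s,
            grow([(x, y) for x, y in data if x <= s], c),
            grow([(x, y) for x, y in data if x > s], c))

def predict(tree, x):
    if tree[0] == "leaf":
        return tree[1]
    _, s, left, right = tree
    return predict(left, x) if x <= s else predict(right, x)

# toy step function: x in [0,5) -> 0, x in [5,10) -> 1
data = [(float(i), 0.0) for i in range(5)] + [(float(i), 1.0) for i in range(5, 10)]
tree = grow(data, c=5)
```

On the toy data the first greedy split lands at \(4.5\) and recovers the step function exactly.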
The process described so far will likely lead to overfitting.
One may think that one way out is to stop growing the tree early enough.
In the problem below, the first split will not lead to a big improvement in the loss. We will need at least two (probably more) splits before we arrive at a good decision rule.
The idea of pruning is to let a tree grow very deep first (= small min node size) and then select a subtree (pruning) as final fit.
Idea: We can compare different subtrees via a penalized loss.
Definition (Subtree)
Consider a tree model \(T = (\mathcal B, f)\). We say that a tree model \(T' = (\mathcal B', f')\) is a sub-tree if \(\mathcal B' \subseteq \mathcal B\) and \(f = f'\) on \(\mathcal B'\). Note that we may abuse the notation and write \((\mathcal B', f)\) instead of \((\mathcal B', f')\).
Definition (\(\alpha\)-pruned tree)
Consider a tree \(\hat T_n\) with corresponding estimator \(\hat m_n^{\hat T_n}\). Given a parameter \(\alpha>0\), the \(\alpha\)-pruned tree is \[ \hat T_{n,\alpha} := \arg \min_{T\subseteq \hat T_n}\{\hat R_n(\hat m_n^{T}) + \alpha |T|\}, \] where \(\hat m_n^T\) is the estimator corresponding to the subtree \(T\) and \(|T|\) is the number of leaves of \(T\).
Proposition (weakest link, see Breiman (1984))
\(\hat T_{n,\alpha}\) is unique. Furthermore, there are nested trees \(T_0\supset \cdots\supset T_l\) with \(\{T_0,\dots, T_l\}=\{\hat T_{n,\alpha} : \alpha>0\}\); in particular the set \(\{\hat T_{n,\alpha} : \alpha>0\}\) is finite. Let \(Q_n\) be the loss function corresponding to \(\hat R_n\). The trees \(T_0,\dots, T_l\) can be found with the following algorithm:
Let \(B_L,B_R\) be any two terminal nodes in \(\hat T_n\) resulting from a split of the immediate ancestor node \(t\). If \(Q_n(t) =Q_n(B_L)+Q_n(B_R)\), prune off \(B_L\) and \(B_R\). Continue this process until no more pruning is possible. The resulting tree is \(T_0\), \(\alpha_0=0\).
\(k=0\)
Go through all non-terminal nodes \(B\) of \(T_k\) and calculate \[ g(B)=\frac{Q_n(B)-Q_n(S_B)}{|S_B|-1}, \] where \(S_B\) is the tree (branch) with root \(B\), consisting of \(B\) and the descendants of \(B\) in \(T_k\), \(Q_n(S_B)=\sum_{A \ \text{leaf of}\ S_B} Q_n(A)\), and \(|S_B|\) is the number of leaves of \(S_B\).
\(\alpha_k= \min_{B \ \text{ non-terminal node of} \ T_k}g(B)\).
From top to bottom in \(T_k\) prune all non-terminal nodes with \(g(B)=\alpha_k\) and call the resulting tree \(T_{k+1}\).
If \(T_{k+1}\) has more than one node, set \(k \leftarrow k+1\) and go to step 3.
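One weakest-link step can be sketched on a small hand-made tree with assumed node losses \(Q_n\) (all numbers hypothetical):

```python
# Nodes as ("name", Q_n(node), [children]); leaves have an empty child list.

def leaves_of(node):
    name, q, kids = node
    return [node] if not kids else [l for k in kids for l in leaves_of(k)]

def internal_nodes(node):
    name, q, kids = node
    return [] if not kids else [node] + [n for k in kids for n in internal_nodes(k)]

def g(node):
    # g(B) = (Q_n(B) - Q_n(S_B)) / (|S_B| - 1), Q_n(S_B) summed over the
    # leaves of the branch rooted at B, |S_B| = number of those leaves
    br_leaves = leaves_of(node)
    q_branch = sum(l[1] for l in br_leaves)
    return (node[1] - q_branch) / (len(br_leaves) - 1)

# hypothetical fitted tree: root, two internal children, four leaves
tree = ("root", 10.0, [
    ("L", 4.0, [("LL", 1.0, []), ("LR", 2.0, [])]),
    ("R", 5.0, [("RL", 2.0, []), ("RR", 2.5, [])]),
])

scores = {n[0]: g(n) for n in internal_nodes(tree)}
alpha_1 = min(scores.values())            # the next alpha_k
weakest = min(scores, key=scores.get)     # the branch pruned in this step
```

Here the branch rooted at `"R"` barely improves on collapsing it to a leaf, so it has the smallest \(g\) and is pruned first.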
Problem with decision trees
There are (at least) two major problems with decision trees
Instability: Small variations in the data can lead to a very different tree
Performance: The performance (measured via test error) is usually much weaker compared to other learning algorithms
Below you can see 9 pruned decision trees (optimal \(\alpha\) determined via cross-validation) where data is different each time but from the same generating mechanism:
Bagging stands for Bootstrap aggregation.
Idea: Generate artificial data via bootstrap, build a decision tree for each data-set and average.
Hope: It reduces the variance of decision trees.
Definition (Bagging)
Given data \(\mathcal D_n=\{(X_1,Y_1),\dots, (X_n,Y_n)\}\), draw \(B\) bootstrap samples \(\tilde {\mathcal D}^{(b)}_n=\{(\tilde X^{(b)}_1,\tilde Y^{(b)}_1),\dots, (\tilde X^{(b)}_n,\tilde Y^{(b)}_n)\}\), \(b=1,\dots, B\), i.e., \(\tilde {\mathcal D}^{(b)}_n\) arises from \(\mathcal D_n\) by drawing \(n\) times with replacement.
The bagging estimator is defined as \[ \hat m^{bagg}_n(x) = \frac 1 B \sum_{b=1}^B \hat m^{(b)}_n(x), \] where \(\hat m^{(b)}_n\) is the tree estimator built on \(\tilde{\mathcal D}^{(b)}_n\).
Note that \(\hat m^{bagg}_n\) is a stochastic estimator, even given \(\mathcal D_n\).
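A minimal sketch of the bagging recipe, using depth-0 trees (a single leaf predicting the sample mean) as stand-ins for real tree fits so the example stays short; data and all names are hypothetical:

```python
import random
import statistics

random.seed(0)
# toy data (x, y) with y roughly linear in x plus noise
data = [(float(x), 2.0 * x + random.gauss(0, 1)) for x in range(20)]

def fit_tree(sample):
    # stand-in for a tree fit: a depth-0 tree, i.e. one leaf
    # predicting the mean response of the (bootstrap) sample
    m = statistics.mean(y for _, y in sample)
    return lambda x: m

B = 50
fits = [fit_tree(random.choices(data, k=len(data)))  # n draws with replacement
        for _ in range(B)]

def m_bagg(x):
    # the bagging estimator: average of the B bootstrap fits
    return sum(f(x) for f in fits) / B
```

Swapping `fit_tree` for a real (non-pruned) tree learner gives the usual bagging estimator; the aggregation step stays identical.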
Note: Usually, bagging is applied on non-pruned trees. (See it as an alternative way to reduce variance)
While bagging improves performance of decision trees, interpretability is worsened.
If interpretability is not a concern, bagging is still often inferior to other learning techniques.
Random forests are a modification of bagging estimators.
Definition (Random forest)
Like bagging, random forests are an ensemble of decision trees.
They add the following modification to bagging:
We denote the random forest estimator by \(\hat m^{rf}_n\). Note that \(\hat m^{rf}_n\), as \(\hat m^{bagg}_n\), is a stochastic estimator, even given \(\mathcal D_n\).
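The defining modification is the \(\texttt{mtry}\) parameter: each split search is restricted to a random subset of \(\texttt{mtry}\) of the \(p\) features, which decorrelates the trees. A sketch of such a split search (function names and data hypothetical):

```python
import random

def sse(ys):
    # squared loss of a node: sum of squared deviations from the node mean
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split_mtry(data, mtry, rng=random):
    # data: list of (feature_vector, y). Unlike plain bagging/CART, only a
    # random subset of `mtry` coordinates is searched at this split.
    p = len(data[0][0])
    features = rng.sample(range(p), mtry)
    best = None
    for j in features:
        xs = sorted(set(v[j] for v, _ in data))
        for a, b in zip(xs, xs[1:]):
            s = (a + b) / 2
            q = (sse([y for v, y in data if v[j] <= s])
                 + sse([y for v, y in data if v[j] > s]))
            if best is None or q < best[0]:
                best = (q, j, s)
    return best  # (loss, feature index, threshold)

# toy data: feature 0 carries the signal, feature 1 is constant
data = ([((float(i), 0.0), 0.0) for i in range(5)]
        + [((float(i), 0.0), 1.0) for i in range(5, 10)])
q, j, s = best_split_mtry(data, mtry=2)
```

With `mtry` smaller than `p`, different trees are forced to use different features near the root, which is exactly what lowers the pairwise correlation \(\rho\) discussed next.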
Why is \(\texttt{mtry}\) so useful?
Assume that we are given \(l\) estimators \(\hat m_n(\mathcal D_n, U_1), \dots, \hat m_n(\mathcal D_n, U_l)\).
Here, \(U_1,\dots, U_l\) are iid variables inducing the additional stochasticity of the estimators (e.g. through random forest or bagging).
Assume that \(\hat m_n(\mathcal D_n, U_1), \dots, \hat m_n(\mathcal D_n, U_l)\) have pairwise correlation \(\rho\) and variance \(\sigma^2\).
We get \[\begin{aligned} \textrm{Var}\left (\frac 1 l \sum_{j=1}^l \hat m_n(\mathcal D_n, U_j)\right)&= \frac 1 {l^2} \sum_{j,k} \textrm{Cov}(\hat m_n(\mathcal D_n, U_j),\hat m_n(\mathcal D_n, U_k))\\&=\frac 1 {l^2} \left( l\sigma^2+ (l^2-l)\rho\sigma^2\right)\\&= \rho \sigma^2 + \frac{1-\rho}{l}\sigma^2 \end{aligned}\]
Meaning: the variance of an ensemble of trees decreases as the correlation between the trees decreases; for large \(l\) it approaches the floor \(\rho\sigma^2\).
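A quick numeric sanity check of the algebra above (arbitrary illustrative values for \(\sigma^2\), \(\rho\), \(l\)):

```python
# check: (1/l^2) (l*sigma^2 + (l^2 - l)*rho*sigma^2)
#        == rho*sigma^2 + ((1 - rho)/l)*sigma^2
sigma2, rho, l = 2.0, 0.3, 10
raw = (l * sigma2 + (l * l - l) * rho * sigma2) / l ** 2
simplified = rho * sigma2 + (1 - rho) / l * sigma2
floor = rho * sigma2   # limit as l -> infinity: the correlation sets the floor
```

The second term vanishes as \(l\) grows, so only decorrelating the trees (smaller \(\rho\)) can push the ensemble variance below \(\rho\sigma^2\).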
Decision trees are not good with additive functions (example)