Lecture 5: Trees and forests

Munir Hiabu

Trees and forests

  • In this lecture we will look at the nonparametric regression problem with squared loss \(L(y_1,y_2)=(y_1-y_2)^2\) and the classification problem with binary loss \(L(y_1,y_2)=\mathbb 1(y_1\neq y_2)\) or squared loss.

  • We already know that the Bayes rule is

    • \(m^\ast(x)=\mathbb E[Y|X=x]\) for the squared loss,
    • \(m^\ast(x)=\underset{k=1,\dots,K}{\operatorname{argmax}} \mathbb P(Y=k|X=x)\) for the binary loss.

Trees and forests Decision trees

  • Decision trees are estimators based on a partition of the feature space \(\mathcal X\). (The figure shows a tree fit to data with regression function \(m^\ast(x)=\mathbb E[Y|X=x]=\sin(2\pi x_1)\).)

  • The tree estimate is a piecewise constant function.

  • The constant areas of the tree estimate are called leaves or terminal nodes.

  • The leaves partition the feature space \(\mathcal X:\)

    • the union of all leaves is equal \(\mathcal X\).
    • leaves are non-overlapping, i.e., their intersection is empty.
  • The value within a leaf is given by averaging the responses (regression with \(L_2\) loss) or by majority vote (classification with binary loss).
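The two points above can be sketched in a few lines of Python. The partition below is chosen by hand purely for illustration (in practice the leaves are learned), and the data-generating model is made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-d data from a hypothetical model Y = sin(2*pi*X) + noise.
X = rng.uniform(0, 1, 200)
Y = np.sin(2 * np.pi * X) + rng.normal(0, 0.3, 200)

# A hand-chosen partition of the feature space [0, 1] into four leaves:
# their union is [0, 1] and their intersections are empty.
leaves = [(0.0, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 1.0)]

def tree_estimate(x, X, Y, leaves):
    """Piecewise-constant estimate: average of the responses in x's leaf."""
    for lo, hi in leaves:
        if lo <= x < hi or (hi == 1.0 and x == 1.0):
            mask = (X >= lo) & (X < hi)
            return Y[mask].mean()
    raise ValueError("x outside feature space")

# Close to the average of sin(2*pi*x) over the first leaf.
print(tree_estimate(0.1, X, Y, leaves))
```

Every \(x\) in the same leaf gets the same prediction, which is what makes the estimate piecewise constant.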

Trees and forests Tree representation

  • Decision trees are popular because the decision making (the estimate) is easy to visualize and interpret if the tree is not too deep:

  • The left and right pictures below are two different representations of the same tree estimate.

  • The interior vertices on the right plot are called nodes and can also be interpreted as subsets of \(\mathcal X\).

Trees and forests Tree: Classification example

  • Decision trees can be quite efficient in classification tasks where we are interested in 0-1 (binary) decisions.
    • This is the case in many applications but is often not the case in insurance.

Trees and forests Definition of a tree

Definition (Tree)

Let \((\mathcal X, \mathcal F)\) and \((\mathcal Y, \mathcal E)\) be measurable spaces and consider the supervised learning problem corresponding to the product space \((\mathcal X \times \mathcal Y, \mathcal F \otimes \mathcal E, P)\) where \(P\) is some probability measure.
A tree model \(T\) consists of some finite system \(\mathcal B \subseteq \mathcal F\) of \(\mathcal F\)-measurable sets and a function \(f: \mathcal B \smallsetminus \{\mathcal X\} \rightarrow \mathcal B\) that fulfill the following conditions:

  • \(\mathcal X \in \mathcal B\) (Root node)
  • \(\forall B_1, B_2 \in \mathcal B : B_1 \cap B_2 = \emptyset \vee B_1 \subseteq B_2 \vee B_2 \subseteq B_1\) (Nodes cannot partially overlap)
  • \(\forall B_1, B_2 \in \mathcal B : B_1 \subseteq B_2 \Rightarrow B_2 \smallsetminus B_1 \in \mathcal B\) (Splitting a node always produces two nodes)
  • \[ \forall B \in \mathcal B \exists d_B \in \mathbb N_{0} : f^{d_B}(B) := \underbrace{f \circ \cdots \circ f}_{d_B \text{ times}}(B) = \mathcal X \] We say that \(d_B\) is the depth of the node \(B\).

Trees and forests Definition of a tree

Definition (Leaves)

Given a tree model \(T = (\mathcal B, f)\), we say that \[ \mathcal A = \mathcal A(T) = \{A \in \mathcal B \mid f^{-1}(\{A\}) = \emptyset\} \] is the set of leaves (or terminal nodes) for the tree.
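The leaf criterion \(f^{-1}(\{A\})=\emptyset\) and the depth \(d_B\) translate directly into code, interpreting \(f\) as the map sending each non-root node to its parent. A small sketch with a hypothetical tree model on \(\mathcal X=\{1,\dots,8\}\):

```python
# A tiny tree model (B, f) on X = {1,...,8}; the node names and the
# particular splits are hypothetical, chosen only for illustration.
ROOT = frozenset(range(1, 9))
L_ = frozenset({1, 2, 3, 4})
R_ = frozenset({5, 6, 7, 8})
LL = frozenset({1, 2})
LR = frozenset({3, 4})

B = {ROOT, L_, R_, LL, LR}
f = {L_: ROOT, R_: ROOT, LL: L_, LR: L_}   # f maps each non-root node to its parent

# A node A is a leaf iff f^{-1}({A}) is empty, i.e. A is nobody's parent.
leaves = {A for A in B if A not in f.values()}

def depth(A):
    """d_A: number of applications of f needed to reach the root."""
    d = 0
    while A != ROOT:
        A = f[A]
        d += 1
    return d

print(sorted(sorted(A) for A in leaves))
print(depth(LL))  # 2
```

Here \(R\), \(LL\), and \(LR\) come out as the leaves: they are in nobody's image under \(f\).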

Trees and forests Definition of a tree

Definition (Tree estimator)

  • \(T\) tree model
  • Data \(\mathcal D_n = (X_i, Y_i)_{i = 1, \ldots, n}\).

In the case of regression, the empirical squared-loss-minimizing decision function produced by the tree model is \[ \hat{m}_n(x) = \sum_{A \in \mathcal A} \mathbb 1\{x \in A\}\left(\frac{\sum_{i=1}^n \mathbb 1\{X_i \in A\}Y_i}{\sum_{i=1}^n \mathbb 1\{X_i \in A\}}\right). \] In the case of classification, the empirical classification-error-minimizing decision function is \[ \hat{m}_n(x) = \sum_{A \in \mathcal A} \mathbb 1\{x \in A\}\cdot \textrm{argmax}_{y \in \mathcal Y}\left(\sum_{i=1}^n \mathbb 1\{X_i \in A\}\mathbb 1\{Y_i = y\}\right). \]

Remark

Note that a decision function stemming from a decision tree is uniquely defined by the leaves of the tree.
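Both formulas can be sketched directly. The leaves below are hypothetical membership functions on \(\mathcal X=\mathbb R\), and the data are made up:

```python
import numpy as np
from collections import Counter

# Leaves given as boolean membership functions on X = R
# (a hand-written, hypothetical partition of the real line).
leaves = [lambda x: x < 0.0, lambda x: 0.0 <= x < 1.0, lambda x: x >= 1.0]

X = np.array([-1.5, -0.2, 0.3, 0.7, 1.2, 2.0])
Y_reg = np.array([2.0, 4.0, 1.0, 3.0, 5.0, 7.0])
Y_cls = np.array(["a", "a", "b", "b", "a", "a"])

def m_hat_regression(x):
    # Average of the responses falling in x's leaf (empirical L2 minimizer).
    for A in leaves:
        if A(x):
            in_A = np.array([A(xi) for xi in X])
            return Y_reg[in_A].mean()

def m_hat_classification(x):
    # Majority vote in x's leaf (empirical 0-1 loss minimizer).
    for A in leaves:
        if A(x):
            votes = Counter(Y_cls[np.array([A(xi) for xi in X])])
            return votes.most_common(1)[0][0]

print(m_hat_regression(0.5))      # mean of {1.0, 3.0} = 2.0
print(m_hat_classification(1.5))  # "a"
```

As the remark says, only the leaves matter here: the interior nodes never appear in either decision function.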

Trees and forests How to learn a decision tree: CART

  • A decision tree is uniquely defined by its leaves.

  • The naive approach of searching directly for the set of leaves that minimizes a given loss (ignoring the tree structure) is computationally infeasible.

  • Instead: top-down greedy approach, called CART (classification and regression trees).

  • We now describe how to construct nodes via the CART algorithm.

CART (heuristic)

  • Start with the whole space \(\mathcal X\) (=root node).
  • Pick variable \(j=1,\dots,p\) and split-point \(c \in \textrm{supp}(X_j)\) that gives the best loss improvement.
  • Split \(\mathcal X\) into two nodes according to variable and split point.
  • Next, if no stopping rule applies, both nodes (otherwise only one or none) are split into two further nodes.
  • Continue until stopping rule applies to all nodes.

Trees and forests How to learn a decision tree: CART

CART algorithm (start)

Start with \(\mathcal X\) as initial node for splitting.

CART algorithm (splitting)

  • Input: node \(B\subseteq \mathcal X\)
  • For every dimension \(j=1,\dots,p\) and every point \(s_j\in \{x_j \mid \exists\ x_1,\dots,x_{j-1},x_{j+1},\dots,x_p\ \text{with}\ (x_1,\dots,x_p)\in B\}\) define \[B_1(j,s_j)=\{(x_1,\dots,x_p)\in B \mid x_j \leq s_j\}, \quad B_2(j,s_j)=\{(x_1,\dots,x_p)\in B \mid x_j > s_j\}.\]
  • We pick the minimizer \[ (j^\ast,s^\ast)= \arg\min_{j,s_j} \left\{ Q_n(B_1(j,s_j)) + Q_n(B_2(j,s_j)) \right\} \]
  • Each of the new nodes \(B_1(j^\ast,s^\ast),B_2(j^\ast,s^\ast)\) is further split as long as the stopping criterion does not apply.
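A brute-force sketch of this split search with squared-error loss (the toy data are hypothetical; real implementations sort each coordinate instead of recomputing sums):

```python
import numpy as np

def Q(y):
    """Squared loss of a node: sum of squared deviations from the node mean."""
    return 0.0 if len(y) == 0 else float(((y - y.mean()) ** 2).sum())

def best_split(X, y):
    """Exhaustive CART split search: best (j*, s*) over all coordinates j
    and all observed split points s_j, minimizing Q(B1) + Q(B2)."""
    best = (None, None, np.inf)
    n, p = X.shape
    for j in range(p):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            q = Q(left) + Q(right)
            if q < best[2]:
                best = (j, s, q)
    return best

# Step function in x_1: the optimal split separates the node at x_1 <= -1.
X = np.array([[-2.0, 1.0], [-1.0, -1.0], [1.0, 0.5], [2.0, -0.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
j_star, s_star, q_star = best_split(X, y)
print(j_star, s_star, q_star)  # 0 -1.0 0.0
```

The same skeleton covers classification: only `Q` changes, as the next slide describes.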

Trees and forests How to learn a decision tree: CART

CART Algorithm: the loss \(Q\)

For regression the most common loss function is the squared loss:

  • Squared loss: \(Q_n(B_l)= \sum_{i: X_i \in B_l} (Y_i - \overline Y(B_l))^2\)
    • \(\overline Y(B_l)= \frac 1 {|B_l|}\sum_{i: X_i \in B_l}Y_i,\) \(|B_l|=\#\{i: X_i\in B_l\}\).

In the classification case, define \(\hat p_{lk}= \frac 1 {|B_l|} \sum_{i: X_i \in B_l} \mathbb 1(Y_i=k)\) (= the proportion of class-\(k\) observations in \(B_l\)).

Common loss functions are:

  • Brier score: \(Q_n(B_l)= \sum_{i: X_i \in B_l} (Y_i - \overline Y(B_l))^2\) (=squared loss)
  • Misclassification error: \(Q_n(B_l)= 1 - \max_{k}\hat p_{lk}\)
  • Gini index: \(Q_n(B_l)= \sum_{k=1}^K \hat p_{lk}(1-\hat p_{lk})\)
  • Entropy : \(Q_n(B_l)=-\sum_{k=1}^K \hat p_{lk}\log\hat p_{lk}\)
  • Note: In classification, even if binary loss is the ultimate goal, Gini index and entropy might be the better choices for splitting.
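The three classification criteria, evaluated on a hypothetical node (unlike the misclassification error, Gini and entropy react to any change in the class proportions, which is what makes them more sensitive split criteria):

```python
import numpy as np

def node_proportions(y_node, classes):
    """p_hat_{lk}: proportion of each class in the node."""
    return np.array([np.mean(y_node == k) for k in classes])

def misclassification(p):
    return 1.0 - p.max()

def gini(p):
    return float((p * (1.0 - p)).sum())

def entropy(p):
    p = p[p > 0]                      # convention: 0 * log 0 = 0
    return float(-(p * np.log(p)).sum())

# A hypothetical node with 6 observations of class 0 and 2 of class 1.
y_node = np.array([0, 0, 0, 0, 0, 0, 1, 1])
p = node_proportions(y_node, classes=[0, 1])
print(misclassification(p))  # 0.25
print(gini(p))               # 2 * 0.75 * 0.25 = 0.375
print(entropy(p))            # -(0.75 ln 0.75 + 0.25 ln 0.25), about 0.562
```

All three vanish for a pure node and are maximal when the classes are evenly mixed.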

Trees and forests How to learn a decision tree: CART

CART algorithm (stopping criterion)

  • The most common stopping criterion is specifying a min-node size, i.e. stop if \(|B_j|<c\). A common choice is \(c=5\).

  • An alternative stopping criterion is the depth of \(B_j\), i.e., the number of ancestor nodes of \(B_j\).

Trees and forests How to learn a decision tree: CART, Pruning

  • The process described so far will likely lead to overfitting.

  • One may think that a way out is to stop growing the tree early.

    • This is not advisable: a split that yields little immediate improvement in the loss may still enable a very effective split later on.
  • In the problem below, the first split will not lead to big improvements in the loss. We will need at least 2 (probably more) splits until we arrive at a good decision rule.

Trees and forests How to learn a decision tree: CART, Pruning

  • The idea of pruning is to let a tree grow very deep first (= small min node size) and then select a subtree (pruning) as final fit.

  • Idea: We can compare different subtrees via a penalized loss.

Definition (Subtree)

Consider a tree model \(T = (\mathcal B, f)\). We say that a tree model \(T' = (\mathcal B', f')\) is a sub-tree if \(\mathcal B' \subseteq \mathcal B\) and \(f' = f\) on \(\mathcal B' \smallsetminus \{\mathcal X\}\). Note that we may abuse notation and write \((\mathcal B', f)\) instead of \((\mathcal B', f')\).

  • We can derive a subtree by pruning a tree at any non-terminal node (i.e. by deleting all descendants of that node).

Trees and forests How to learn a decision tree: CART, Pruning

Definition (\(\alpha\)-pruned tree)

Consider a tree \(\hat T_n\) with corresponding estimator \(\hat m_n^{\hat T_n}\). Given a parameter \(\alpha>0\), the \(\alpha\)-pruned tree is \[ \hat T_{n,\alpha} := \arg \min_{T\subseteq \hat T_n}\{\hat R_n(\hat m_n^{T}) + \alpha |T|\}, \] where \(\hat m_n^{T}\) is the tree estimator corresponding to \(T\) and \(|T|\) is the number of leaves of \(T\).

  • How do we find an optimal \(\alpha\) (and the optimal subtree for a fixed \(\alpha\))?
    • Brute-force cross-validation is too computationally intensive, even if we know how to find the optimal subtree for a fixed \(\alpha\).
    • The answer to both questions is the weakest-link algorithm.

Trees and forests How to learn a decision tree: CART, Pruning

Proposition (weakest link, see Breiman (1984))

\(\hat T_{n,\alpha}\) is unique. Furthermore, there are trees \(T_0\supset \cdots\supset T_l\) with \(\{T_0,\dots, T_l\}=\{\hat T_{n,\alpha} : \alpha>0\}\); in particular, the set \(\{\hat T_{n,\alpha} : \alpha>0\}\) is finite. Let \(Q_n\) be the loss function corresponding to \(\hat R_n\). The trees \(T_0,\dots, T_l\) can be found with the following algorithm:

  1. Let \(B_L,B_R\) be any two terminal nodes in \(\hat T_n\) resulting from a split of the immediate ancestor node \(t\). If \(Q_n(t) =Q_n(B_L)+Q_n(B_R)\), prune off \(B_L\) and \(B_R\). Continue this process until no more pruning is possible. The resulting tree is \(T_0\), \(\alpha_0=0\).

  2. \(k=0\)

  3. Go through all non-terminal nodes \(B\) of \(T_k\) and calculate \[ g(B)=\frac{Q_n(B)-Q_n(S_B)}{|S_B|-1}, \] where \(S_B\) is the tree (branch) with root \(B\), consisting of \(B\) and the descendants of \(B\) in \(T_k\), \(|S_B|\) is the number of leaves of \(S_B\), and \(Q_n(S_B)=\sum_{A \ \text{leaf of}\ S_B} Q_n(A)\).

  4. \(\alpha_k= \min_{B \ \text{ non-terminal node of} \ T_k}g(B)\).

  5. From top to bottom in \(T_k\) prune all non-terminal nodes with \(g(B)=\alpha_k\) and call the resulting tree \(T_{k+1}\).

  6. If \(T_{k+1}\) has more than one node, set \(k \leftarrow k+1\) and go to step 3.

Trees and forests How to learn a decision tree: CART, Pruning

  • \(g(B)\) is the loss increase per removed leaf when pruning at that particular node.
  • If we were to prune all descendants of some node \(B\),
    • the unpenalized loss would increase by exactly \(g(B) \cdot (|\{A \mid A \ \text{is a leaf of the branch rooted at} \ B\}| - 1)\)
    • and the penalty term would decrease by \(\alpha \cdot (|\{A \mid A \ \text{is a leaf of the branch rooted at} \ B\}| - 1)\).
    • In total the penalized loss changes by \((g(B)-\alpha)\cdot (|\{A \mid A \ \text{is a leaf of the branch rooted at} \ B\}| - 1)\).
  • If \(\alpha\) is smaller than all \(g(B)\), nothing needs to be pruned.
  • The weakest-link algorithm prunes at the node with the smallest \(g(B)\), and the corresponding \(\alpha\) is set to that \(g(B)\).

Trees and forests How to learn a decision tree: CART, Pruning

  • Given the trees \(\{T_0,\dots, T_l\}=\{\hat T_{n,\alpha} : \alpha>0\}\), we can pick the optimal tree via cross-validation.
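In scikit-learn, weakest-link (minimal cost-complexity) pruning is exposed through `cost_complexity_pruning_path` and the `ccp_alpha` parameter; the sketch below selects \(\alpha\) by cross-validation over the resulting finite candidate set. The simulated data and all settings are illustrative only:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(120, 2))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, 120)

# Grow a deep tree and compute the weakest-link sequence of alphas;
# each alpha corresponds to one subtree in T_0 ⊃ T_1 ⊃ ...
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# Pick alpha from the finite candidate set by 5-fold cross-validation.
scores = [
    cross_val_score(
        DecisionTreeRegressor(ccp_alpha=a, random_state=0), X, y, cv=5
    ).mean()
    for a in path.ccp_alphas
]
best_alpha = float(path.ccp_alphas[int(np.argmax(scores))])
print(best_alpha)
```

The finiteness of \(\{\hat T_{n,\alpha} : \alpha > 0\}\) is exactly what makes this grid of candidate \(\alpha\)'s exhaustive.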

Problem with decision trees

There are (at least) two major problems with decision trees

  • Instability: Small variations in the data can lead to a very different tree

  • Performance: The performance (measured via test error) is usually much weaker compared to other learning algorithms

Trees and forests Problem of trees: Instability

  • Below you can see 9 pruned decision trees (optimal \(\alpha\) determined via cross-validation), where the data are drawn anew each time from the same generating mechanism:

    • Estimates are based on 300 iid observations with \(Y\) given by \[ \textrm{sign}\left\{(\mathbb 1[(X_1>0.5\&X_2>0.5) \ \text{or} \ (X_1<0.5\&X_2<0.5)]\times2-1) \times \varepsilon\right\}, \] \(X_1\sim N(0,1), X_2\sim N(0,1), \varepsilon \sim N(0,0.5^2)\).

Trees and forests Bagging

  • Bagging stands for Bootstrap aggregation.

  • Idea: Generate artificial data via bootstrap, build a decision tree for each data-set and average.

  • Hope: It reduces the variance of decision trees.

Definition (Bagging)

  • Given data \(\{(X_1,Y_1),\dots, (X_n,Y_n)\}\), draw \(B\) bootstrap samples \(\tilde {\mathcal D}^{(b)}_n=\{(\tilde X^{(b)}_1,\tilde Y^{(b)}_1),\dots, (\tilde X^{(b)}_n,\tilde Y^{(b)}_n)\}\), \(b=1,\dots, B\), i.e., \(\tilde {\mathcal D}^{(b)}_n\) arises from \(\mathcal D_n\) by drawing \(n\) times with replacement.

  • The bagging estimator is defined as

    • squared loss \(\rightarrow\) average: \(\hat m^{bagg}_n=\frac 1 B \sum_{b=1}^B \hat m_n(\tilde {\mathcal D}^{(b)}_n)\)
    • binary loss \(\rightarrow\) majority vote: \(\hat m^{bagg}_n(x)=\arg \max_{k\in \{1,\dots, K\}} \#\{b: \hat m_n(\tilde {\mathcal D}^{(b)}_n) (x)=k\}\)
  • Note that \(\hat m^{bagg}_n\) is a stochastic estimator, even given \(\mathcal D_n\).

  • Note: Usually, bagging is applied on non-pruned trees. (See it as an alternative way to reduce variance)
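A minimal sketch of the definition for regression, using unpruned scikit-learn trees as base learners on hypothetical simulated data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, y, B, rng):
    """Fit one unpruned tree per bootstrap sample (minimal bagging sketch)."""
    trees = []
    n = len(y)
    for _ in range(B):
        idx = rng.integers(0, n, size=n)   # draw n times with replacement
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X_new):
    # Squared loss -> average the individual tree predictions.
    return np.mean([t.predict(X_new) for t in trees], axis=0)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, 200)

trees = bagging_fit(X, y, B=25, rng=rng)
pred = bagging_predict(trees, X[:5])
print(pred.shape)  # (5,)
```

The extra randomness from `rng` is exactly why \(\hat m^{bagg}_n\) is a stochastic estimator even for fixed \(\mathcal D_n\).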

Trees and forests Random forests

  • While bagging improves performance of decision trees, interpretability is worsened.

  • If interpretability is not a concern, bagging is still inferior to other learning techniques.

  • Random forests are a modification on bagging estimators.

Definition (Random forest)

  • Like bagging, random forests are an ensemble of decision trees.

  • They add the following modification to bagging:

    • \(\texttt{mtry}\): Each time a node is split, only a random subset of \(\texttt{mtry}\) variables may be used. I.e., before each split, draw \(1\leq\texttt{mtry}<p\) variables at random from \(\{1,\dots,p\}\); only the drawn variables are considered for that split.
  • We denote the random forest estimator by \(\hat m^{rf}_n\). Note that \(\hat m^{rf}_n\), as \(\hat m^{bagg}_n\), is a stochastic estimator, even given \(\mathcal D_n\).
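In scikit-learn's random forest, the `max_features` parameter plays the role of \(\texttt{mtry}\); a sketch on hypothetical simulated data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 5))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, 200)

# max_features = mtry: at each split only a random subset of that many
# variables is eligible for the split search.
rf = RandomForestRegressor(
    n_estimators=100,
    max_features=2,        # mtry = 2 of the p = 5 variables per split
    min_samples_leaf=5,    # min node size, as for regression trees
    random_state=0,
).fit(X, y)

pred = rf.predict(X[:3])
print(pred)
```

Fixing `random_state` freezes the extra stochasticity; without it, refitting on the same \(\mathcal D_n\) gives a (slightly) different estimator.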

Trees and forests Random forests

Why is \(\texttt{mtry}\) so useful?

  • Assume that we are given \(l\) estimators \(\hat m_n(\mathcal D_n, U_1), \dots, \hat m_n(\mathcal D_n, U_l)\).

  • Here, \(U_1,\dots, U_l\) are iid variables inducing the additional stochasticity of the estimators (e.g. through random forest or bagging).

  • Assume that \(\hat m_n(\mathcal D_n, U_1), \dots \hat m_n(\mathcal D_n, U_l)\) have pairwise correlation \(\rho\) and variance \(\sigma^2\).

  • We get \[\begin{aligned} \textrm{Var}\left (\frac 1 l \sum_{j=1}^l \hat m_n(\mathcal D_n, U_j)\right)&= \frac 1 {l^2} \sum_{j,k} \textrm{Cov}(\hat m_n(\mathcal D_n, U_j),\hat m_n(\mathcal D_n, U_k))\\&=\frac 1 {l^2} \left( l\sigma^2+ (l^2-l)\rho\sigma^2\right)\\&= \rho \sigma^2 + \frac{1-\rho}{l}\sigma^2 \end{aligned}\]

  • Meaning: the smaller the correlation between the trees, the smaller the variance of the ensemble.

    • The \(\texttt{mtry}\) parameter aims to make trees less alike and hence more uncorrelated.
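A quick numerical reading of the formula: the \((1-\rho)\sigma^2/l\) term vanishes as the number of trees \(l\) grows, leaving the floor \(\rho\sigma^2\), which only decorrelating the trees can lower. The values \(\rho=0.3\), \(\sigma^2=1\) below are just an illustration:

```python
def ensemble_variance(rho, sigma2, l):
    """Variance of the average of l estimators with common variance sigma2
    and pairwise correlation rho."""
    return rho * sigma2 + (1.0 - rho) / l * sigma2

sigma2, rho = 1.0, 0.3
for l in (1, 10, 100, 10_000):
    print(l, ensemble_variance(rho, sigma2, l))
# The variance drops from sigma2 = 1.0 toward the floor rho * sigma2 = 0.3.
```

Adding more trees beyond a point buys almost nothing; shrinking \(\rho\) (via \(\texttt{mtry}\)) moves the floor itself.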

Trees and forests Random forests: Pros and Cons

  • Random forests are competitive with many state-of-the-art learners.
  • While not achieving “best performance” in terms of accuracy, they are not far behind.
  • Advantages of random forests are
    • Default hyperparameters already lead to very strong results (i.e. without tuning hyperparameter)
      • \(\texttt {mtry}=\lfloor \sqrt{p}\rfloor\)
      • min node size= 1 for classification, 5 for regression
    • Can be (and is) implemented very efficiently
  • Random forests (and also single trees) don’t perform well for additive functions

Decision trees are not good with additive functions (example)

  • Consider \(m^\ast(x) = \sum_{j=1}^p \mathbb 1(x_j \leq 0)\) for large \(p\).
  • For a perfect fit, a tree algorithm would have to grow a tree with depth \(p\), where each leaf is the result of splitting once with respect to each covariate.
  • Hence, we end up with \(2^p\) leaves, which on average contain \(n/(2^p)\) data points.
  • Even if all splits are optimal, for \(2^p>n\) the tree cannot produce a perfect decision rule.
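The arithmetic behind the last two points, for an illustrative sample size of \(n = 100{,}000\):

```python
n = 100_000  # hypothetical sample size
for p in (5, 10, 17, 25):
    leaves = 2 ** p  # leaves needed for a perfect fit of the additive m*
    print(f"p={p:2d}  leaves={leaves:,}  avg points per leaf={n / leaves:g}")
```

Already at \(p = 17\) there are more leaves than observations, so most leaves are empty and the tree cannot recover \(m^\ast\).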

Trees and forests Random forests: Pros and Cons

  • While predictions of random forests are relatively smooth, they are not monotonic.
    • e.g. in car insurance, if everything else is the same, more mileage should lead to higher insurance price.
  • Random forests deal well with sparsity, i.e., when the number of features \(p\) is large but the number of relevant features is small: \(s \ll p\).

References

Breiman, Leo. 1984. Classification and Regression Trees. Routledge.