In this lecture we look at the nonparametric regression problem with squared loss \(L(y_1,y_2)=(y_1-y_2)^2\) and at the classification problem with the binary (0-1) loss \(L(y_1,y_2)=\mathbb 1(y_1\neq y_2)\) or squared loss.
We already know that the Bayes rule is the conditional mean \(m(x)=\mathbb E[Y\mid X=x]\) under squared loss, and the most probable class \(m(x)=\arg\max_{y\in\mathcal Y} P(Y=y\mid X=x)\) under binary loss.
The tree estimate is a piecewise constant function.
The constant areas of the tree estimate are called leaves or terminal nodes.
The leaves partition the feature space \(\mathcal X:\)
The value within a leaf is given by averaging the responses (regression with \(L_2\) loss) or by majority vote (classification with binary loss).
Decision trees are popular because the decision making (estimate) is nicely visualizable/interpretable if not too deep:
The left and right pictures below are two different representations of the same tree estimate.
The interior vertices of the right plot are called nodes; each node can also be interpreted as a subset of \(\mathcal X\).
Definition (Tree)
Let \((\mathcal X, \mathcal F)\) and \((\mathcal Y, \mathcal E)\) be measurable spaces and consider the supervised learning problem corresponding to the product space \((\mathcal X \times \mathcal Y, \mathcal F \otimes \mathcal E, P)\) where \(P\) is some probability measure.
A tree model \(T\) consists of some finite system \(\mathcal B \subseteq \mathcal F\) of \(\mathcal F\)-measurable sets and a function \(f: \mathcal B \smallsetminus \{\mathcal X\} \rightarrow \mathcal B\) that fulfill the following conditions:
Definition (Leaves)
Given a tree model \(T = (\mathcal B, f)\), we say that \[ \mathcal A = \mathcal A(T) = \{A \in \mathcal B \mid f^{-1}(\{A\}) = \emptyset\} \] is the set of leaves (or terminal nodes) for the tree.
Definition (Tree estimator)
In the case of regression, the empirical squared-loss-minimizing decision function produced by the tree model is \[ \hat{m}_n(x) = \sum_{A \in \mathcal A} \mathbb 1\{x \in A\}\left(\frac{\sum_{i=1}^n \mathbb 1\{X_i \in A\}Y_i}{\sum_{i=1}^n \mathbb 1\{X_i \in A\}}\right). \] In the case of classification, the empirical classification-error-minimizing decision function is \[ \hat{m}_n(x) = \sum_{A \in \mathcal A} \mathbb 1\{x \in A\}\cdot \textrm{argmax}_{y \in \mathcal Y}\left(\sum_{i=1}^n \mathbb 1\{X_i \in A\}\mathbb 1\{Y_i = y\}\right). \]
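As a toy illustration of the two formulas above (everything here is hypothetical: interval leaves partitioning \(\mathcal X=[0,1)\) and a small made-up training set), the tree estimator can be evaluated as:

```python
from collections import Counter

leaves = [(0.0, 0.5), (0.5, 1.0)]   # the partition A(T), as half-open intervals
X = [0.1, 0.2, 0.6, 0.7, 0.9]       # training features
Y = [1.0, 3.0, 4.0, 6.0, 5.0]       # training responses

def m_hat(x):
    # regression: average the responses of training points in x's leaf
    for a, b in leaves:
        if a <= x < b:
            ys = [y for xi, y in zip(X, Y) if a <= xi < b]
            return sum(ys) / len(ys)

def m_hat_class(x, labels):
    # classification: majority vote among training labels in x's leaf
    for a, b in leaves:
        if a <= x < b:
            votes = [l for xi, l in zip(X, labels) if a <= xi < b]
            return Counter(votes).most_common(1)[0][0]
```

For instance, `m_hat(0.3)` averages the responses in the leaf \([0,0.5)\), and the classification variant returns the most frequent label in that leaf.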
Remark
Note that a decision function stemming from a decision tree is uniquely defined by the leaves of the tree; in this sense, the tree estimate is determined by its leaves alone.
The naive approach of directly searching for the loss-minimizing set of leaves (ignoring the tree structure) is computationally infeasible.
Instead: top-down greedy approach, called CART (classification and regression trees).
We now describe how to construct nodes via the CART algorithm.
CART (heuristic)
CART algorithm (start)
Start with \(\mathcal X\) as initial node for splitting.
CART algorithm (splitting)
CART Algorithm: the loss \(Q\)
For regression the most common loss function is the squared loss:
In the classification case, define \(\hat p_{lk}= \frac 1 {|B_l|} \sum_{i: X_i \in B_l} \mathbb 1(Y_i=k)\) (= the proportion of class-\(k\) observations in \(B_l\)).
Common loss functions are:
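(The standard choices are the Gini index \(\sum_k \hat p_{lk}(1-\hat p_{lk})\), the cross-entropy \(-\sum_k \hat p_{lk}\log \hat p_{lk}\), and the misclassification error \(1-\max_k \hat p_{lk}\).) A small sketch computing these from the empirical proportions:

```python
import math
from collections import Counter

def p_hat(labels):
    # empirical class proportions (the \hat p_{lk} defined above) within a node
    n = len(labels)
    return {k: c / n for k, c in Counter(labels).items()}

def gini(labels):
    return sum(p * (1 - p) for p in p_hat(labels).values())

def cross_entropy(labels):
    return -sum(p * math.log(p) for p in p_hat(labels).values())

def misclassification(labels):
    return 1.0 - max(p_hat(labels).values())
```

All three vanish on a pure node and are maximal when the classes are evenly mixed; Gini and cross-entropy are the differentiable choices usually preferred for growing the tree.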
CART algorithm (stopping criterion)
The most common stopping criterion is specifying a min-node size, i.e. stop if \(|B_j|<c\). A common choice is \(c=5\).
An alternative stopping criterion is the depth of \(B_j\), i.e., the number of ancestor nodes of \(B_j\).
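A minimal sketch of the greedy CART recursion for one-dimensional regression with squared loss and min-node-size stopping (toy code, not a production implementation):

```python
def sse(ys):
    # squared loss Q_n of a node: sum of squared deviations from the node mean
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(data):
    # greedy step: scan all midpoints between distinct x-values and pick
    # the split minimizing the summed squared loss of the two children
    xs = sorted(set(x for x, _ in data))
    best = None
    for a, b in zip(xs, xs[1:]):
        s = (a + b) / 2
        q = sse([y for x, y in data if x <= s]) + sse([y for x, y in data if x > s])
        if best is None or q < best[0]:
            best = (q, s)
    return best

def grow(data, c=5):
    # stop if the node is smaller than the min-node size c (or cannot be split)
    ys = [y for _, y in data]
    if len(data) < c or len(set(x for x, _ in data)) == 1:
        return ("leaf", sum(ys) / len(ys))
    _, s = best_split(data)
    return ("split", s,
            grow([(x, y) for x, y in data if x <= s], c),
            grow([(x, y) for x, y in data if x > s], c))

def predict(tree, x):
    if tree[0] == "leaf":
        return tree[1]
    _, s, left, right = tree
    return predict(left, x) if x <= s else predict(right, x)

# toy step function: x in [0,5) -> 0, x in [5,10) -> 1
data = [(float(i), 0.0) for i in range(5)] + [(float(i), 1.0) for i in range(5, 10)]
tree = grow(data, c=5)
```

On the toy data the first greedy split lands at \(4.5\) and recovers the step function exactly.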
The process described so far will likely lead to overfitting.
One may think that one way out is to stop growing the tree early enough.
In the problem below, the first split will not lead to a big improvement in the loss. We will need at least two (probably more) splits before we arrive at a good decision rule.
The idea of pruning is to let a tree grow very deep first (= small min node size) and then select a subtree (pruning) as final fit.
Idea: We can compare different subtrees via a penalized loss.
Definition (Subtree)
Consider a tree model \(T = (\mathcal B, f)\). We say that a tree model \(T' = (\mathcal B', f')\) is a sub-tree if \(\mathcal B' \subseteq \mathcal B\) and \(f = f'\) on \(\mathcal B'\). Note that we may abuse the notation and write \((\mathcal B', f)\) instead of \((\mathcal B', f')\).
Definition (\(\alpha\)-pruned tree)
Consider a tree \(\hat T_n\) with corresponding estimator \(\hat m_n^{\hat T_n}\). Given a parameter \(\alpha>0\), the \(\alpha\)-pruned tree is \[ \hat T_{n,\alpha} := \arg \min_{T\subseteq \hat T_n}\{\hat R_n(\hat m_n^{T}) + \alpha |T|\}, \] where \(\hat m_n^T\) is the estimator corresponding to the subtree \(T\) and \(|T|\) is the number of leaves of \(T\).
Proposition (weakest link, see Breiman (1984))
\(\hat T_{n,\alpha}\) is unique. Furthermore, there are nested trees \(T_0\supset \cdots\supset T_l\) with \(\{T_0,\dots, T_l\}=\{\hat T_{n,\alpha} : \alpha>0\}\); in particular the set \(\{\hat T_{n,\alpha} : \alpha>0\}\) is finite. Let \(Q_n\) be the loss function corresponding to \(\hat R_n\). The trees \(T_0,\dots, T_l\) can be found with the following algorithm:
Let \(B_L,B_R\) be any two terminal nodes in \(\hat T_n\) resulting from a split of the immediate ancestor node \(t\). If \(Q_n(t) =Q_n(B_L)+Q_n(B_R)\), prune off \(B_L\) and \(B_R\). Continue this process until no more pruning is possible. The resulting tree is \(T_0\), \(\alpha_0=0\).
\(k=0\)
Go through all non-terminal nodes \(B\) of \(T_k\) and calculate \[ g(B)=\frac{Q_n(B)-Q_n(S_B)}{|S_B|-1}, \] where \(S_B\) is the tree (branch) with root \(B\), consisting of \(B\) and the descendants of \(B\) in \(T_k\), \(Q_n(S_B)=\sum_{A \ \text{leaf of}\ S_B} Q_n(A)\), and \(|S_B|\) is the number of leaves of \(S_B\).
\(\alpha_k= \min_{B \ \text{ non-terminal node of} \ T_k}g(B)\).
From top to bottom in \(T_k\) prune all non-terminal nodes with \(g(B)=\alpha_k\) and call the resulting tree \(T_{k+1}\).
If \(T_{k+1}\) has more than one node, set \(k \leftarrow k+1\) and go to step 3.
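One weakest-link step can be sketched on a small hand-made tree with assumed node losses \(Q_n\) (all numbers hypothetical):

```python
# Nodes as ("name", Q_n(node), [children]); leaves have an empty child list.

def leaves_of(node):
    name, q, kids = node
    return [node] if not kids else [l for k in kids for l in leaves_of(k)]

def internal_nodes(node):
    name, q, kids = node
    return [] if not kids else [node] + [n for k in kids for n in internal_nodes(k)]

def g(node):
    # g(B) = (Q_n(B) - Q_n(S_B)) / (|S_B| - 1), Q_n(S_B) summed over the
    # leaves of the branch rooted at B, |S_B| = number of those leaves
    br_leaves = leaves_of(node)
    q_branch = sum(l[1] for l in br_leaves)
    return (node[1] - q_branch) / (len(br_leaves) - 1)

# hypothetical fitted tree: root, two internal children, four leaves
tree = ("root", 10.0, [
    ("L", 4.0, [("LL", 1.0, []), ("LR", 2.0, [])]),
    ("R", 5.0, [("RL", 2.0, []), ("RR", 2.5, [])]),
])

scores = {n[0]: g(n) for n in internal_nodes(tree)}
alpha_1 = min(scores.values())            # the next alpha_k
weakest = min(scores, key=scores.get)     # the branch pruned in this step
```

Here the branch rooted at `"R"` barely improves on collapsing it to a leaf, so it has the smallest \(g\) and is pruned first.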
Problem with decision trees
There are (at least) two major problems with decision trees
Instability: Small variations in the data can lead to a very different tree
Performance: The performance (measured via test error) is usually much weaker compared to other learning algorithms
Below you can see 9 pruned decision trees (optimal \(\alpha\) determined via cross-validation) where data is different each time but from the same generating mechanism:
Bagging stands for Bootstrap aggregation.
Idea: Generate artificial data via bootstrap, build a decision tree for each data-set and average.
Hope: It reduces the variance of decision trees.
Definition (Bagging)
Given data \(\mathcal D_n=\{(X_1,Y_1),\dots, (X_n,Y_n)\}\), draw \(B\) bootstrap samples \(\tilde {\mathcal D}^{(b)}_n=\{(\tilde X^{(b)}_1,\tilde Y^{(b)}_1),\dots, (\tilde X^{(b)}_n,\tilde Y^{(b)}_n)\}\), \(b=1,\dots, B\), i.e., \(\tilde {\mathcal D}^{(b)}_n\) arises from \(\mathcal D_n\) by drawing \(n\) times with replacement.
The bagging estimator is defined as \[ \hat m^{bagg}_n(x) = \frac 1 B \sum_{b=1}^B \hat m^{(b)}_n(x), \] where \(\hat m^{(b)}_n\) is the tree estimator built on \(\tilde{\mathcal D}^{(b)}_n\).
Note that \(\hat m^{bagg}_n\) is a stochastic estimator, even given \(\mathcal D_n\).
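A minimal sketch of the bagging recipe, using depth-0 trees (a single leaf predicting the sample mean) as stand-ins for real tree fits so the example stays short; data and all names are hypothetical:

```python
import random
import statistics

random.seed(0)
# toy data (x, y) with y roughly linear in x plus noise
data = [(float(x), 2.0 * x + random.gauss(0, 1)) for x in range(20)]

def fit_tree(sample):
    # stand-in for a tree fit: a depth-0 tree, i.e. one leaf
    # predicting the mean response of the (bootstrap) sample
    m = statistics.mean(y for _, y in sample)
    return lambda x: m

B = 50
fits = [fit_tree(random.choices(data, k=len(data)))  # n draws with replacement
        for _ in range(B)]

def m_bagg(x):
    # the bagging estimator: average of the B bootstrap fits
    return sum(f(x) for f in fits) / B
```

Swapping `fit_tree` for a real (non-pruned) tree learner gives the usual bagging estimator; the aggregation step stays identical.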
Note: Usually, bagging is applied on non-pruned trees. (See it as an alternative way to reduce variance)
While bagging improves performance of decision trees, interpretability is worsened.
If interpretability is not a concern, bagging is still often inferior to other learning techniques.
Random forests are a modification of bagging estimators.
Definition (Random forest)
Like bagging, random forests are an ensemble of decision trees.
They add the following modification to bagging:
We denote the random forest estimator by \(\hat m^{rf}_n\). Note that \(\hat m^{rf}_n\), as \(\hat m^{bagg}_n\), is a stochastic estimator, even given \(\mathcal D_n\).
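The defining modification is the \(\texttt{mtry}\) parameter: each split search is restricted to a random subset of \(\texttt{mtry}\) of the \(p\) features, which decorrelates the trees. A sketch of such a split search (function names and data hypothetical):

```python
import random

def sse(ys):
    # squared loss of a node: sum of squared deviations from the node mean
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split_mtry(data, mtry, rng=random):
    # data: list of (feature_vector, y). Unlike plain bagging/CART, only a
    # random subset of `mtry` coordinates is searched at this split.
    p = len(data[0][0])
    features = rng.sample(range(p), mtry)
    best = None
    for j in features:
        xs = sorted(set(v[j] for v, _ in data))
        for a, b in zip(xs, xs[1:]):
            s = (a + b) / 2
            q = (sse([y for v, y in data if v[j] <= s])
                 + sse([y for v, y in data if v[j] > s]))
            if best is None or q < best[0]:
                best = (q, j, s)
    return best  # (loss, feature index, threshold)

# toy data: feature 0 carries the signal, feature 1 is constant
data = ([((float(i), 0.0), 0.0) for i in range(5)]
        + [((float(i), 0.0), 1.0) for i in range(5, 10)])
q, j, s = best_split_mtry(data, mtry=2)
```

With `mtry` smaller than `p`, different trees are forced to use different features near the root, which is exactly what lowers the pairwise correlation \(\rho\) discussed next.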
Why is \(\texttt{mtry}\) so useful?
Assume that we are given \(l\) estimators \(\hat m_n(\mathcal D_n, U_1), \dots, \hat m_n(\mathcal D_n, U_l)\).
Here, \(U_1,\dots, U_l\) are iid variables inducing the additional stochasticity of the estimators (e.g. through random forest or bagging).
Assume that \(\hat m_n(\mathcal D_n, U_1), \dots, \hat m_n(\mathcal D_n, U_l)\) have pairwise correlation \(\rho\) and variance \(\sigma^2\).
We get \[\begin{aligned} \textrm{Var}\left (\frac 1 l \sum_{j=1}^l \hat m_n(\mathcal D_n, U_j)\right)&= \frac 1 {l^2} \sum_{j,k} \textrm{Cov}(\hat m_n(\mathcal D_n, U_j),\hat m_n(\mathcal D_n, U_k))\\&=\frac 1 {l^2} \left( l\sigma^2+ (l^2-l)\rho\sigma^2\right)\\&= \rho \sigma^2 + \frac{1-\rho}{l}\sigma^2 \end{aligned}\]
Meaning: the variance of an ensemble of trees decreases as the correlation between the trees decreases; for large \(l\) it approaches the floor \(\rho\sigma^2\).
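A quick numeric sanity check of the algebra above (arbitrary illustrative values for \(\sigma^2\), \(\rho\), \(l\)):

```python
# check: (1/l^2) (l*sigma^2 + (l^2 - l)*rho*sigma^2)
#        == rho*sigma^2 + ((1 - rho)/l)*sigma^2
sigma2, rho, l = 2.0, 0.3, 10
raw = (l * sigma2 + (l * l - l) * rho * sigma2) / l ** 2
simplified = rho * sigma2 + (1 - rho) / l * sigma2
floor = rho * sigma2   # limit as l -> infinity: the correlation sets the floor
```

The second term vanishes as \(l\) grows, so only decorrelating the trees (smaller \(\rho\)) can push the ensemble variance below \(\rho\sigma^2\).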
Decision trees are not good with additive functions (example)