Lecture 6: Gradient Boosting Machines

Munir Hiabu

Gradient Boosting Machines

  • In the previous lecture we discussed that random forests are not good at estimating additive structures.
  • In this lecture we will see two tree-based estimators that can deal better with additive structures.

Gradient Boosting Machines Boosting

  • The general idea of boosting is to construct an estimator of the form

\[\hat m_n(x)=\sum_{j=1}^B \hat m_{n,j}(x),\] where \(\hat m_{n,b}\) is the result of improving on the estimator \(\hat m_n^{(b-1)}(x)=\sum_{j=1}^{b-1} \hat m_{n,j}(x)\).

Gradient Boosting Machines Forward Stagewise Additive Modeling

Definition (Forward Stagewise Additive Modeling)

The function \(\hat m_n(x)=\sum_{j=0}^B \hat m_{n,j}(x)\) is called a Forward Stagewise Additive Modeling estimator if

  • \(\hat m_{n,0}(x)=\arg \min_{\eta\in \mathcal G} \sum_{i=1}^n L\left(Y_i, \eta\right)\) and

  • for \(b=1,\dots,B\) \[\hat m_{n,b} = \arg \min_{\eta\in \mathcal G} \hat R_n\left( \hat m_n^{(b-1)}+\eta\right),\] where \(\hat m_n^{(b-1)}=\sum_{j=0}^{b-1}\hat m_{n,j}.\)

  • The set \(\mathcal G\) is chosen to be very restrictive/small.

  • The component \(\hat m_{n,b}\) is called a weak learner.

Gradient Boosting Machines Forward Stagewise Additive Modeling

  • Problem: Finding the exact minimizer is only feasible (i.e., not harder than the usual minimization) for very specific loss functions.
    • In the case of squared loss, minimization in step \(b\) boils down to a least-squares fit to the residuals \(Y_i-\hat m_n^{(b-1)}(X_i)\):
      • \(L\left(Y_i, \hat m_n^{(b-1)}(X_i)+\eta(X_i)\right)=L\left(Y_i-\hat m_n^{(b-1)}(X_i),\eta(X_i)\right)\)
    • In the classification case with \(\{-1,1\}\)-valued response and exponential loss \((L(y_1,y_2)=e^{-y_1y_2})\), minimization in step \(b\) boils down to minimizing the weighted empirical exponential loss with observation \(i\) receiving weight \(e^{-Y_i\hat m_n^{(b-1)}(X_i)}\).
      • \(L\left(Y_i, \hat m_n^{(b-1)}(X_i)+\eta(X_i)\right)=e^{-Y_i\hat m_n^{(b-1)}(X_i)}L\left(Y_i,\eta(X_i)\right)\)
      • This algorithm is also known as AdaBoost.M1. AdaBoost.M1 was introduced in Freund and Schapire (1997) and was only later (J. Friedman, Hastie, and Tibshirani 2000) identified as Forward Stagewise Additive Modeling with exponential loss.
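Both identities are easy to verify numerically. The snippet below is an illustrative check with arbitrary values; it is not part of any algorithm.

```python
import math

# Check the two loss identities above numerically
# (y_true, m_prev, eta_val are arbitrary illustrative values).

# Squared loss: L(y1, y2) = (y1 - y2)^2
sq = lambda y1, y2: (y1 - y2) ** 2
y_true, m_prev, eta_val = 1.3, 0.4, 0.2
# L(Y, m + eta) = L(Y - m, eta): boosting with squared loss fits residuals
assert math.isclose(sq(y_true, m_prev + eta_val), sq(y_true - m_prev, eta_val))

# Exponential loss: L(y1, y2) = exp(-y1 * y2) with y in {-1, +1}
ex = lambda y1, y2: math.exp(-y1 * y2)
y_cls = -1
# L(Y, m + eta) = exp(-Y * m) * L(Y, eta): the previous fit enters only as a weight
assert math.isclose(ex(y_cls, m_prev + eta_val),
                    math.exp(-y_cls * m_prev) * ex(y_cls, eta_val))
```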

Gradient Boosting Machines First version

Definition (Gradient Boosting Machine v1) (J. H. Friedman 2001)

  • Gradient Boosting approximates Forward Stagewise Additive Modeling.
  • Tuning parameters:
    • function space \(\mathcal G\)
    • learning rate \(\eta\)
    • Number of weak learners \(B\)
  1. Initialize \(\hat m_{n}^{(0)}(x)=\arg \min_{m \in \mathcal G} \sum_{i=1}^n L\left(Y_i, m(X_i)\right)\).
  2. For \(b=1,\dots,B\) :
    1. For \(i=1,2, \ldots, n,\) calculate the gradient: \[ g_{ib}=-\frac{\partial L(Y_i,y)}{\partial y} \Bigr\rvert_{y=\hat m_n^{(b-1)}(X_i)} . \]
    2. \((\xi_b, \tilde m_{n,b})= \arg \min_{\xi\in \mathbb R_{>0},m \in \mathcal G}\sum_{i=1}^n(g_{ib}-\xi m(X_i))^2\) (\(\xi_b\) is only necessary if \(\mathcal G\) is not closed under multiplication by constants)
    3. \(\alpha_b= \arg \min_{\alpha} \hat R_n(m_n^{(b-1)}+\alpha \tilde m_{n,b})\) (line search)
    4. \(\hat m_{n,b}(x)= \alpha_b \tilde m_{n,b}(x)\) (weak learner)
    5. Update: \(\hat m^{(b)}_n(x)=\hat m^{(b-1)}_n(x)+\eta \hat m_{n,b}(x)\).
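The loop above can be sketched in a few lines. Below is a minimal illustration, assuming squared loss \(L(y,m)=(y-m)^2/2\) (so the negative gradient is just the residual, and the line search is trivial) and one-dimensional decision stumps as \(\mathcal G\); all names and data are made up.

```python
# Minimal gradient boosting sketch (squared loss, one-dimensional stumps).
# Illustrative only, not a reference implementation.

def fit_stump(x, g):
    """Least-squares fit of a depth-1 tree (stump) to targets g (step 2.2)."""
    best = None
    for split in sorted(set(x)):
        left = [gi for xi, gi in zip(x, g) if xi <= split]
        right = [gi for xi, gi in zip(x, g) if xi > split]
        if not left or not right:
            continue
        vl, vr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((gi - vl) ** 2 for gi in left)
               + sum((gi - vr) ** 2 for gi in right))
        if best is None or sse < best[0]:
            best = (sse, split, vl, vr)
    _, s, vl, vr = best
    return lambda xi: vl if xi <= s else vr

def gbm(x, y, B=200, lr=0.1):
    m0 = sum(y) / len(y)                                  # step 1: best constant
    pred = [m0] * len(x)
    stumps = []
    for _ in range(B):                                    # step 2
        grad = [yi - pi for yi, pi in zip(y, pred)]       # step 2.1: -gradient = residual
        stump = fit_stump(x, grad)                        # step 2.2
        stumps.append(stump)
        pred = [pi + lr * stump(xi) for xi, pi in zip(x, pred)]  # step 2.5
    return lambda xi: m0 + lr * sum(s(xi) for s in stumps)

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.1, 0.9, 2.1, 2.9, 4.2, 5.0]       # roughly y = x
model = gbm(x, y)
```

With squared loss the line search in steps 2.2-2.3 collapses, because the least-squares fit to the negative gradient already solves the stagewise problem up to the learning rate.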

Gradient Boosting Machines First version

Definition (Gradient Boosting Machine for trees v1) (J. H. Friedman 2001)

  • Tuning parameters:
    • max depth \(J\) \(\rightarrow \mathcal G=\{\text{trees of max depth} \ J\}\)
    • learning rate \(\eta\)
    • Number of weak learners \(B\)
  1. Initialize \(\hat m_{n}^{(0)}(x)=\arg \min_{m \in \mathcal G} \sum_{i=1}^n L\left(Y_i, m(X_i)\right)\).
  2. For \(b=1,\dots,B\) :
    1. For \(i=1,2, \ldots, n,\) calculate the gradient: \[ g_{ib}=-\frac{\partial L(Y_i,y)}{\partial y} \Bigr\rvert_{y=\hat m_n^{(b-1)}(X_i)} . \]
    2. Fit a regression tree with squared loss and max depth \(J\) to the targets \(g_{ib}\). \(\rightarrow\) leaves \(A_{bj}, j=1,2, \ldots, J_b\).
    3. For \(j=1,2, \ldots, J_b\) calculate the value for leaf \(A_{bj}\): \[ v_{bj}=\arg \min_{ v \in \mathbb R} \sum_{i: X_i \in A_{bj}} L\left(Y_i, \hat m_n^{(b-1)}(X_i)+ v\right) . \]
    4. \(\hat m_{n,b}(x)=\sum_{j=1}^{J_b} v_{bj} 1\left(x \in A_{b j}\right)\)
    5. Update: \(\hat m^{(b)}_n(x)=\hat m^{(b-1)}_n(x)+ \eta \hat m_{n,b}(x)\).
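Step 3 is where the original loss re-enters. For example, with absolute loss \(L(y,m)=|y-m|\), the optimal leaf value is the median of the residuals falling into that leaf. A small check with arbitrary residual values:

```python
from statistics import median

# Step 3 for absolute loss: v minimizes sum_i |Y_i - m_prev(X_i) - v| over the
# leaf, i.e. v is the median of the residuals (values below are arbitrary).
residuals = [0.4, -1.2, 0.9, 0.1, 2.3]   # Y_i - m_prev(X_i) for i in the leaf

def leaf_loss(v):
    return sum(abs(r - v) for r in residuals)

v_opt = median(residuals)                # closed-form minimizer for absolute loss
# grid check: no value on a fine grid does better than the median
assert all(leaf_loss(v_opt) <= leaf_loss(k / 100) + 1e-12 for k in range(-300, 300))
```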

Gradient Boosting Machines First version

  • The main differences between the tree gradient booster and general gradient boosting are
    • The least-squares problem is not solved exactly but greedily, step-wise, by growing a tree via the CART algorithm.
    • Instead of a single line search, the empirical loss is minimized separately in each leaf.

Gradient Boosting Machines Second version

  • More recently, Chen and Guestrin (2016) proposed optimizing a regularized objective in Forward Stagewise Additive Modeling, with the following further changes:
    1. Replace \(\hat R_n\) by \(\hat R_{n,\gamma,\lambda}(\hat m_n^{(b-1)}+m_{n,b}) =\hat R_n(\hat m_n^{(b-1)}+m_{n,b}) + J_{\gamma,\lambda}(m_{n,b})\)
      • \(J_{\gamma,\lambda}(m_{n,b})= \gamma J_b+ \frac 1 2 \lambda ||v_b||_2^2\)
        • \(J_b\)= number of leaves of tree in iteration \(b\)
        • \(v_b=\) leaf values of tree in iteration \(b\)
    2. Use second order approximation instead of gradient descent (see next slide)
      • i.e. gradient descent in 2b,c is replaced by a Newton-Raphson type approximation
  • The algorithm is implemented in the \(\texttt{xgboost}\) package.
    • Main hyperparameters:
      • Penalties: \(\gamma\), \(\lambda\)
      • learning rate: \(\eta\)
      • max tree depth
      • number of weak learners \(B\)
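For reference, a sketch of how the quantities above map to parameter names in the xgboost native interface; the values below are arbitrary placeholders, not tuned recommendations (check the names against the xgboost documentation).

```python
# Slide notation -> xgboost (native interface) parameter names.
# Values are arbitrary placeholders, not recommendations.
params = {
    "gamma": 1.0,       # per-leaf penalty gamma in J_{gamma,lambda}
    "lambda": 1.0,      # L2 penalty lambda on the leaf values
    "eta": 0.1,         # learning rate eta
    "max_depth": 3,     # maximal tree depth
}
num_boost_round = 100   # number of weak learners B (passed to xgb.train)
```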

Gradient Boosting Machines Second version

  • Use a second order approximation of \(\hat R_{n,\gamma,\lambda}\) \[\begin{align} &\hat R_{n,\gamma,\lambda}\left(\hat m_n^{(b-1)}+\hat m_{n,b}\right)\\ &\approx\sum_{i=1}^n \left\{ L\left(Y_i,\hat m_n^{(b-1)}(X_i)\right)+ g_i\hat m_{n,b}(X_i)+ \frac 1 2 h_i \hat m_{n,b}(X_i)^2\right\}+ J_{\gamma,\lambda}(m_{n,b})\\ &= \sum_{A \in \mathcal A_b}\left\{ \left[\sum_{i: X_i \in A}g_i\right]v_{A} + \frac 1 2 \left[\sum_{i: X_i \in A}h_i+\lambda\right]v_{A}^2 \right\}+\gamma J_b \quad \text{(ignoring constants)} \end{align}\] where \(\mathcal A_b\) is the set of leaves of the tree in iteration \(b\) and \(v_A\) the value of leaf \(A\).
    • \(g_i= \frac{\partial L(Y_i,y)}{\partial y} \Bigr\rvert_{y=\hat m_n^{(b-1)}(X_i)}\)
    • \(h_i= \frac{\partial^2 L(Y_i,y)}{\partial y^2} \Bigr\rvert_{y=\hat m_n^{(b-1)}(X_i)}\)
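  • For example, with squared loss \(L(y_1,y_2)=\frac 1 2 (y_1-y_2)^2\) these derivatives are \[g_i = -(Y_i - \hat m_n^{(b-1)}(X_i)), \qquad h_i = 1,\] so the second order approximation amounts to a penalized least-squares fit to the residuals, consistent with the squared-loss case discussed earlier.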

Gradient Boosting Machines Second version

  • In the last expression, for a fixed tree, the optimal leaf values \(v_{A}\) are given by \[v_{A}=-\frac{\sum_{i: X_i \in A} g_i }{\sum_{i: X_i \in A} h_i+\lambda}\]
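This closed form is easy to confirm numerically; in the check below, \(G\), \(H\), and \(\lambda\) are arbitrary illustrative sums.

```python
# Check that v_A = -G / (H + lam) minimizes the per-leaf term
# G*v + 0.5*(H + lam)*v^2 from the second order approximation.
G, H, lam = -3.0, 5.0, 1.0   # G = sum of g_i, H = sum of h_i in the leaf

def leaf_objective(v):
    return G * v + 0.5 * (H + lam) * v ** 2

v_A = -G / (H + lam)         # closed-form optimum from the slide
# plugging v_A back in gives -G^2 / (2 * (H + lam))
assert all(leaf_objective(v_A) <= leaf_objective(k / 100) + 1e-12
           for k in range(-200, 200))
```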

  • This leads to the objective function \[-\frac 1 2\sum_{A \in \mathcal A_b} \frac {(\sum_{i: X_i \in A}g_i)^2}{\sum_{i: X_i \in A}h_i+\lambda}+\gamma J_b\]

  • Hence, in step \(b\), when growing a tree the loss reduction of splitting \(A\) in \(A_{L},A_{R}\) is given by \[ \frac 1 2 \left\{\frac {(\sum_{i: X_i \in A_{L}}g_i)^2}{\sum_{i: X_i \in A_{L}}h_i+\lambda}+\frac {(\sum_{i: X_i \in A_{R}}g_i)^2}{\sum_{i: X_i \in A_{R}}h_i+\lambda}-\frac {(\sum_{i: X_i \in A}g_i)^2}{\sum_{i: X_i \in A}h_i+\lambda}\right\}-\gamma \]

  • In step \(b\), a node is split at the dimension and position that maximise the loss reduction, provided that the maximal loss reduction is positive and the current node has depth smaller than \(J\). Otherwise the node is not split further.
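The split rule above can be sketched as a small helper; the function names and the example sums are illustrative, not taken from any library.

```python
# Loss reduction of splitting node A into A_L and A_R, following the slide.
# G_*, H_* denote the sums of g_i and h_i over the respective regions.

def score(G, H, lam):
    return G * G / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    G, H = G_L + G_R, H_L + H_R   # sums over the parent node A
    return 0.5 * (score(G_L, H_L, lam) + score(G_R, H_R, lam)
                  - score(G, H, lam)) - gamma

# Example with arbitrary sums: the split is kept only if the gain is positive.
gain = split_gain(G_L=-4.0, H_L=3.0, G_R=2.0, H_R=3.0, lam=1.0, gamma=0.5)
assert gain > 0
```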

Gradient Boosting Machines Remarks

  • Gradient boosting machines are known to often provide the strongest predictive performance.
  • \(\texttt{xgboost}\), \(\texttt{lightgbm}\), and \(\texttt{CatBoost}\) are popular implementations.
  • They are quite fast, though not as fast as random forests, and they are more reliant on careful parameter tuning.
    • The most relevant parameters are tree depth and learning rate
  • Why are gradient boosting machines so powerful?
    • A contributing factor may be that a small max depth restricts the number of interactions fitted.
      • max depth \(=1\) corresponds to an additive model, max depth \(=2\) allows two-way interactions, and so on.
    • Gradient boosting machines, in contrast to random forests, also have no problem fitting additive functions.
      • Different additive components can be fitted in different iterations \(b\).

References

Chen, Tianqi, and Carlos Guestrin. 2016. “XGBoost: A Scalable Tree Boosting System.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–94.
Freund, Yoav, and Robert E. Schapire. 1997. “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting.” Journal of Computer and System Sciences 55 (1): 119–39.
Friedman, Jerome H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” Annals of Statistics, 1189–1232.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2000. “Additive Logistic Regression: A Statistical View of Boosting (with Discussion and a Rejoinder by the Authors).” The Annals of Statistics 28 (2): 337–407.