\[\hat m_n(x)=\sum_{j=1}^B \hat m_{n,j}(x),\] where \(\hat m_{n,b}\) is obtained by improving on the current estimator \(\hat m_n^{(b-1)}(x)=\sum_{j=1}^{b-1} \hat m_{n,j}(x)\).
Definition (Forward Stagewise Additive Modeling)
The function \(\hat m_n(x)=\sum_{j=0}^B \hat m_{n,j}(x)\) is called the Forward Stagewise Additive Modeling estimator if
\(\hat m_{n,0}(x)=\arg \min_{\eta\in \mathcal G} \sum_{i=1}^n L\left(Y_i, \eta(X_i)\right)\) and,
for \(b=1,\dots,B\), \[\hat m_{n,b} = \arg \min_{\eta\in \mathcal G} \hat R_n\left( \hat m_n^{(b-1)}(x)+\eta(x)\right),\] where \(\hat m_n^{(b-1)}=\sum_{j=0}^{b-1}\hat m_{n,j}.\)
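For squared loss this recursion takes a concrete form: minimizing \(\hat R_n(\hat m_n^{(b-1)}+\eta)\) over \(\eta\) amounts to fitting the current residuals. A minimal sketch, assuming squared loss and taking \(\mathcal G\) to be regression stumps (the names `fit_stump` and `fsam` are illustrative, not from a library):

```python
import numpy as np

def fit_stump(x, r):
    """Least-squares regression stump on the residuals r:
    one threshold, two constant leaves."""
    best_sse, best_params = np.inf, None
    for t in np.unique(x):
        left, right = r[x <= t], r[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_params = sse, (t, left.mean(), right.mean())
    t, c_left, c_right = best_params
    return lambda z: np.where(z <= t, c_left, c_right)

def fsam(x, y, B=50):
    """Forward stagewise additive modeling with squared loss:
    m_{n,0} is the optimal constant; each m_{n,b} is the stump that
    minimizes R_n(m^{(b-1)} + eta), i.e. the best fit to the residuals."""
    m0 = y.mean()                       # argmin_c sum_i (Y_i - c)^2
    fitted = np.full_like(y, m0, dtype=float)
    learners = []
    for _ in range(B):
        eta = fit_stump(x, y - fitted)  # weak learner on current residuals
        learners.append(eta)
        fitted += eta(x)                # stagewise update: earlier terms are frozen
    return lambda z: m0 + sum(eta(z) for eta in learners)
```

Note that each \(\hat m_{n,j}\) is frozen once fitted; only the new component is optimized in step \(b\).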
The set \(\mathcal G\) is deliberately chosen to be very restrictive/small, e.g. trees of small depth.
The component \(\hat m_{n,b}\) is called a weak learner.
Definition (Gradient Boosting Machine v1) (J. H. Friedman 2001)
Definition (Gradient Boosting Machine for trees v1) (J. H. Friedman 2001)
In the last expression, for a fixed tree structure, the optimal leaf values \(v_{A}\) are given by \[v_{A}=-\frac{\sum_{i: X_i \in A} g_i }{\sum_{i: X_i \in A} h_i+\lambda}.\]
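This formula can be verified leaf by leaf (a short sketch, assuming the penalized second-order approximation of the risk): for a fixed tree structure each leaf \(A\) contributes \[\sum_{i: X_i \in A}\left(g_i v_A+\frac 1 2 h_i v_A^2\right)+\frac \lambda 2 v_A^2=\Big(\sum_{i: X_i \in A}g_i\Big)v_A+\frac 1 2\Big(\sum_{i: X_i \in A}h_i+\lambda\Big)v_A^2,\] a quadratic in \(v_A\). Setting its derivative to zero yields the stated \(v_A\), and plugging \(v_A\) back in shows that the minimal contribution of leaf \(A\) equals \(-\frac 1 2 \,(\sum_{i: X_i \in A}g_i)^2/(\sum_{i: X_i \in A}h_i+\lambda)\).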
Substituting these values back leads to the objective function \[-\frac 1 2\sum_{A \in \mathcal A_b} \frac {(\sum_{i: X_i \in A}g_i)^2}{\sum_{i: X_i \in A}h_i+\lambda}+\gamma J_b.\]
Hence, in step \(b\), when growing a tree the loss reduction from splitting a node \(A\) into \(A_{L}\) and \(A_{R}\) is given by \[ \frac 1 2 \left\{\frac {(\sum_{i: X_i \in A_{L}}g_i)^2}{\sum_{i: X_i \in A_{L}}h_i+\lambda}+\frac {(\sum_{i: X_i \in A_{R}}g_i)^2}{\sum_{i: X_i \in A_{R}}h_i+\lambda}-\frac {(\sum_{i: X_i \in A}g_i)^2}{\sum_{i: X_i \in A}h_i+\lambda}\right\}-\gamma \]
In step \(b\), the \(j\)th node is split at the dimension and position that maximise the loss reduction, provided that the maximal loss reduction is positive and the current node has depth smaller than \(J\). Otherwise the node is not split further.
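The leaf-value and split-gain formulas above can be sketched as follows (an illustrative Python sketch for a single feature, not the XGBoost implementation; the names `leaf_value`, `split_gain`, and `best_split` are hypothetical):

```python
import numpy as np

def leaf_value(g, h, lam):
    """Optimal leaf value v_A = -sum(g) / (sum(h) + lambda)."""
    return -g.sum() / (h.sum() + lam)

def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Loss reduction from splitting A into A_L and A_R."""
    def score(g, h):
        # per-leaf term (sum g)^2 / (sum h + lambda)
        return g.sum() ** 2 / (h.sum() + lam)
    score_parent = score(np.concatenate([g_left, g_right]),
                         np.concatenate([h_left, h_right]))
    return 0.5 * (score(g_left, h_left)
                  + score(g_right, h_right)
                  - score_parent) - gamma

def best_split(x, g, h, lam=1.0, gamma=0.0):
    """Scan one feature for the threshold maximising the gain.
    Returns (None, 0.0) if no split has positive gain, in which
    case the node is not split further."""
    best_t, best_gain = None, 0.0
    for t in np.unique(x)[:-1]:          # last value gives an empty right child
        mask = x <= t
        gain = split_gain(g[mask], h[mask], g[~mask], h[~mask], lam, gamma)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```

Growing a tree then amounts to calling `best_split` over all features at each node, recursing while the returned gain is positive and the depth bound \(J\) is not reached.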