In this lecture we will look at the linear regression problem with squared loss \(L(y_1,y_2)=(y_1-y_2)^2\).
We already know that the Bayes rule is \(m^\ast(x)=\mathbb E[Y\mid X=x]\).
Linear Model Assumption
There are parameters \(\beta_0^\ast, \beta_1^\ast, \dots, \beta_p^\ast\) with \[ m^\ast(x)= \beta_0^\ast + \sum_{j=1}^p \beta_j^\ast x_j. \] In other words, we assume that \(m^\ast\) is a linear function, i.e., \(m^\ast \in \mathcal G=\{f: \mathbb R^{p}\to \mathbb R \mid f(x)=\beta_0 + \sum_{j=1}^p \beta_j x_j,\ \beta_0,\dots,\beta_p\in \mathbb R \}\).
First Implication of Linear Model Assumption
Given iid training data \((X_i,Y_i)\), \(i=1,\dots,n\), we have an additive noise model \[ Y_i= \beta_0^\ast + \sum_{j=1}^p \beta_j^\ast X_{ij}+ \varepsilon_i, \] with \(\varepsilon_i=Y_i-m^{\ast}(X_i)\); the \(\varepsilon_i\) are hence iid with \(\mathbb E[\varepsilon_i\mid X_i]=0.\)
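As a quick illustration, the additive noise model can be simulated as follows (a minimal sketch; the coefficients, noise level, and covariate distribution are arbitrary choices):

```python
import numpy as np

# Simulate iid data from the linear model Y_i = beta_0 + sum_j beta_j X_ij + eps_i.
# beta0_star, beta_star and sigma are illustrative choices, not from the lecture.
rng = np.random.default_rng(0)
n, p = 1000, 3
beta0_star = 1.0
beta_star = np.array([2.0, -1.0, 0.5])
sigma = 0.3

X = rng.normal(size=(n, p))                # iid covariates
eps = rng.normal(scale=sigma, size=n)      # noise with E[eps | X] = 0
Y = beta0_star + X @ beta_star + eps       # additive noise model

# Sanity check: the empirical mean of the noise is close to zero.
print(abs(eps.mean()))
```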
By assuming that \(\mathbb E[Y\mid X=x]\) lies in the class \(\mathcal G\) of linear functions, we only have to deal with the estimation error and can ignore the inductive bias/approximation error.
An alternative is to consider \(\beta^\ast= \Sigma^{-1} \mathbb E[YX], \Sigma=\mathbb E[XX^T]\), i.e., the best linear predictor without assuming that \(\mathbb E[Y|X]\) is linear.
Lemma (Coefficients in linear model)
Under the linear model assumption, we have for \(j=1,\dots,p\) \[ \beta_j^\ast= \frac{\operatorname{Cov}(X_{1j}, Y_1-\sum_{k\in\{1,\dots,p\}\setminus \{j\}}\beta_k^\ast X_{1k})}{\operatorname{Var}(X_{1j})}. \] In particular, if the components of \(X\) are uncorrelated, we have \[ \beta_j^\ast= \frac{\operatorname{Cov}(X_{1j}, Y_1)}{\operatorname{Var}(X_{1j})}. \] Proof: From the linear model assumption we have \(\operatorname{Cov}(X_{1j},Y_1)=\sum_{k\in\{1,\dots,p\}}\beta_k^\ast \operatorname{Cov}(X_{1j},X_{1k})\); moving the terms with \(k\neq j\) back into the covariance and dividing by \(\operatorname{Var}(X_{1j})\) gives the claim.
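The uncorrelated case can be checked numerically (a sketch with independent, hence uncorrelated, standard normal covariates; population covariances are replaced by empirical ones, so the match is only approximate):

```python
import numpy as np

# Check beta_j = Cov(X_j, Y) / Var(X_j) for uncorrelated covariates.
# beta_star and the sample size are illustrative choices.
rng = np.random.default_rng(1)
n, p = 200_000, 3
beta_star = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(n, p))        # independent components, hence uncorrelated
Y = X @ beta_star + rng.normal(size=n)

# Empirical Cov(X_j, Y) / Var(X_j) for each component.
ratios = np.array([np.cov(X[:, j], Y)[0, 1] / X[:, j].var() for j in range(p)])
print(np.round(ratios, 2))         # close to beta_star
```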
Lemma (Bayes risk)
Under the linear model assumption, and assuming \(\operatorname{Var}(X_{ij})>0\), we have
In the following we write \[\mathbf{X}=\left(\begin{array}{ccc}X_{11} & \ldots & X_{1 p} \\ \vdots & \ddots & \vdots \\ X_{n 1} & \ldots & X_{n p}\end{array}\right), \quad \mathbf{Y}=\left(\begin{array}{c}Y_1 \\ \vdots \\ Y_n\end{array}\right), \quad \mathbf{\varepsilon}=\left(\begin{array}{c}\varepsilon_1 \\ \vdots \\ \varepsilon_n\end{array}\right)\]
Lemma (Least squares estimator)
Theorem (Excess risk Least squares estimator)
If \(\mathbf{X}^{T}\mathbf{X}\) is invertible, then \[ \mathbb E[R\left(\hat{ m}^{LS}_n \right) | \mathbf X]-r\left(m^\ast\right)=\frac{\sigma^2}{n} \cdot \operatorname{tr}\left(\Sigma \hat{\Sigma}^{-1}\right), \] where \(\Sigma=\mathbb E[XX^T]\) and \(\hat \Sigma=n^{-1}\mathbf X^T\mathbf X\).
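The identity can be verified by Monte Carlo, holding one design \(\mathbf X\) fixed and averaging the excess risk \((\hat\beta-\beta^\ast)^T\Sigma(\hat\beta-\beta^\ast)\) over repeated noise draws (all parameters below are illustrative choices):

```python
import numpy as np

# Monte Carlo check of E[R(m_LS) | X] - r(m*) = sigma^2/n * tr(Sigma Sigma_hat^{-1}).
rng = np.random.default_rng(2)
n, p, sigma = 100, 5, 1.0
Sigma = np.eye(p)                              # population covariance of X (chosen here)
X = rng.normal(size=(n, p))                    # one fixed design
Sigma_hat = X.T @ X / n
beta_star = rng.normal(size=p)

theory = sigma**2 / n * np.trace(Sigma @ np.linalg.inv(Sigma_hat))

reps, acc = 2000, 0.0
for _ in range(reps):
    Y = X @ beta_star + rng.normal(scale=sigma, size=n)
    beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]    # least squares fit
    d = beta_hat - beta_star
    acc += d @ Sigma @ d                               # conditional excess risk of this fit
print(acc / reps, theory)                              # the two should be close
```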
Up until now we have assumed that \(\mathbf{X}\) has full rank. This is not very realistic for very large \(p\) \((p \gg n)\). Even if \(\mathbf{X}^T\mathbf{X}\) is invertible, the variance of the estimation error might be too large.
This leads us to penalized models that reduce variance by adding some bias.
Other alternatives (not discussed here) are dimension reduction, say via PCA, or feature selection, say forward stepwise regression.
Definition (Ridge regression)
Let \(\lambda \geq 0\) and \[J_\lambda(\beta)=\lambda\|\beta\|_2^2=\lambda \sum_{j=1}^p \beta_j^2 .\] The ridge estimator is defined as \[\begin{aligned} \hat{\beta}_\lambda^{\text {ridge }} & =\underset{\beta \in \mathbb{R}^p}{\arg \min }\left\{\hat{R}_n(\beta)+J_\lambda(\beta)\right\} \\ & =\underset{\beta \in \mathbb{R}^p}{\arg \min }\left\{\frac{1}{n}\|\mathbf{Y}-\mathbf{X} \beta\|_2^2+\lambda\|\beta\|_2^2\right\} . \end{aligned}\] The corresponding decision function is \[\hat{m}_{n, \lambda}^{\text {ridge }}(x)=\sum_{j=1}^p \hat{\beta}^\text{ridge}_{\lambda, j} x_j .\]
Lemma (Ridge regression solution)
Let \(\lambda>0\). Then \[ \hat{\beta}^\text{ridge}_\lambda=\left(\mathbf{X}^T \mathbf{X}+\lambda n I_{p \times p}\right)^{-1} \mathbf{X}^T \mathbf{Y}. \] Proof. Set the gradient of \(\beta \mapsto \hat{R}_n(\beta)+J_\lambda(\beta)\) equal to zero and solve for \(\beta\).
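A quick numerical sanity check of the closed form (a sketch, assuming \(\hat R_n(\beta)=\frac1n\|\mathbf Y-\mathbf X\beta\|_2^2\)): at the claimed minimizer the gradient of the penalized objective must vanish.

```python
import numpy as np

# Verify the ridge closed form: the gradient of
# (1/n)||Y - X beta||^2 + lam * ||beta||^2 vanishes at the claimed solution.
rng = np.random.default_rng(3)
n, p, lam = 50, 10, 0.5
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

beta_ridge = np.linalg.solve(X.T @ X + lam * n * np.eye(p), X.T @ Y)

grad = -2 / n * X.T @ (Y - X @ beta_ridge) + 2 * lam * beta_ridge
print(np.max(np.abs(grad)))   # numerically zero
```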
In ridge regression, the matrix \(\mathbf{X}^T \mathbf{X}\) is ‘made invertible’ by adding a positive multiple of the identity matrix. Therefore, the ridge estimator also can be used in the case \(p>n\).
The name ‘ridge’ stems from the fact that the optimization problem is equivalent to \[ \min _{ \beta \in \mathbb{R}^p} \hat{R}_n(\beta) \quad \text { s.t. } \quad\|\beta\|_2 \leq t \] for some suitable \(t>0\).
Theorem (Excess risk for ridge regression estimator)
Under the linear model, \[ \mathbb E[ R\left(\hat{m}^{n,\text{ridge}}_\lambda \right)|X]-r\left(m^*\right)=\frac{\sigma^2}{n} \cdot \operatorname{tr}\left(\Sigma\left(\hat{\Sigma}+\lambda I_{p \times p}\right)^{-1} \hat{\Sigma}\left(\hat{\Sigma}+\lambda I_{p \times p}\right)^{-1}\right)+\lambda^2\left\|\Sigma^{1 / 2}\left(\hat{\Sigma}+\lambda I_{p \times p}\right)^{-1} \beta^*\right\|_2^2. \] Let \(\Sigma=U D U^T\) be the spectral decomposition of \(\Sigma\) with orthogonal matrix \(U\) and diagonal matrix \(D=\operatorname{diag}\left(s_1, \ldots, s_p\right.\) ) (entries are the eigenvalues of \(\Sigma\) ). By assuming \(\hat \Sigma= \Sigma\), the excess risk simplifies to \[ \frac{\sigma^2}{n} \sum_{j=1}^p \frac{s_j^2}{\left(s_j+\lambda\right)^2}+\lambda^2 \cdot \sum_{j=1}^p \frac{s_j\left(U^T \beta^*\right)_j^2}{\left(s_j+\lambda\right)^2}. \]
Remark (Excess risk for ridge regression estimator)
If all eigenvalues are equal, that is, \(s_j=s\), and if additionally \(\left(U^T \beta^*\right)_j=b\) \((j=1, \ldots, p)\), then the expression of the theorem simplifies to \[ \frac{\sigma^2 p}{n} \cdot \frac{s^2}{(s+\lambda)^2}+\lambda^2 \frac{s b^2 p}{(s+\lambda)^2}, \] which is minimized at \(\lambda=\frac{\sigma^2/n}{b^2}\) with minimal value \[ \frac{\sigma^2 p}{n} \cdot \frac{b^2 s}{\frac{\sigma^2}{n}+b^2 s} \leq \frac{\sigma^2 p}{n} . \] We see that for a suitable choice of the penalization parameter \(\lambda\), the excess Bayes risk of the ridge estimator can be smaller than the corresponding upper bound of the LS estimator.
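The bias–variance tradeoff in \(\lambda\) and the optimal choice can be checked numerically (the values of \(\sigma, s, b, n, p\) are arbitrary illustrative choices):

```python
import numpy as np

# Scan the simplified excess-risk curve in the equal-eigenvalue case and
# compare the grid minimizer with the closed-form lambda = (sigma^2/n)/b^2.
sigma, s, b, n, p = 1.0, 2.0, 0.5, 100, 10

def excess(lam):
    # variance term + bias term from the remark
    return (sigma**2 * p / n) * s**2 / (s + lam)**2 \
        + lam**2 * s * b**2 * p / (s + lam)**2

lams = np.linspace(1e-4, 1.0, 10_000)
lam_best = lams[np.argmin(excess(lams))]
lam_formula = (sigma**2 / n) / b**2
print(lam_best, lam_formula)   # grid minimizer vs. closed-form lambda
```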
If we believe that some covariates are pure noise, i.e., unrelated to \(Y\), the most obvious choice to penalize \(\beta\) would be of the form \(\|\beta\|_0=\#\{j\in\{1, \ldots, p\}: \beta_j \neq 0\}\), i.e., one would simply penalize the number of non-zero entries of \(\beta\). However, this leads to an NP-hard optimization problem whose solutions are not accessible in practice. One therefore uses a different norm which has similar properties but leads to a convex optimization problem.
Definition (Lasso - Least absolute shrinkage and selection operator regression)
Let \(\lambda \geq 0\) and \[ J_\lambda(\beta)=\lambda \cdot\|\beta\|_1=\lambda \sum_{j=1}^p\left|\beta_j\right| . \] The LASSO estimator is given by \[ \begin{aligned} \hat{\beta}_\lambda^{\text {lasso }} & \in \underset{\beta \in \mathbb{R}^p}{\arg \min }\left\{\hat{R}_n(\beta)+J_\lambda(\beta)\right\} \\ & =\underset{\beta \in \mathbb{R}^p}{\arg \min }\left\{\frac{1}{n}\|\mathbf{Y}-\mathbf{X} \beta\|_2^2+\lambda \cdot\|\beta\|_1\right\} \end{aligned} \] The corresponding decision function reads \[\hat{m}_{n, \lambda}^{\text{lasso}}(x)=\sum_{j=1}^p \hat{\beta}^{\text{lasso}}_{\lambda, j} x_j .\]
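A common way to compute the LASSO estimator is cyclic coordinate descent with soft-thresholding; the following is a minimal, untuned sketch for the objective above (the data and \(\lambda\) are illustrative choices):

```python
import numpy as np

# Minimal coordinate-descent solver for (1/n)||Y - X beta||_2^2 + lam * ||beta||_1.

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, Y, lam, n_sweeps=200):
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            r_j = Y - X @ beta + X[:, j] * beta[j]   # partial residual without coord j
            z = X[:, j] @ r_j / n
            # coordinate-wise minimizer: soft-threshold at lam/2
            beta[j] = soft_threshold(z, lam / 2) / (X[:, j] @ X[:, j] / n)
    return beta

rng = np.random.default_rng(4)
n, p = 200, 10
beta_star = np.zeros(p); beta_star[:3] = [3.0, -2.0, 1.5]   # sparse truth
X = rng.normal(size=(n, p))
Y = X @ beta_star + 0.1 * rng.normal(size=n)

beta_hat = lasso_cd(X, Y, lam=0.1)
print(np.round(beta_hat, 2))   # large entries on the true support, rest (near) zero
```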
Definition
For \(\beta \in \mathbb{R}^p\), define \[S(\beta):=\left\{j \in\{1, \ldots, p\}: \beta_j \neq 0\right\} .\] For \(S \subset\{1, \ldots, p\}\) and \(v \in \mathbb{R}^p\), put \(v_S:=\left(v_j \mathbb{1}_{\{j \in S\}}\right)_{j=1, \ldots, p}\)
If \(p \ll n\), then \(\hat{\Sigma}\) would usually be invertible and the smallest eigenvalue (Rayleigh quotient) would satisfy \[\lambda_{\min }(\hat{\Sigma}):=\inf _{v \in \mathbb{R}^p} \frac{v^T \hat{\Sigma} v}{\|v\|_2^2}>0 .\]
Then \(\hat{\Sigma}\) would be one-to-one (injective) and the linear equation system \(\hat{\Sigma} \beta=\frac{1}{n} \mathbf{X}^T \mathbf{Y}\) would lead to the (unique) least squares estimator.
For \(p \gg n\), one has \(\lambda_{\min }(\hat{\Sigma})=0\).
When employing LASSO, we are usually only interested in estimators \(\hat{\beta}\) with non-zero entries only at the components in \(S\left(\beta^*\right)\). This means that in principle we only need injectivity of \(\hat{\Sigma}\) on the set \[ \tilde{C}=\left\{\beta \in \mathbb{R}^p: S(\beta)=S\left(\beta^*\right)\right\}=\left\{\beta \in \mathbb{R}^p:\left\|\beta_{S\left(\beta^*\right)^ c}\right\|_1=0\right\}, \] or equivalently, \(\inf _{v \in \tilde{C}} \frac{v^T \hat{\Sigma} v}{\|v\|_2^2}=\inf _{v \in \tilde{C}} \frac{v^T \hat{\Sigma} v}{\left\|v_{S\left(\beta^*\right)}\right\|_2^2}>0 .\)
Definition (Restricted eigenvalue property (REP))
We say that the restricted eigenvalue property (REP) is satisfied with \((\gamma,\alpha)\) if for \[ C:=\left\{\beta \in \mathbb{R}^p:\left\|\beta_{S\left(\beta^*\right)^c}\right\|_1 \leq \alpha \left\|\beta_{S\left(\beta^*\right)}\right\|_1\right\} \] it holds that \[ \Lambda_{\min }(\hat\Sigma):=\inf _{v \in C} \frac{v^T \hat\Sigma v}{\left\|v_{S\left(\beta^*\right)}\right\|_2^2}>\gamma. \]
Theorem
Assume that \(\lambda \geq 4\left\|\frac{\mathbf{X}^{\mathrm{T}} \varepsilon}{n}\right\|_{\infty}\). If \(\beta^*\) is supported on a subset \(S(\beta^\ast)\) of cardinality \(s\), and the design matrix satisfies the REP with \((\gamma , 3)\), then any minimizer of the LASSO objective, \(\widehat{\beta}=\hat \beta_{n,\lambda}^{LASSO}\), satisfies \[ \frac{\left\|\mathbf{X}\left(\widehat{\beta}-\beta^*\right)\right\|_2^2}{n} \leq \frac{9 s \lambda^2 } {4\gamma} . \]
Remark
Note that the error is a type of conditional excess risk where the algorithm is only evaluated on the training data: \[ \frac 1 n \sum_{i=1}^n \mathbb E[ L(\tilde Y_i, \hat m(X_i) ) \mid \mathcal D_n] = n^{-1}{\left\|\mathbf{X}\left(\widehat{\beta}-\beta^*\right)\right\|_2^2} + \sigma^2, \] where \(\tilde Y_i\) denotes a new response drawn at the training point \(X_i\).
Remark
Hence \(\lambda = \sqrt{2}\sigma\sqrt{\frac{\tau \log p}{n}}\) is a valid choice with high probability, leading to the upper bound on the prediction error \[ \frac{\left\|\mathbf{X}\left(\widehat{\beta}-\beta^*\right)\right\|_2^2}{n} \leq \frac{9\sigma^2 s \tau \log p}{ 2 \gamma n}. \]
Remark
Interpretation: \(\hat{\beta}_\lambda\) behaves like the LS estimator in a model with \(s\) instead of \(p\) dimensions.
The LASSO estimator \(\hat{\beta}_\lambda\) has to ‘pay’ with a factor \(\log (p)\) for the missing insight which components are non-zero. This is a rather small price to pay even if \(p\) is large.
One can prove similar theoretical statements without the assumptions that \(\varepsilon \sim N\left(0, \sigma^2\right)\) and that \(\mathbf X\) is deterministic, while still preserving the small \(\log (p)\) term.
Regarding the REP: The smallest eigenvalue \(\lambda_{\min }(\Sigma)\) measures how strongly the components of \(X\) are correlated. Note that a strong correlation of \(X\) is a problem for estimation of \(\beta^*\), but not for the excess risk itself: In the extreme case \(X_1=X_2\), it is clear that \(\hat{\beta}\) cannot distinguish the values of \(\beta_1^*\) and \(\beta_2^*\), but it can still provide good predictions through \(X \hat{\beta}\). Unfortunately, the proof technique underlying the theorem transfers the estimation quality of \(\beta^*\) to an upper bound of the excess risk, therefore this fact is not adequately represented in the result.
Corollary
Assume that \(\lambda \geq 4\left\|\frac{\mathbf{X}^{\mathrm{T}} \varepsilon}{n}\right\|_{\infty}\). If \(\beta^*\) is supported on a subset \(S\) of cardinality \(s\), and the design matrix satisfies the REP with \((\gamma , 3)\), then any minimizer of the LASSO objective, \(\widehat{\beta}=\hat \beta_{n,\lambda}^{LASSO}\), satisfies
\[ \left\|\widehat{\beta}-\beta^*\right\|_2 \leq \frac{3}{2\gamma} \sqrt{s}\, \lambda. \]
Definition
The design matrix \(\mathbf X\) satisfies the mutual incoherence condition with parameter \(\gamma\) if for \(S=S(\beta^\ast)\) \[ \max _{j \in S^\complement}\left\|\left(\mathbf{X}_S^T \mathbf{X}_S\right)^{-1} \mathbf{X}_S^T \mathbf{x}_j\right\|_1 \leq 1-\gamma \]
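The quantity in the condition is easy to compute for a concrete design (a sketch; the design matrix and the support \(S\) are arbitrary choices here):

```python
import numpy as np

# Compute the mutual incoherence quantity
# max_{j in S^c} ||(X_S^T X_S)^{-1} X_S^T x_j||_1 for an illustrative design.
rng = np.random.default_rng(5)
n, p = 100, 8
X = rng.normal(size=(n, p))
S = [0, 1, 2]                                # assumed support S(beta*)
Sc = [j for j in range(p) if j not in S]

XS = X[:, S]
G_inv = np.linalg.inv(XS.T @ XS)
incoherence = max(np.sum(np.abs(G_inv @ XS.T @ X[:, j])) for j in Sc)
print(incoherence)   # the condition holds with gamma = 1 - incoherence if this is < 1
```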
Heuristic for the mutual incoherence condition:
Theorem
Assume \(\mathbf{X}\)
Then there are constants \(c_1,c_2\) such that with probability greater than \(1-c_1 e^{-c_2 n \lambda^2}\), the LASSO estimator \(\widehat{\beta}=\hat \beta_{n,\lambda}^{LASSO}\) has the following properties:
Example
Assume that \(\operatorname{corr}(X_{1},X_2)\approx 1\) and \(Y=0.5X_1+0.5X_2\). While the lasso will probably estimate \(\hat\beta_1\approx 1\), \(\hat\beta_2=0\), ridge will probably estimate the correct split \(\hat\beta_1\approx\hat\beta_2\approx 0.5\).
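The example can be reproduced numerically (a sketch; \(\lambda\), \(n\), and the size of the perturbation making \(X_2\approx X_1\) are illustrative choices; the lasso is computed by coordinate descent, whose tendency to zero out one of two nearly collinear covariates is exactly the behaviour described above):

```python
import numpy as np

# Two almost perfectly correlated covariates: ridge splits the weight
# roughly evenly, while coordinate-descent lasso concentrates it on one.
rng = np.random.default_rng(6)
n, lam = 500, 0.1
X1 = rng.normal(size=n)
X2 = X1 + 0.01 * rng.normal(size=n)          # corr(X1, X2) close to 1
X = np.column_stack([X1, X2])
Y = 0.5 * X1 + 0.5 * X2

# Ridge: closed form for (1/n)||Y - X b||^2 + lam * ||b||_2^2.
beta_ridge = np.linalg.solve(X.T @ X + lam * n * np.eye(2), X.T @ Y)

# Lasso: coordinate descent for (1/n)||Y - X b||^2 + lam * ||b||_1.
beta_lasso = np.zeros(2)
for _ in range(100):
    for j in range(2):
        r = Y - X @ beta_lasso + X[:, j] * beta_lasso[j]
        z = X[:, j] @ r / n
        beta_lasso[j] = np.sign(z) * max(abs(z) - lam / 2, 0.0) / (X[:, j] @ X[:, j] / n)

print(np.round(beta_ridge, 2), np.round(beta_lasso, 2))
```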
Definition (Elastic net)
For \(\lambda>0\), \(\alpha \in (0,1)\), and \(J_\lambda^{\text{elastic.net}}(\beta)= \lambda\sum_{j=1}^p \left(\alpha |\beta_j| + (1-\alpha) \beta_j^2\right)\), we call \[\hat{\beta}_\lambda^{\text {elastic.net }} \in \underset{\beta \in \mathbb{R}^p}{\arg \min }\left\{\hat{R}_n(\beta)+J^{\text{elastic.net}}_\lambda(\beta)\right\}\]
the elastic net estimator.