In this lecture we will look at the linear regression problem with squared loss \(L(y_1,y_2)=(y_1-y_2)^2\).
We already know that the Bayes rule is \(m^\ast(x)=\mathbb E[Y\mid X=x]\).
Linear Model Assumption
There are parameters \(\beta_0^\ast, \beta_1^\ast, \dots, \beta_p^\ast\) with \[ m^\ast(x)= \beta_0^\ast + \sum_{j=1}^p \beta_j^\ast x_j. \] In other words, we assume that \(m^\ast\) is a linear function, i.e., \(m^\ast \in \mathcal G=\{f: \mathbb R^{p}\to \mathbb R \mid f(x)=\beta_0 + \sum_{j=1}^p \beta_j x_j,\ \beta_0,\dots,\beta_p\in \mathbb R \}\).
First Implication of Linear Model Assumption
Given iid training data \((X_i,Y_i)\), \(i=1,\dots,n\), we have an additive noise model \[ Y_i= \beta_0^\ast + \sum_{j=1}^p \beta_j^\ast X_{ij}+ \varepsilon_i, \] with \(\varepsilon_i=Y_i-m^{\ast}(X_i)\); the \(\varepsilon_i\) are hence iid with \(\mathbb E[\varepsilon_i\mid X_i]=0.\)
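As a quick illustration, the additive noise model can be simulated as follows (a minimal sketch; the coefficients, noise level, and covariate distribution are arbitrary choices):

```python
import numpy as np

# Simulate iid data from the linear model Y_i = beta_0 + sum_j beta_j X_ij + eps_i.
# beta0_star, beta_star and sigma are illustrative choices, not from the lecture.
rng = np.random.default_rng(0)
n, p = 1000, 3
beta0_star = 1.0
beta_star = np.array([2.0, -1.0, 0.5])
sigma = 0.3

X = rng.normal(size=(n, p))                # iid covariates
eps = rng.normal(scale=sigma, size=n)      # noise with E[eps | X] = 0
Y = beta0_star + X @ beta_star + eps       # additive noise model

# Sanity check: the empirical mean of the noise is close to zero.
print(abs(eps.mean()))
```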
By assuming that \(\mathbb E[Y\mid X=x]\) lies in the class \(\mathcal G\) of linear functions, we only have to deal with the estimation error and can ignore the inductive bias/approximation error.
An alternative is to consider \(\beta^\ast= \Sigma^{-1} \mathbb E[YX], \Sigma=\mathbb E[XX^T]\), i.e., the best linear predictor without assuming that \(\mathbb E[Y|X]\) is linear.
Lemma (Coefficients in linear model)
Under the linear model assumption, we have for \(j=1,\dots,p\) \[ \beta_j^\ast= \frac{\operatorname{Cov}(X_{1j}, Y_1-\sum_{k\in\{1,\dots,p\}\setminus \{j\}}\beta_k^\ast X_{1k})}{\operatorname{Var}(X_{1j})}. \] In particular, if the components of \(X\) are uncorrelated, we have \[ \beta_j^\ast= \frac{\operatorname{Cov}(X_{1j}, Y_1)}{\operatorname{Var}(X_{1j})}. \] Proof: From the linear model assumption we have \(\operatorname{Cov}(X_{1j},Y_1)=\sum_{k\in\{1,\dots,p\}}\beta_k^\ast \operatorname{Cov}(X_{1j},X_{1k})\); moving the terms with \(k\neq j\) back into the covariance and dividing by \(\operatorname{Var}(X_{1j})\) gives the claim.
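The uncorrelated case can be checked numerically (a sketch with independent, hence uncorrelated, standard normal covariates; population covariances are replaced by empirical ones, so the match is only approximate):

```python
import numpy as np

# Check beta_j = Cov(X_j, Y) / Var(X_j) for uncorrelated covariates.
# beta_star and the sample size are illustrative choices.
rng = np.random.default_rng(1)
n, p = 200_000, 3
beta_star = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(n, p))        # independent components, hence uncorrelated
Y = X @ beta_star + rng.normal(size=n)

# Empirical Cov(X_j, Y) / Var(X_j) for each component.
ratios = np.array([np.cov(X[:, j], Y)[0, 1] / X[:, j].var() for j in range(p)])
print(np.round(ratios, 2))         # close to beta_star
```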
Lemma (Bayes risk)
Under the linear model assumption, and assuming \(\operatorname{Var}(X_{ij})>0\), we have
In the following we write \[\mathbf{X}=\left(\begin{array}{ccc}X_{11} & \ldots & X_{1 p} \\ \vdots & \ddots & \vdots \\ X_{n 1} & \ldots & X_{n p}\end{array}\right), \quad \mathbf{Y}=\left(\begin{array}{c}Y_1 \\ \vdots \\ Y_n\end{array}\right), \quad \mathbf{\varepsilon}=\left(\begin{array}{c}\varepsilon_1 \\ \vdots \\ \varepsilon_n\end{array}\right)\]
Lemma (Least squares estimator)
Theorem (Excess risk Least squares estimator)
If \(\mathbf{X}^{T}\mathbf{X}\) is invertible, then \[ \mathbb E[R\left(\hat{ m}^{LS}_n \right) | \mathbf X]-r\left(m^\ast\right)=\frac{\sigma^2}{n} \cdot \operatorname{tr}\left(\Sigma \hat{\Sigma}^{-1}\right), \] where \(\Sigma=\mathbb E[XX^T]\) and \(\hat \Sigma=n^{-1}\mathbf X^T\mathbf X\).
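The identity can be verified by Monte Carlo, holding one design \(\mathbf X\) fixed and averaging the excess risk \((\hat\beta-\beta^\ast)^T\Sigma(\hat\beta-\beta^\ast)\) over repeated noise draws (all parameters below are illustrative choices):

```python
import numpy as np

# Monte Carlo check of E[R(m_LS) | X] - r(m*) = sigma^2/n * tr(Sigma Sigma_hat^{-1}).
rng = np.random.default_rng(2)
n, p, sigma = 100, 5, 1.0
Sigma = np.eye(p)                              # population covariance of X (chosen here)
X = rng.normal(size=(n, p))                    # one fixed design
Sigma_hat = X.T @ X / n
beta_star = rng.normal(size=p)

theory = sigma**2 / n * np.trace(Sigma @ np.linalg.inv(Sigma_hat))

reps, acc = 2000, 0.0
for _ in range(reps):
    Y = X @ beta_star + rng.normal(scale=sigma, size=n)
    beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]    # least squares fit
    d = beta_hat - beta_star
    acc += d @ Sigma @ d                               # conditional excess risk of this fit
print(acc / reps, theory)                              # the two should be close
```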
Up until now we have assumed that \(\mathbf{X}\) has full rank. This is not very realistic for very large \(p\) \((p \gg n)\). Even if \(\mathbf{X}^T\mathbf{X}\) is invertible, the variance of the estimation error might be too large.
This leads us to penalized models that reduce variance by adding some bias.
Other alternatives (not discussed here) are dimension reduction, say via PCA, or feature selection, say forward stepwise regression.
Definition (Ridge regression)
Let \(\lambda \geq 0\) and \[J_\lambda(\beta)=\lambda\|\beta\|_2^2=\lambda \sum_{j=1}^p \beta_j^2 .\] The ridge estimator is defined as \[\begin{aligned} \hat{\beta}_\lambda^{\text {ridge }} & =\underset{\beta \in \mathbb{R}^p}{\arg \min }\left\{\hat{R}_n(\beta)+J_\lambda(\beta)\right\} \\ & =\underset{\beta \in \mathbb{R}^p}{\arg \min }\left\{\frac{1}{n}\|\mathbf{Y}-\mathbf{X} \beta\|_2^2+\lambda\|\beta\|_2^2\right\} . \end{aligned}\] The corresponding decision function is \[\hat{m}_{n, \lambda}^{\text {ridge }}(x)=\sum_{j=1}^p \hat{\beta}^\text{ridge}_{\lambda, j} x_j .\]
Lemma (Ridge regression solution)
Let \(\lambda>0\). Then \[ \hat{\beta}^\text{ridge}_\lambda=\left(\mathbf{X}^T \mathbf{X}+\lambda n I_{p \times p}\right)^{-1} \mathbf{X}^T \mathbf{Y}. \] Proof. Set the gradient of \(\beta \mapsto \hat{R}_n(\beta)+J_\lambda(\beta)\) equal to zero and solve for \(\beta\).
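A quick numerical sanity check of the closed form (a sketch, assuming \(\hat R_n(\beta)=\frac1n\|\mathbf Y-\mathbf X\beta\|_2^2\)): at the claimed minimizer the gradient of the penalized objective must vanish.

```python
import numpy as np

# Verify the ridge closed form: the gradient of
# (1/n)||Y - X beta||^2 + lam * ||beta||^2 vanishes at the claimed solution.
rng = np.random.default_rng(3)
n, p, lam = 50, 10, 0.5
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

beta_ridge = np.linalg.solve(X.T @ X + lam * n * np.eye(p), X.T @ Y)

grad = -2 / n * X.T @ (Y - X @ beta_ridge) + 2 * lam * beta_ridge
print(np.max(np.abs(grad)))   # numerically zero
```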
In ridge regression, the matrix \(\mathbf{X}^T \mathbf{X}\) is ‘made invertible’ by adding a positive multiple of the identity matrix. Therefore, the ridge estimator also can be used in the case \(p>n\).
The name ‘ridge’ stems from the fact that the optimization problem is equivalent to \[ \min _{ \beta \in \mathbb{R}^p} \hat{R}_n(\beta) \quad \text { s.t. } \quad\|\beta\|_2 \leq t \] for some suitable \(t>0\).
Theorem (Excess risk for ridge regression estimator)
Under the linear model, \[ \mathbb E[ R\left(\hat{m}^{n,\text{ridge}}_\lambda \right)|X]-r\left(m^*\right)=\frac{\sigma^2}{n} \cdot \operatorname{tr}\left(\Sigma\left(\hat{\Sigma}+\lambda I_{p \times p}\right)^{-1} \hat{\Sigma}\left(\hat{\Sigma}+\lambda I_{p \times p}\right)^{-1}\right)+\lambda^2\left\|\Sigma^{1 / 2}\left(\hat{\Sigma}+\lambda I_{p \times p}\right)^{-1} \beta^*\right\|_2^2. \] Let \(\Sigma=U D U^T\) be the spectral decomposition of \(\Sigma\) with orthogonal matrix \(U\) and diagonal matrix \(D=\operatorname{diag}\left(s_1, \ldots, s_p\right.\) ) (entries are the eigenvalues of \(\Sigma\) ). By assuming \(\hat \Sigma= \Sigma\), the excess risk simplifies to \[ \frac{\sigma^2}{n} \sum_{j=1}^p \frac{s_j^2}{\left(s_j+\lambda\right)^2}+\lambda^2 \cdot \sum_{j=1}^p \frac{s_j\left(U^T \beta^*\right)_j^2}{\left(s_j+\lambda\right)^2}. \]
Remark (Excess risk for ridge regression estimator)
If all eigenvalues are equal, that is, \(s_j=s\), and if additionally \(\left(U^T \beta^*\right)_j=b\) \((j=1, \ldots, p)\), then the expression of the theorem simplifies to \[ \frac{\sigma^2 p}{n} \cdot \frac{s^2}{(s+\lambda)^2}+\lambda^2 \frac{s b^2 p}{(s+\lambda)^2}, \] which is minimized at \(\lambda=\frac{\sigma^2/n}{b^2}\) with minimal value \[ \frac{\sigma^2 p}{n} \cdot \frac{b^2 s}{\frac{\sigma^2}{n}+b^2 s} \leq \frac{\sigma^2 p}{n} . \] We see that for a suitable choice of the penalization parameter \(\lambda\), the excess Bayes risk of the ridge estimator can be smaller than the corresponding upper bound of the LS estimator.
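The bias–variance tradeoff in \(\lambda\) and the optimal choice can be checked numerically (the values of \(\sigma, s, b, n, p\) are arbitrary illustrative choices):

```python
import numpy as np

# Scan the simplified excess-risk curve in the equal-eigenvalue case and
# compare the grid minimizer with the closed-form lambda = (sigma^2/n)/b^2.
sigma, s, b, n, p = 1.0, 2.0, 0.5, 100, 10

def excess(lam):
    # variance term + bias term from the remark
    return (sigma**2 * p / n) * s**2 / (s + lam)**2 \
        + lam**2 * s * b**2 * p / (s + lam)**2

lams = np.linspace(1e-4, 1.0, 10_000)
lam_best = lams[np.argmin(excess(lams))]
lam_formula = (sigma**2 / n) / b**2
print(lam_best, lam_formula)   # grid minimizer vs. closed-form lambda
```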
If we believe that some covariates are pure noise, i.e., unrelated to \(Y\), the most obvious choice to penalize \(\beta\) would be of the form \(\|\beta\|_0=\#\{j\in\{1, \ldots, p\}: \beta_j \neq 0\}\), i.e., one would simply penalize the number of non-zero entries of \(\beta\). However, this leads to an NP-hard optimization problem whose solutions are not accessible in practice. One therefore uses a different norm which has similar properties but leads to a convex optimization problem.
Definition (Lasso - Least absolute shrinkage and selection operator regression)
Let \(\lambda \geq 0\) and \[ J_\lambda(\beta)=\lambda \cdot\|\beta\|_1=\lambda \sum_{j=1}^p\left|\beta_j\right| . \] The LASSO estimator is given by \[ \begin{aligned} \hat{\beta}_\lambda^{\text {lasso }} & \in \underset{\beta \in \mathbb{R}^p}{\arg \min }\left\{\hat{R}_n(\beta)+J_\lambda(\beta)\right\} \\ & =\underset{\beta \in \mathbb{R}^p}{\arg \min }\left\{\frac{1}{n}\|\mathbf{Y}-\mathbf{X} \beta\|_2^2+\lambda \cdot\|\beta\|_1\right\} \end{aligned} \] The corresponding decision function reads \[\hat{m}_{n, \lambda}^{\text{lasso}}(x)=\sum_{j=1}^p \hat{\beta}^{\text{lasso}}_{\lambda, j} x_j .\]
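A common way to compute the LASSO estimator is cyclic coordinate descent with soft-thresholding; the following is a minimal, untuned sketch for the objective above (the data and \(\lambda\) are illustrative choices):

```python
import numpy as np

# Minimal coordinate-descent solver for (1/n)||Y - X beta||_2^2 + lam * ||beta||_1.

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, Y, lam, n_sweeps=200):
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            r_j = Y - X @ beta + X[:, j] * beta[j]   # partial residual without coord j
            z = X[:, j] @ r_j / n
            # coordinate-wise minimizer: soft-threshold at lam/2
            beta[j] = soft_threshold(z, lam / 2) / (X[:, j] @ X[:, j] / n)
    return beta

rng = np.random.default_rng(4)
n, p = 200, 10
beta_star = np.zeros(p); beta_star[:3] = [3.0, -2.0, 1.5]   # sparse truth
X = rng.normal(size=(n, p))
Y = X @ beta_star + 0.1 * rng.normal(size=n)

beta_hat = lasso_cd(X, Y, lam=0.1)
print(np.round(beta_hat, 2))   # large entries on the true support, rest (near) zero
```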
Definition
For \(\beta \in \mathbb{R}^p\), define \[S(\beta):=\left\{j \in\{1, \ldots, p\}: \beta_j \neq 0\right\} .\] For \(S \subset\{1, \ldots, p\}\) and \(v \in \mathbb{R}^p\), put \(v_S:=\left(v_j \mathbb{1}_{\{j \in S\}}\right)_{j=1, \ldots, p}\)
If \(p \ll n\), then \(\hat{\Sigma}\) would usually be invertible and the smallest eigenvalue (Rayleigh quotient) would satisfy \[\lambda_{\min }(\hat{\Sigma}):=\inf _{v \in \mathbb{R}^p} \frac{v^T \hat{\Sigma} v}{\|v\|_2^2}>0 .\]
Then \(\hat{\Sigma}\) would be one-to-one (injective) and the linear equation system \(\hat{\Sigma} \beta=\frac{1}{n} \mathbf{X}^T \mathbf{Y}\) would lead to the (unique) least squares estimator.
For \(p \gg n\), one has \(\lambda_{\min }(\hat{\Sigma})=0\).
When employing LASSO, we are usually only interested in estimators \(\hat{\beta}\) with non-zero entries only at the components in \(S\left(\beta^*\right)\). This means that in principle we only need injectivity of \(\hat{\Sigma}\) on the set \[ \tilde{C}=\left\{\beta \in \mathbb{R}^p: S(\beta)=S\left(\beta^*\right)\right\}=\left\{\beta \in \mathbb{R}^p:\left\|\beta_{S\left(\beta^*\right)^ c}\right\|_1=0\right\}, \] or equivalently, \(\inf _{v \in \tilde{C}} \frac{v^T \hat{\Sigma} v}{\|v\|_2^2}=\inf _{v \in \tilde{C}} \frac{v^T \hat{\Sigma} v}{\left\|v_{S\left(\beta^*\right)}\right\|_2^2}>0 .\)
Definition (Restricted eigenvalue property (REP))
We say that the restricted eigenvalue property (REP) is satisfied with \((\gamma,\alpha)\) if for \[ C:=\left\{\beta \in \mathbb{R}^p:\left\|\beta_{S\left(\beta^*\right)^c}\right\|_1 \leq \alpha \left\|\beta_{S\left(\beta^*\right)}\right\|_1\right\} \] it holds that \[ \Lambda_{\min }(\hat\Sigma):=\inf _{v \in C} \frac{v^T \hat\Sigma v}{\left\|v_{S\left(\beta^*\right)}\right\|_2^2}>\gamma. \]
Theorem
Assume that \(\lambda \geq 4\left\|\frac{\mathbf{X}^{\mathrm{T}} \varepsilon}{n}\right\|_{\infty}\). If \(\beta^*\) is supported on a subset \(S(\beta^\ast)\) of cardinality \(s\), and the design matrix satisfies the REP with \((\gamma , 3)\), then any minimizer of the LASSO objective, \(\widehat{\beta}=\hat \beta_{n,\lambda}^{LASSO}\), satisfies \[ \frac{\left\|\mathbf{X}\left(\widehat{\beta}-\beta^*\right)\right\|_2^2}{n} \leq \frac{9 s \lambda^2 } {4\gamma} . \]
Remark
Note that the error is a type of conditional excess risk where the algorithm is only evaluated on the training data: \[ \frac 1 n \sum_{i=1}^n \mathbb E[ L(\tilde Y_i, \hat m(X_i) ) \mid \mathcal D_n] = n^{-1}{\left\|\mathbf{X}\left(\widehat{\beta}-\beta^*\right)\right\|_2^2} + \sigma^2, \] where \(\tilde Y_i\) denotes a new response drawn at the training point \(X_i\).
Remark
Hence \(\lambda = \sqrt{2}\sigma\sqrt{\frac{\tau \log p}{n}}\) is a valid choice with high probability, leading to the upper bound on the prediction error \[ \frac{\left\|\mathbf{X}\left(\widehat{\beta}-\beta^*\right)\right\|_2^2}{n} \leq \frac{9\sigma^2 s \tau \log p}{ 2 \gamma n}. \]
Remark
Interpretation: \(\hat{\beta}_\lambda\) behaves like the LS estimator in a model with \(s\) instead of \(p\) dimensions.
The LASSO estimator \(\hat{\beta}_\lambda\) has to ‘pay’ with a factor \(\log (p)\) for the missing insight which components are non-zero. This is a rather small price to pay even if \(p\) is large.
One can prove similar theoretical statements without the assumptions that \(\varepsilon \sim N\left(0, \sigma^2\right)\) and that \(\mathbf X\) is deterministic, while still preserving the small \(\log (p)\) term.
Regarding the REP: The smallest eigenvalue \(\lambda_{\min }(\Sigma)\) measures how strongly the components of \(X\) are correlated. Note that a strong correlation of \(X\) is a problem for estimation of \(\beta^*\), but not for the excess risk itself: In the extreme case \(X_1=X_2\), it is clear that \(\hat{\beta}\) cannot distinguish the values of \(\beta_1^*\) and \(\beta_2^*\), but it can still provide good predictions through \(X \hat{\beta}\). Unfortunately, the proof technique underlying the theorem transfers the estimation quality of \(\beta^*\) to an upper bound of the excess risk, therefore this fact is not adequately represented in the result.
Corollary
Assume that \(\lambda \geq 4\left\|\frac{\mathbf{X}^{\mathrm{T}} \varepsilon}{n}\right\|_{\infty}\). If \(\beta^*\) is supported on a subset \(S\) of cardinality \(s\), and the design matrix satisfies the REP with \((\gamma , 3)\), then any minimizer of the LASSO objective, \(\widehat{\beta}=\hat \beta_{n,\lambda}^{LASSO}\), satisfies
\[ \left\|\widehat{\beta}-\beta^*\right\|_2 \leq \frac{3}{2\gamma} \sqrt{s}\, \lambda. \]
Definition
The design matrix \(\mathbf X\) satisfies the mutual incoherence condition with parameter \(\gamma\) if for \(S=S(\beta^\ast)\) \[ \max _{j \in S^\complement}\left\|\left(\mathbf{X}_S^T \mathbf{X}_S\right)^{-1} \mathbf{X}_S^T \mathbf{x}_j\right\|_1 \leq 1-\gamma \]
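The quantity in the condition is easy to compute for a concrete design (a sketch; the design matrix and the support \(S\) are arbitrary choices here):

```python
import numpy as np

# Compute the mutual incoherence quantity
# max_{j in S^c} ||(X_S^T X_S)^{-1} X_S^T x_j||_1 for an illustrative design.
rng = np.random.default_rng(5)
n, p = 100, 8
X = rng.normal(size=(n, p))
S = [0, 1, 2]                                # assumed support S(beta*)
Sc = [j for j in range(p) if j not in S]

XS = X[:, S]
G_inv = np.linalg.inv(XS.T @ XS)
incoherence = max(np.sum(np.abs(G_inv @ XS.T @ X[:, j])) for j in Sc)
print(incoherence)   # the condition holds with gamma = 1 - incoherence if this is < 1
```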
Heuristic for the mutual incoherence condition:
Theorem
Assume \(\mathbf{X}\)
Then there are constants \(c_1,c_2\) such that with probability greater than \(1-c_1 e^{-c_2 n \lambda^2}\), the LASSO estimator \(\widehat{\beta}=\hat \beta_{n,\lambda}^{LASSO}\) has the following properties:
Example
Assume that \(\operatorname{corr}(X_{1},X_2)\approx 1\) and \(Y=0.5X_1+0.5X_2\). While the lasso will probably estimate \(\hat\beta_1\approx 1\), \(\hat\beta_2=0\), ridge will probably estimate the correct split \(\hat\beta_1\approx\hat\beta_2\approx 0.5\).
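The example can be reproduced numerically (a sketch; \(\lambda\), \(n\), and the size of the perturbation making \(X_2\approx X_1\) are illustrative choices; the lasso is computed by coordinate descent, whose tendency to zero out one of two nearly collinear covariates is exactly the behaviour described above):

```python
import numpy as np

# Two almost perfectly correlated covariates: ridge splits the weight
# roughly evenly, while coordinate-descent lasso concentrates it on one.
rng = np.random.default_rng(6)
n, lam = 500, 0.1
X1 = rng.normal(size=n)
X2 = X1 + 0.01 * rng.normal(size=n)          # corr(X1, X2) close to 1
X = np.column_stack([X1, X2])
Y = 0.5 * X1 + 0.5 * X2

# Ridge: closed form for (1/n)||Y - X b||^2 + lam * ||b||_2^2.
beta_ridge = np.linalg.solve(X.T @ X + lam * n * np.eye(2), X.T @ Y)

# Lasso: coordinate descent for (1/n)||Y - X b||^2 + lam * ||b||_1.
beta_lasso = np.zeros(2)
for _ in range(100):
    for j in range(2):
        r = Y - X @ beta_lasso + X[:, j] * beta_lasso[j]
        z = X[:, j] @ r / n
        beta_lasso[j] = np.sign(z) * max(abs(z) - lam / 2, 0.0) / (X[:, j] @ X[:, j] / n)

print(np.round(beta_ridge, 2), np.round(beta_lasso, 2))
```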
Definition (Elastic net)
For \(\lambda>0\), \(\alpha \in (0,1)\), and \(J_\lambda^{\text{elastic.net}}(\beta)= \lambda\sum_{j=1}^p \left(\alpha |\beta_j| + (1-\alpha) \beta_j^2\right)\), we call \[\hat{\beta}_\lambda^{\text {elastic.net }} \in \underset{\beta \in \mathbb{R}^p}{\arg \min }\left\{\hat{R}_n(\beta)+J^{\text{elastic.net}}_\lambda(\beta)\right\}\]
the elastic net estimator.