Linear models are quite restrictive, and one may ask how more flexibility can be achieved.
In this lecture we look at the nonparametric regression problem with squared loss \(L(y_1,y_2)=(y_1-y_2)^2\).
We already know that the Bayes rule is \(m^\ast(x)=\mathbb E[Y|X=x]\).
Definition k-nearest-neighbour
The k-nearest-neighbor estimator is \[\hat m^{knn}(x)= \frac 1 k \sum_{i \in \mathcal N_k(x)}Y_i,\] where \(\mathcal N_k(x)\subseteq \{1,\dots,n\}\) contains the indices of the \(k\) closest points of \(\{X_1, \dots, X_n\}\) to \(x\).
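The definition above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the lecture; the simulated data and function names are my own choices.

```python
import numpy as np

def knn_estimate(x, X, Y, k):
    """k-nearest-neighbour regression estimate at a single point x."""
    dists = np.linalg.norm(X - x, axis=1)   # distances to the n training points
    idx = np.argsort(dists)[:k]             # indices of the k closest points, N_k(x)
    return Y[idx].mean()                    # average the corresponding responses

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))        # n = 200 points, p = 1
Y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.1, size=200)

print(knn_estimate(np.array([0.25]), X, Y, k=10))  # should be close to sin(pi/2) = 1
```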
Definition linear smoother
An estimator is called a linear smoother if it can be written as \[\hat m(x)= \sum_{i=1}^n w_i(x)Y_i,\] where the weight functions \(w_i\) can depend on \(\{X_1, \dots, X_n\}\).
Example (k-nearest-neighbour)
The k-nearest-neighbor estimator is a linear smoother: \[\hat m^{knn}(x)= \sum_{i=1}^n w_i(x) Y_i,\] with \[w_i(x)=\begin{cases}\frac 1 k & X_i \ \text{belongs to the}\ k \ \text{closest points to}\ x \\ 0 & \text{otherwise}\end{cases}\]
Proposition (MSE k-nearest-neighbour)
Assume that \(\mathbb E[Y|X=x]=m^\ast(x)\in \mathcal G_L = \{m: \mathbb R^p \to \mathbb R \mid m \ \text{is } L\text{-Lipschitz continuous}\}\) and \(\operatorname{Var}(Y|X=x)=\sigma^2(x)\leq \sigma^2.\) Then, for some constant \(c>0\), \[\mathbb E[(\hat m^{knn}(x)-m^\ast(x))^2]\leq (cL)^2 \left(\frac k n \right)^{2/p}+\frac {\sigma^2}k.\] In particular, for \(k_n=O_p( n^{2/(2+p)})\), we get \[\mathbb E[(\hat m^{knn}(x)-m^\ast(x))^2]=O_p(n^{-2/(2+p)}).\]
Proof MSE k-nearest-neighbour
Write \(\mathbf X=(X_1,\dots, X_n)\) and denote by \(Y_i^{(x)}\) the response belonging to the \(i\)th closest point to \(x\) among \(X_1,\dots, X_n\). Using the bias-variance decomposition conditional on \(\mathbf X\), \[\begin{aligned}\mathbb E[(\hat m^{knn}(x)-m^\ast(x))^2]& =\mathbb E \Big [{\left(\mathbb{E}[\hat m^{knn}(x)|\mathbf X]-m^\ast(x)\right)^2}\Big]+{\mathbb{E}\Big[(\hat m^{knn}(x)-\mathbb{E}[\hat m^{knn}(x)|\mathbf X])^2\Big]}\\ &=\mathbb E\left[ \Big \{\frac{1}{k} \sum_{i \in \mathcal{N}_k(x)}\left(m^\ast\left(X_i\right)-m^\ast(x)\right)\Big \}^2\right] \\ &\quad + \frac 1 {k^2} \mathbb E \left[ \sum_{i=1}^k \sum_{j=1}^k \{Y_i^{(x)}- \mathbb E[Y_i^{(x)}| \mathbf X]\}\{Y_j^{(x)}- \mathbb E[Y_j^{(x)}| \mathbf X]\} \right] \\ & \leq \mathbb E\left[ \left(\frac{L}{k} \sum_{i \in \mathcal{N}_k(x)}\left\|X_i-x\right\|_2\right)^2 \right]+\frac{\sigma^2}{k} \\ &\leq^{(1)} (cL)^2 \left(\frac k n \right)^{2/p}+\frac {\sigma^2}k,\\ \end{aligned}\] where the cross terms \(i\neq j\) vanish because the responses are conditionally independent given \(\mathbf X\), and \((1)\) uses that the distance from \(x\) to its \(k\)th nearest neighbour is of order \((k/n)^{1/p}\) for some constant \(c\).
We have learned that the knn estimator can be written as \[\hat m^{knn}(x)= \sum_{i=1}^n w_i(x) Y_i.\]
Note that \(w_i(x)\) is not smooth as a function of \(x\). This also makes the estimator not smooth. Given that we assume that \(m^\ast\) is smooth, this may not be desirable.
An alternative is the kernel smoother.
Definition kernel smoother
The kernel smoother is a linear smoother with \[\hat m^{ks}(x)= \sum_{i=1}^n w_i(x) Y_i,\] where \[w_i(x)=\frac{K\left(\frac{\|x-X_i\|}{h}\right)}{\sum_{j=1}^n K\left(\frac{\|x-X_j\|}{h}\right)}.\] Often \(K\) is a product kernel, such that the weight of \(X_i\) is proportional to \(\prod_{j=1}^p k\left(\frac{x_j-X_{ij}}{h}\right)\). The function \(k: \mathbb R \to \mathbb R\) is usually a symmetric density function.
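As a minimal sketch of this definition (my own illustration, assuming a Gaussian choice of \(k\); the data and function names are not from the lecture):

```python
import numpy as np

def kernel_smoother(x, X, Y, h):
    """Nadaraya-Watson estimate at x with a Gaussian kernel and bandwidth h."""
    d = np.linalg.norm(X - x, axis=1)
    K = np.exp(-0.5 * (d / h) ** 2)     # unnormalised Gaussian kernel weights
    w = K / K.sum()                     # normalised weights w_i(x), summing to 1
    return w @ Y                        # linear smoother: sum_i w_i(x) Y_i

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
Y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.1, size=200)

print(kernel_smoother(np.array([0.25]), X, Y, h=0.05))
```

Unlike the knn weights, these weights vary smoothly in \(x\), so the resulting estimator is smooth.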
Proposition (MSE kernel smoother)
Assume that \(\mathbb E[Y|X=x]=m^\ast(x)\in \mathcal G_L = \{m: \mathbb R^p \to \mathbb R \mid m \ \text{is } L\text{-Lipschitz continuous}\}\) and \(\operatorname{Var}(Y|X=x)=\sigma^2(x)\leq \sigma^2.\) Then \[\mathbb E[(\hat m^{ks}(x)-m^\ast(x))^2]= O_p\left( \frac{1}{nh^p} + h^2 \right).\] In particular, for \(h_n=O_p(n^{-1/(2+p)})\), we get \[\mathbb E[(\hat m^{ks}(x)-m^\ast(x))^2]=O_p(n^{-2/(2+p)}).\]
We have seen that under a Lipschitz condition, both kernel smoother and knn have an asymptotic mean squared error of order \(n^{-2/(2+p)}\).
One can show that under the assumption that \(m^\ast\) is twice continuously differentiable, the rate for both methods can be improved to \[n^{-4/(4+p)}.\]
But this rate still deteriorates rapidly as \(p\) grows: the number of observations required to reach a fixed accuracy grows exponentially in \(p\) (the curse of dimensionality). Furthermore, it has been shown that no method can achieve a better rate under the given assumptions.
Here is an example:
We first generate 100 observations of the \(x_1\) variable which is uniformly distributed on [0,10].
A total of 34 observations are within a distance of 2 units from the middle point 5.
We generate \(x_2\) independent of \(x_1\) and uniformly distributed on \([0,10]\).
We find 12 observations within a distance of 2 units from the middle point (5,5).
The \(x_3\) variable is independent of \((x_1,x_2)\) and uniformly distributed on \([0,10]\).
3D plot
Only 1 observation falls within a distance of 2 units from the middle point (5,5,5).
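The simulation above can be reproduced as follows (a sketch; the exact counts depend on the random seed, so my run need not return the 34/12/1 of the lecture, but the count necessarily shrinks as the dimension grows):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = rng.uniform(0, 10, size=(n, 3))     # x1, x2, x3 i.i.d. uniform on [0, 10]

# Count points within distance 2 of the middle point in dimensions p = 1, 2, 3.
for p in (1, 2, 3):
    dist = np.linalg.norm(X[:, :p] - 5.0, axis=1)
    count = int((dist <= 2).sum())
    print(f"p={p}: {count} of {n} observations within distance 2 of the middle point")
```

Since the Euclidean distance can only grow when coordinates are added, the counts are non-increasing in \(p\): the same ball of radius 2 covers an exponentially shrinking fraction of \([0,10]^p\).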
There are two ways to tackle the curse of dimensionality:
1. Assume that the intrinsic dimension is lower.
2. Assume that interactions are limited, so that structure can be exploited.
Interpretability
Structure is also essential for interpretability.
Definition (Splines)
A \(k\)th-order spline with knotpoints \(x_1 <\dots<x_l\) is a piecewise polynomial of degree \(k\) on the intervals determined by the knotpoints that is \((k-1)\) times continuously differentiable at each knotpoint.
Example (spline basis)
A \(k\)th-order spline \(m\) with \(l\) knotpoints can be uniquely written as \[m(x)=\sum_{j=1}^{k+1+l}\theta_jg_j(x),\] for example with the truncated power basis \(g_j(x)=x^{j-1}\) for \(j=1,\dots,k+1\) and \(g_{k+1+j}(x)=(x-x_j)_+^k\) for \(j=1,\dots,l\).
Definition (Natural splines)
A \(k\)th-order natural spline (\(k\) an odd number) with knotpoints \(x_1 <\dots<x_l\) is a \(k\)th-order spline that reduces to a polynomial of degree \((k-1)/2\) on \((-\infty,x_1]\) and \([x_l,\infty)\).
We still assume the one-dimensional case, \(p=1\).
Instead of looking at observations \((X_i,Y_i)_{i=1,\dots,n}\) we can consider a natural splines basis and look at observations \((g_1(X_i), \dots, g_{l+1}(X_i), Y_i)_{i=1,\dots,n}.\)
By doing so we are able to approximate any natural spline in \(x\) instead of just linear functions in \(x\), while still being in a linear regression framework, i.e., \[\hat \beta = {\arg \min }_\beta \sum_i (Y_i - { G_i^T}\beta )^2= (\mathbf G^T\mathbf G)^{-1}\mathbf G^T\mathbf Y\] where \(G_i^T=(g_1(X_i), \dots, g_{l+1}(X_i))\), the rows of \(\mathbf G\).
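As a hedged sketch of this basis-expansion idea (for brevity I use an ordinary truncated-power cubic spline basis rather than the natural one, and simulated data of my own choosing):

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Truncated-power cubic basis: 1, x, x^2, x^3 and (x - t)_+^3 per knotpoint.
    (An ordinary cubic spline basis; the natural-spline constraints are omitted
    here to keep the illustration short.)"""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.maximum(x - t, 0.0) ** 3 for t in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(2)
X = np.sort(rng.uniform(0, 1, 100))
Y = np.sin(2 * np.pi * X) + rng.normal(0, 0.1, 100)

G = cubic_spline_basis(X, np.linspace(0.1, 0.9, 5))   # design matrix with rows G_i^T
beta, *_ = np.linalg.lstsq(G, Y, rcond=None)          # beta_hat = (G^T G)^{-1} G^T Y
print(np.mean((G @ beta - Y) ** 2))                   # in-sample mean squared error
```

The fit is still ordinary least squares; only the design matrix has changed from raw covariates to basis evaluations.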
Problem: How do we choose the number of knotpoints \(l\) and their position?
Theorem (smoothing splines)
Among all twice differentiable functions on \([a,b]\), the solution \(m\) to \[\arg\min_{m} \sum_i (Y_i-m(X_i))^2+\lambda\int_a^b m''(x)^2\mathrm dx\] is a natural spline of order 3 (a natural cubic spline) with knotpoints \(X_1,\dots,X_n\).
Proof
Given a minimizer \(\tilde m\), we construct the unique natural cubic spline \(m\) on \([a,b]\) with knotpoints \(X_1,\dots, X_n\) satisfying \(m(X_i)=\tilde m(X_i)\). We will show that \(\tilde m=m\). Define \(h=\tilde m-m\). Note that \(h(X_j)=0\) for \(j=1,\dots,n\), that \(m''(x)=m'''(x)=0\) for \(x<X_1\) and \(x>X_n\), and that \(m'''\) is constant on each interval \((X_j,X_{j+1})\). Hence, by applying integration by parts twice (the boundary terms vanish since \(m''(a)=m''(b)=0\)), we get \[\begin{aligned} \int_a^b m''(x)h''(x)\mathrm dx &= -\int_{X_1}^{X_n}m'''(x)h'(x)\mathrm dx \\ &= -\sum_{j=1}^{n-1}m'''(x)h(x)\Big|_{X_j}^{X_{j+1}} \\ &=0. \end{aligned}\] This implies \[\int_a^b \tilde m''(x)^2\mathrm dx= \int_a^b \{m''(x)+h''(x)\}^2\mathrm dx=\int_a^b m''(x)^2+h''(x)^2\,\mathrm dx,\] meaning \[\int_a^b m''(x)^2\mathrm dx \leq \int_a^b \tilde m''(x)^2\mathrm dx,\] with equality only for \(h''=0\), i.e., \(h\) linear. Since \(h(X_i)=0\) for \(i=1,\dots, n\), we have \(h=0\) and so \(\tilde m=m\).
We can conclude the following: expanding \(m(x)=\sum_j \beta_j g_j(x)\) in a natural cubic spline basis with knotpoints \(X_1,\dots,X_n\), the smoothing spline problem reduces to the finite-dimensional optimization \[\hat \beta = \arg\min_{\beta} \sum_i (Y_i- G_i^T\beta)^2+\lambda\int_0^1 \left\{{\sum_j \beta_jg_j''}(x)\right\}^2\mathrm dx.\]
Smoothing splines can be seen as a special case of generalized ridge regression:
\[\begin{aligned} &\arg\min_{\beta} \sum_i (Y_i- G_i^T\beta)^2+\lambda\int_0^1 \left\{{\sum_j \beta_jg_j''}(x)\right\}^2\mathrm dx\\= &\arg\min_{\beta} \sum_i (Y_i- G_i^T\beta)^2+\lambda \beta^T \mathbf W \beta, \end{aligned}\] where \(\mathbf W\) has entries \(\mathbf W_{jk}=\int_0^1 g_j''(x)g_k''(x)\,\mathrm dx\).
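A numerical sketch of this generalized ridge formulation, under illustrative assumptions (a truncated-power cubic basis instead of the natural one, a Riemann-sum approximation of the penalty matrix, and simulated data of my own choosing):

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Truncated-power cubic basis: 1, x, x^2, x^3, (x - t)_+^3 (illustrative choice)."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.maximum(x - t, 0.0) ** 3 for t in knots]
    return np.column_stack(cols)

def cubic_spline_basis_dd(x, knots):
    """Second derivatives of the basis above: 0, 0, 2, 6x, 6(x - t)_+."""
    cols = [np.zeros_like(x), np.zeros_like(x), 2 * np.ones_like(x), 6 * x]
    cols += [6 * np.maximum(x - t, 0.0) for t in knots]
    return np.column_stack(cols)

def penalty_matrix(knots, grid_size=5000):
    """W with entries W_jk = int_0^1 g_j''(x) g_k''(x) dx, via a Riemann sum."""
    x = np.linspace(0.0, 1.0, grid_size)
    D = cubic_spline_basis_dd(x, knots)
    return (D.T @ D) * (x[1] - x[0])

rng = np.random.default_rng(3)
X = np.sort(rng.uniform(0, 1, 100))
Y = np.sin(2 * np.pi * X) + rng.normal(0, 0.1, 100)

knots = np.linspace(0.1, 0.9, 8)
G = cubic_spline_basis(X, knots)
W = penalty_matrix(knots)

lam = 1e-4
beta = np.linalg.solve(G.T @ G + lam * W, G.T @ Y)   # (G^T G + lam W)^{-1} G^T Y
print(np.mean((G @ beta - Y) ** 2))
```

Larger \(\lambda\) shrinks the fit toward functions with small integrated second derivative, i.e., toward a linear function.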
It can be shown that smoothing splines are asymptotically equivalent to kernel smoothers with varying bandwidth and a specific choice of kernel; see Silverman (1984) or Wang, Du, and Shen (2013) for a more recent contribution.
In general, smoothing splines are more practical since they can be computed efficiently, whereas kernel smoothers are easier to analyze theoretically.
We have seen that fully nonparametric methods suffer from the curse of dimensionality.
One solution is to restrict oneself to the class of additive functions \[\mathcal G=\{m \mid m(x)=m_1(x_1)+\cdots +m_p(x_p)\}.\]
Stone (1985) showed that the components \(m_k\), if twice continuously differentiable, can be estimated with the one-dimensional rate \(n^{-4/(4+1)}=n^{-4/5}\).
The components \(m_j\) are usually estimated via the so-called backfitting algorithm (Hastie and Tibshirani 1990).
Backfitting comprises the following two steps:

Backfitting Algorithm
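A minimal sketch of one common form of backfitting (the inner kNN smoother, iteration count, and data below are my own illustrative choices, not the lecture's specification): cycle over the components, smoothing the partial residuals of each component against its covariate and centering for identifiability.

```python
import numpy as np

def knn_smooth(x, r, k=10):
    """Smooth residuals r against a single covariate x with a plain kNN smoother."""
    fitted = np.empty_like(r)
    for i in range(len(x)):
        idx = np.argsort(np.abs(x - x[i]))[:k]   # k nearest neighbours of x[i]
        fitted[i] = r[idx].mean()
    return fitted

def backfit(X, Y, n_iter=20, k=10):
    """Backfitting for an additive model m(x) = alpha + m_1(x_1) + ... + m_p(x_p)."""
    n, p = X.shape
    alpha = Y.mean()                              # intercept
    m = np.zeros((p, n))                          # component fits at the data points
    for _ in range(n_iter):
        for j in range(p):
            # partial residuals: remove the intercept and all other components
            r = Y - alpha - m.sum(axis=0) + m[j]
            m[j] = knn_smooth(X[:, j], r, k)
            m[j] -= m[j].mean()                   # centre for identifiability
    return alpha, m

rng = np.random.default_rng(4)
n = 300
X = rng.uniform(-1, 1, size=(n, 2))
Y = np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.1, n)

alpha, m = backfit(X, Y)
print(np.mean((alpha + m.sum(axis=0) - Y) ** 2))  # residual mean squared error
```

Any one-dimensional linear smoother (kernel smoother, smoothing spline) could replace `knn_smooth` in the inner step.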