Lecture 4: Nonparametric regression

Munir Hiabu

Nonparametric regression

  • Linear models are quite restrictive and one may ask how one can achieve more flexibility.

  • In this lecture we will look at the nonparametric regression problem with squared loss \(L(y_1,y_2)=(y_1-y_2)^2\)

  • We already know that the Bayes rule is \(m^\ast(x)=\mathbb E[Y|X=x]\).

Nonparametric regression k-nearest-neighbour (knn)

Definition k-nearest-neighbour

The k-nearest-neighbour estimator is \[\hat m^{knn}(x)= \frac 1 k \sum_{i \in \mathcal N_k(x)}Y_i,\] where \(\mathcal N_k(x)\subseteq \{1,\dots,n\}\) contains the indices of the \(k\) points of \(\{X_1, \dots, X_n\}\) closest to \(x\).
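As a concrete illustration, here is a minimal numpy sketch of the knn estimator (my own illustration, not lecture code; the function name is hypothetical):

```python
import numpy as np

def knn_regress(x0, X, Y, k):
    """k-nearest-neighbour estimate of E[Y | X = x0]: average the Y-values
    of the k training points closest to x0 (Euclidean distance)."""
    X = np.atleast_2d(X)                              # shape (n, p)
    dists = np.linalg.norm(X - np.asarray(x0), axis=1)
    neighbours = np.argsort(dists)[:k]                # indices in N_k(x0)
    return Y[neighbours].mean()

# toy 1-d example
X = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
Y = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
print(knn_regress([1.1], X, Y, k=3))  # averages Y at X = 0, 1, 2 -> 1.0
```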

Definition linear smoother

An estimator is called a linear smoother if it can be written as \[\hat m(x)= \sum_{i=1}^n w_i(x)Y_i,\] where the weight functions \(w_i\) can depend on \(\{X_1, \dots, X_n\}\).

Nonparametric regression k-nearest-neighbour (knn)

Example (k-nearest-neighbour)

The k-nearest-neighbour estimator is a linear smoother: \[\hat m^{knn}(x)= \sum_{i=1}^n w_i(x) Y_i,\] with \[w_i(x)=\begin{cases}\frac 1 k & X_i \ \text{belongs to the}\ k \ \text{closest points to}\ x \\ 0 & \text{otherwise}\end{cases}\]
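To check the linear-smoother representation numerically, one can compute the weights explicitly (a sketch; `knn_weights` is a hypothetical helper, not lecture code):

```python
import numpy as np

def knn_weights(x0, X, k):
    """Linear-smoother weights for knn: w_i(x0) = 1/k if X_i is among the
    k closest points to x0, and 0 otherwise."""
    dists = np.linalg.norm(np.atleast_2d(X) - np.asarray(x0), axis=1)
    w = np.zeros(len(X))
    w[np.argsort(dists)[:k]] = 1.0 / k
    return w

X = np.array([[0.0], [1.0], [2.0], [3.0]])
Y = np.array([0.0, 1.0, 2.0, 3.0])
w = knn_weights([0.4], X, k=2)
print(w.sum(), w @ Y)  # weights sum to 1, and w @ Y is the knn estimate
```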

Nonparametric regression k-nearest-neighbour (knn)

Proposition (MSE k-nearest-neighbour)

Assume that \(\mathbb E[Y|X=x]=m^\ast(x)\in \mathcal G_L = \{m: \mathbb R^p \to \mathbb R \mid m \ \text{is}\ L\text{-Lipschitz continuous}\}\) and \(\textrm{Var}(Y|X=x)=\sigma^2(x)\leq \sigma^2.\) Then \[\mathbb E[(\hat m^{knn}(x)-m^\ast(x))^2]\leq (cL)^2 \left(\frac k n \right)^{2/p}+\frac {\sigma^2}k.\] In particular, for \(k_n=O( n^{2/(2+p)})\), we get \[\mathbb E[(\hat m^{knn}(x)-m^\ast(x))^2]=O(n^{-2/(2+p)}).\]
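The choice of \(k_n\) comes from balancing the squared-bias term against the variance term in the bound (up to constants):

\[(cL)^2\left(\frac k n\right)^{2/p} \asymp \frac{\sigma^2}{k} \iff k^{1+2/p}\asymp n^{2/p} \iff k\asymp n^{2/(2+p)},\]

and substituting this \(k\) back into either term gives the stated rate \(n^{-2/(2+p)}\).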

  • Note that this is slower than the “parametric” MSE of order \(pn^{-1}\).
  • In particular, the sample size needed for a given accuracy now grows exponentially in \(p\).

Nonparametric regression k-nearest-neighbour (knn)

Proof MSE k-nearest-neighbour

Write \(\mathbf X=(X_1,\dots, X_n)\) and denote by \(Y_i^{(x)}\) the response belonging to the \(i\)th closest point to \(x\) among \(X_1,\dots, X_n\). \[\begin{aligned}\mathbb E[(\hat m^{knn}(x)-m^\ast(x))^2]& =\mathbb E \Big [{\left(\mathbb{E}[\hat m^{knn}(x)|\mathbf X]-m^\ast(x)\right)^2}\Big]+{\mathbb{E}\Big[(\hat m^{knn}(x)-\mathbb{E}[\hat m^{knn}(x)|\mathbf X])^2\Big]}\\ &=\mathbb E\left[ \Big \{\frac{1}{k} \sum_{i \in \mathcal{N}_k(x)}\left(m^\ast\left(X_i\right)-m^\ast(x)\right)\Big \}^2\right] \\ &\quad + \frac 1 {k^2} \mathbb E \left[ \sum_{i=1}^k \sum_{j=1}^k \{Y_i^{(x)}- \mathbb E[Y_i^{(x)}| \mathbf X]\}\{Y_j^{(x)}- \mathbb E[Y_j^{(x)}| \mathbf X]\} \right] \\ & \leq \mathbb E\left[ \left(\frac{L}{k} \sum_{i \in \mathcal{N}_k(x)}\left\|X_i-x\right\|_2\right)^2 \right]+\frac{\sigma^2}{k} \\ &\leq^{(1)} (cL)^2 \left(\frac k n \right)^{2/p}+\frac {\sigma^2}k,\\ \end{aligned}\] where \((1)\) uses that the distance from \(x\) to its \(k\)th nearest neighbour is of order \((k/n)^{1/p}\) (see, e.g., Györfi et al. 2002).

Nonparametric regression kernel smoother

  • We have learned that the knn estimator can be written as \[\hat m^{knn}(x)= \sum_{i=1}^n w_i(x) Y_i.\]

  • Note that \(w_i(x)\) is not smooth as a function of \(x\). This also makes the estimator not smooth. Given that we assume that \(m^\ast\) is smooth, this may not be desirable.

  • An alternative is the kernel smoother.

Nonparametric regression kernel smoother

Definition kernel smoother

The kernel smoother is a linear smoother with \[\hat m^{ks}(x)= \sum_{i=1}^n w_i(x) Y_i,\] where \[w_i(x)=\frac{K\left(\frac{||x-X_i||}{h}\right)}{\sum_{j=1}^n K\left(\frac{||x-X_j||}{h}\right)}.\] Often a product kernel is used, \(K\left(\frac{||x-X_i||}{h}\right)=\prod_{j=1}^p k\left(\frac{x_j-X_{ij}}{h}\right)\), where \(k: \mathbb R \to \mathbb R\) is usually a symmetric density function.
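A minimal sketch of a kernel smoother with a Gaussian kernel, following the weight formula above (my own illustration, not lecture code):

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2)  # unnormalised Gaussian; constants cancel in the weights

def kernel_smoother(x0, X, Y, h):
    """Kernel-smoother estimate at x0: weighted average of the Y_i with
    weights K(||x0 - X_i|| / h) normalised to sum to one."""
    dists = np.linalg.norm(np.atleast_2d(X) - np.asarray(x0), axis=1)
    w = gaussian_kernel(dists / h)
    return (w / w.sum()) @ Y

X = np.array([[0.0], [1.0], [2.0]])
Y = np.array([0.0, 1.0, 2.0])
print(kernel_smoother([1.0], X, Y, h=1.0))  # symmetry around x0 = 1 gives an estimate of 1
```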

Nonparametric regression kernel smoother

Proposition (MSE kernel smoother)

Assume that \(\mathbb E[Y|X=x]=m^\ast(x)\in \mathcal G_L = \{m: \mathbb R^p \to \mathbb R \mid m \ \text{is}\ L\text{-Lipschitz continuous}\}\) and \(\textrm{Var}(Y|X=x)=\sigma^2(x)\leq \sigma^2.\) Then \[\mathbb E[(\hat m^{ks}(x)-m^\ast(x))^2]= O\left( \frac{1}{nh^p} + h^2 \right).\] In particular, for \(h_n=O(n^{-1/(2+p)})\), we get \[\mathbb E[(\hat m^{ks}(x)-m^\ast(x))^2]=O(n^{-2/(2+p)}).\]

Nonparametric regression Curse of dimensionality

  • We have seen that under a Lipschitz condition, both kernel smoother and knn have an asymptotic mean squared error of order \(n^{-2/(2+p)}\).

  • One can show that under the assumption that \(m^\ast\) is twice continuously differentiable, the rate for both methods can be improved to \[n^{-4/(4+p)}.\]

  • But the sample size required for a fixed accuracy still grows exponentially in \(p\). Furthermore, it has been shown that no method can do better under the given assumptions.

Nonparametric regression Curse of dimensionality

Here is an example:

 

We first generate 100 observations of the variable \(x_1\), which is uniformly distributed on \([0,10]\).

A total of 34 observations are within a distance of 2 units from the middle point 5

We generate \(x_2\) independent of \(x_1\) and uniformly distributed on \([0,10]\).

We find 12 observations within a distance of 2 units from the middle point (5,5).

The \(x_3\) variable is independent of \((x_1,x_2)\) and uniformly distributed on \([0,10]\).

Only a total of 1 observation(s) fall within a distance of 2 units from the middle point (5,5,5).
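The counts above can be reproduced with a short simulation (the exact numbers depend on the random seed; the slide reports one realization):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
pts = rng.uniform(0, 10, size=(n, 3))  # x1, x2, x3 i.i.d. uniform on [0, 10]

for p in (1, 2, 3):
    dists = np.linalg.norm(pts[:, :p] - 5.0, axis=1)  # distance to the middle point
    count = int((dists <= 2).sum())
    print(f"p={p}: {count} of {n} points within distance 2 of the middle point")
```

Since the \(p\)-dimensional ball around the middle point is contained in the \((p-1)\)-dimensional one (projected), the counts can only decrease with \(p\).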

Nonparametric regression Curse of dimensionality

  • A new observation \(x_0\) will have very few or no observations in its neighborhood
    • This leads to high variance
    • and high bias when increasing the size of the neighborhood
  • Under the model of the previous slide, the setting \(n=50,p=1\) has the same expected number of observations in a neighbourhood as the setting \(n=7.5\times10^{110}, p=100\).
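A back-of-the-envelope check of this equivalence (my own computation, using the volume of the Euclidean ball of radius 2 inside \([0,10]^p\); it reproduces the order of magnitude \(10^{110}\)):

```python
import math

def frac_in_ball(p, r=2.0, side=10.0):
    """Probability that a uniform point on [0, side]^p lies within Euclidean
    distance r of the middle point: vol(p-ball of radius r) / side^p
    (exact as long as the ball fits inside the cube)."""
    log_vol = (p / 2) * math.log(math.pi) + p * math.log(r) - math.lgamma(p / 2 + 1)
    return math.exp(log_vol - p * math.log(side))

target = 50 * frac_in_ball(1)          # expected neighbours for n=50, p=1 (= 20)
n_needed = target / frac_in_ball(100)  # n matching that expectation for p=100
print(f"n needed at p=100: {n_needed:.2e}")  # on the order of 10^110
```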

Nonparametric regression Curse of dimensionality

 

There are two ways to tackle the curse of dimensionality


1. Sparsity

Assume that the intrinsic dimension is lower

  • E.g. Not all variables are relevant
  • Or feature engineer a few highly predictive variables

2. Structure

Interactions are limited and structure can be exploited

  • e.g. an additive structure \(m(x)=m_1(x_1)+m_2(x_2).\)

Interpretability

Structure is also essential for interpretability.

Nonparametric regression Splines

  • We want to establish a framework to estimate additive regression functions.
  • To this end, assume \(p=1\) until further notice.

Definition (Splines)

A \(k\)th-order spline with \(l\) knotpoints \(x_1 <\dots<x_l\)

  • is a polynomial of degree \(k\) on each interval \((−\infty,x_1],[x_1,x_2],\dots,[x_l,\infty)\)
  • has continuous derivatives of orders \(0,\dots,k-1\) at the knotpoints \(x_1 <\dots<x_l\)

Nonparametric regression Splines

Examples of spline bases

A \(k\)th-order spline \(m\) with \(l\) knotpoints can be uniquely written as \[m(x)=\sum_{j=0}^{k+l}\theta_jg_j(x)\]

  • truncated power basis
    • For \(j=0,\dots, k\): \(g_{j}(x)=x^{j}\)
    • For \(j=k+1,\dots,k+l\): \(g_{j}(x)=(x-x_{j-k})^k_+\)
      • \((x)_+:= \max(x,0)\)
  • B-splines
    • more complicated, but computationally more robust and faster to compute.
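The truncated power basis is easy to evaluate directly; a small sketch (function name my own):

```python
import numpy as np

def truncated_power_basis(x, knots, k=3):
    """Evaluate the truncated power basis of a k-th order spline:
    x^0, ..., x^k and (x - t)_+^k for each knot t, i.e. k + 1 + l columns."""
    x = np.asarray(x, dtype=float)
    cols = [x**j for j in range(k + 1)]
    cols += [np.maximum(x - t, 0.0) ** k for t in knots]
    return np.column_stack(cols)

G = truncated_power_basis([0.0, 0.5, 1.0], knots=[0.25, 0.75], k=3)
print(G.shape)  # (3, 6): k + 1 + l = 3 + 1 + 2 basis functions
```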

Nonparametric regression Natural splines

  • Splines (used for regression, see next slide) have high variance at the boundaries.
  • Solution: Let the piecewise polynomial function have a lower degree on \((−\infty,x_1]\) and \([x_l,\infty)\).

Definition (Natural splines)

A \(k\)th-order natural spline (\(k\) an odd number) with knotpoints \(x_1 <\dots<x_l\)

  • is a polynomial of degree \(k\) on the intervals \([x_1,x_2],\dots,[x_{l-1},x_l]\)
  • is a polynomial of degree \((k-1)/2\) on \((−\infty,x_1]\) and \([x_l,\infty)\)
  • has continuous derivatives of orders \(0,\dots,k-1\) at the knotpoints \(x_1 <\dots<x_l\)
  • Note that natural splines have dimension \(l\), which is in particular independent of the order \(k\) (compare to dimension \(k+l+1\) for splines).
  • There is also a truncated power basis and a B-splines basis for natural splines

Nonparametric regression Linear regression with splines

  • We still assume the one-dimensional case, \(p=1\).

  • Instead of looking at observations \((X_i,Y_i)_{i=1,\dots,n}\) we can consider a natural splines basis and look at observations \((g_1(X_i), \dots, g_{l}(X_i), Y_i)_{i=1,\dots,n}.\)

  • By doing so we are able to approximate any natural spline in \(x\) instead of just linear functions in \(x\), while still being in a linear regression framework, i.e., \[\hat \beta = {\arg \min }_\beta \sum_i (Y_i - { G_i^T}\beta )^2= (\mathbf G^T\mathbf G)^{-1}\mathbf G^T\mathbf Y,\] where \(G_i^T=(g_1(X_i), \dots, g_{l}(X_i))\) are the rows of \(\mathbf G\).

  • Problem: How do we choose the number of knotpoints \(l\) and their position?

    • First thought: Cross validation. But that would be quite expensive to run.
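In code, regression with a spline basis is just ordinary least squares on the transformed design matrix. A sketch (my own illustration, using a truncated power basis and `np.linalg.lstsq` rather than the normal equations, which is numerically more stable):

```python
import numpy as np

def spline_design(x, knots, k=3):
    """Truncated power basis design matrix: columns 1, x, ..., x^k and
    (x - t)_+^k for each knot t."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x**j for j in range(k + 1)]
                           + [np.maximum(x - t, 0.0) ** k for t in knots])

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 1, 80))
Y = np.sin(2 * np.pi * X) + rng.normal(scale=0.2, size=80)

G = spline_design(X, knots=[0.25, 0.5, 0.75])  # 80 x 7 design matrix
beta, *_ = np.linalg.lstsq(G, Y, rcond=None)   # least-squares fit of the coefficients
fitted = G @ beta
print(np.mean((fitted - Y) ** 2))              # in-sample MSE of the spline fit
```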

Nonparametric regression Smoothing splines

  • Let’s look at the following minimization problem: \[\hat m= \arg\min_{m} \sum_i (Y_i-m(X_i))^2+\lambda\int_a^b m''(x)^2\mathrm dx,\] where:
    • minimization runs over all twice differentiable functions \(m\)
    • observations \(X_i\) are in \([a,b]\) for all \(i\)

Nonparametric regression Smoothing splines

Theorem (smoothing splines)

If \(m\) is twice differentiable and the solution to \[\arg\min_{m} \sum_i (Y_i-m(X_i))^2+\lambda\int_a^b m''(x)^2\mathrm dx,\] then \(m\) is a natural spline of order 3 (a natural cubic spline) with knotpoints \(X_1,\dots,X_n\).

Nonparametric regression Smoothing splines

Proof

Given a minimizer \(\tilde m\), we construct the unique natural cubic spline \(m\) on \([a,b]\) with knotpoints \(X_1,\dots, X_n\) and \(m(X_i)=\tilde m(X_i)\). We will show that \(\tilde m=m\). Define \(h=\tilde m-m\). Note that \(h(X_j)=0\) for \(j=1,\dots,n\), that \(m''(x)=m'''(x)=0\) for \(x<X_1\) and \(x>X_n\) (so the boundary terms below vanish), that \(m'''\) is constant on each interval \([X_j,X_{j+1}]\), and that \(m^{(4)}=0\). Hence by applying integration by parts twice we get \[\begin{aligned} \int_a^b m''(x)h''(x)\mathrm dx &= -\int_{X_1}^{X_n}m'''(x)h'(x)\mathrm dx \\ &= -\sum_{j=1}^{n-1}m'''(X_j^+)\,h(x)\Big|_{X_j}^{X_{j+1}} \\ &=0. \end{aligned}\] This implies \[\int_a^b \tilde m''(x)^2\mathrm dx= \int_a^b \{m''(x)+h''(x)\}^2\mathrm dx=\int_a^b m''(x)^2+h''(x)^2\mathrm dx,\] meaning \[\int_a^b m''(x)^2\mathrm dx \leq \int_a^b \tilde m''(x)^2\mathrm dx,\] with equality only for \(h''=0\), i.e., \(h\) linear. Since \(h(X_i)=0\) for \(i=1,\dots,n\), we have \(h=0\) and so \(\tilde m=m\).

Nonparametric regression Smoothing splines as generalized ridge regression

We can conclude the following:

  • Let \(\mathbf G\) be the matrix with rows \(G_i=(g_1(X_i), \dots, g_{n}(X_i))\).
    • \(\{g_j\}_{j=1,\dots,n}\) is a basis for natural cubic splines with knots \(X_1, \dots, X_n\).
  • Define

\[\hat \beta = \arg\min_{\beta} \sum_i (Y_i- G_i^T\beta)^2+\lambda\int_a^b \left\{{\sum_j \beta_jg_j''}(x)\right\}^2\mathrm dx.\]

  • Then, \[\sum_j \hat \beta_j g_j=\arg\min_{m} \sum_i (Y_i-m(X_i))^2+\lambda\int_a^b m''(x)^2\mathrm dx.\]

Nonparametric regression Smoothing splines as generalized ridge regression

Smoothing splines can be seen as a special case of generalized ridge regression:

  • Write \(\mathbf W_{ij}=\int_a^bg_i''(x)g_j''(x)\mathrm dx,\) then

\[\begin{aligned} &\arg\min_{\beta} \sum_i (Y_i- G_i^T\beta)^2+\lambda\int_a^b \left\{{\sum_j \beta_jg_j''}(x)\right\}^2\mathrm dx\\= &\arg\min_{\beta} \sum_i (Y_i- G_i^T\beta)^2+\lambda \beta^T \mathbf W \beta. \end{aligned}\]

  • Hence, \[\hat \beta= (\mathbf G^T\mathbf G+\lambda \mathbf W)^{-1}\mathbf G^T\mathbf Y.\]
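This closed form is one line of linear algebra. A sketch with an arbitrary design and penalty matrix (my own illustration, not the lecture's spline setup; with \(\mathbf W=\mathbf I\) it reduces to ordinary ridge regression):

```python
import numpy as np

def penalized_fit(G, Y, W, lam):
    """Generalized ridge estimate: beta_hat = (G^T G + lam * W)^{-1} G^T Y."""
    return np.linalg.solve(G.T @ G + lam * W, G.T @ Y)

# tiny illustration with a random design
rng = np.random.default_rng(2)
G = rng.normal(size=(30, 4))
Y = G @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=30)
W = np.eye(4)  # W = identity recovers ordinary ridge regression

b_small = penalized_fit(G, Y, W, lam=1e-8)  # essentially the OLS fit
b_large = penalized_fit(G, Y, W, lam=1e8)
print(np.linalg.norm(b_small), np.linalg.norm(b_large))  # heavier penalty shrinks beta
```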

Nonparametric regression Smoothing splines equivalence to kernel smoothing

  • It can be shown that smoothing splines are asymptotically equivalent to kernel smoothers with varying bandwidth and a specific choice of kernel, see Silverman (1984) or Wang, Du, and Shen (2013) for a more recent contribution.

  • In general, smoothing splines are more practical since they can be calculated efficiently, while kernel smoothers are much easier to analyze theoretically.

Nonparametric regression Additive models

  • We have seen that fully nonparametric methods suffer from the curse of dimensionality:

    • the optimal rate of convergence for twice continuously differentiable functions is \(n^{-4/(4+p)}\).
  • One solution is to restrict oneself to the class of additive functions \[\mathcal G=\{m| m(x)=m_1(x_1)+\cdots +m_p(x_p)\}\]

  • Stone (1985) showed that the components \(m_k\), if twice continuously differentiable, can be estimated at the one-dimensional rate \(n^{-4/(4+1)}=n^{-4/5}\).

  • The components \(m_j\) are usually estimated via the so called backfitting algorithm (Hastie and Tibshirani 1990).

Nonparametric regression Additive models: backfitting

Backfitting comprises the following steps:

Backfitting Algorithm

  • Initialize: \(\hat m_j^{[0]}=0, j=1,\dots,p\)
  • Iterate for \(r=1,\dots\)
    • Iterate for \(j=1,\dots,p\)
      1. Residuals: \(r_{ij}^{[r]}=Y_i-\sum_{k < j} \hat{m}_{k}^{[r]}(X_{ik})-\sum_{k > j} \hat{m}_{k}^{[r-1]}(X_{ik})\).
      2. Smooth: \(\hat{m}_j^{[r]}=\operatorname{Smooth}\left(\left\{(X_{ij},r_{ij}^{[r]})\right\}_{i=1,\dots,n}\right).\)
      3. Center: \(\hat{m}_j^{[r]}=\hat{m}_j^{[r]}-\frac{1}{n} \sum_{i=1}^n \hat{m}_j^{[r]}\left(X_{i j}\right)\).
  • Note that Smooth is a one-dimensional regression problem.
  • In practice Smooth is most often a smoothing spline.
    • It has been shown that backfitting via smoothing splines achieves a rate of order \(p\, n^{-4/(4+1)}\).
  • Problem: Additive methods are still not optimal in the case of sparsity (i.e., some features not being relevant) or of interactions between features.
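For completeness, a compact sketch of the algorithm above (my own code; it uses a simple knn running mean as the Smooth step rather than the smoothing spline that is more common in practice):

```python
import numpy as np

def smooth_knn(x, y, k=10):
    """One-dimensional Smooth step: knn running mean of y against x."""
    x = np.asarray(x)
    fitted = np.empty(len(y))
    for i, x0 in enumerate(x):
        nn = np.argsort(np.abs(x - x0))[:k]
        fitted[i] = y[nn].mean()
    return fitted

def backfit(X, Y, n_iter=20, k=10):
    """Backfitting for an additive model Y ~ alpha + m_1(x_1) + ... + m_p(x_p);
    returns the intercept and the fitted component values at the observations."""
    n, p = X.shape
    alpha = Y.mean()
    m_hat = np.zeros((n, p))                                      # initialize m_j^[0] = 0
    for _ in range(n_iter):
        for j in range(p):
            resid = Y - alpha - m_hat.sum(axis=1) + m_hat[:, j]   # partial residuals
            m_hat[:, j] = smooth_knn(X[:, j], resid, k)           # Smooth step
            m_hat[:, j] -= m_hat[:, j].mean()                     # centring step
    return alpha, m_hat

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(200, 2))
Y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)
alpha, m_hat = backfit(X, Y)
print(np.mean((alpha + m_hat.sum(axis=1) - Y) ** 2))  # in-sample MSE of the additive fit
```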

References

Györfi, László, Michael Kohler, Adam Krzyżak, and Harro Walk. 2002. A Distribution-Free Theory of Nonparametric Regression. Vol. 1. Springer.
Hastie, T. J., and R. J. Tibshirani. 1990. Generalized Additive Models. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis. https://books.google.dk/books?id=qa29r1Ze1coC.
Silverman, Bernard W. 1984. “Spline Smoothing: The Equivalent Variable Kernel Method.” The Annals of Statistics, 898–916.
Stone, Charles J. 1985. “Additive Regression and Other Nonparametric Models.” The Annals of Statistics 13 (2): 689–705.
Wang, Xiao, Pang Du, and Jinglai Shen. 2013. “Smoothing Splines with Varying Smoothing Parameter.” Biometrika 100 (4): 955–70.