Lecture 1: Fundamentals of supervised learning

Munir Hiabu

Supervised Learning: Definition and Examples

Definition (Supervised Learning)

Supervised learning is a field in machine learning that works with labeled data, i.e. data consisting of a set of features \(X\) and a response \(Y\). The goal is to learn a function that maps a given input to an output. For the purpose of these notes, data for this problem consists of \(n\) i.i.d. realisations \(\mathcal D_n=(X_i,Y_i)_{i=1,\dots,n}\) called training data with \((X_i,Y_i) \overset{D}{=} (X,Y)\).

Examples

  • \(X=\) handwritten text, \(Y=\) words and/or numbers
    • E.g. automatic reading of the postcode from a mail envelope
  • \(X=\) picture, \(Y=\) a chosen set of categories
    • E.g. cancer detection from a CT scan
  • \(X=\) computer text, \(Y=\) a chosen set of categories
    • E.g. spam detection in emails
  • \(X=\) spreadsheet, \(Y=\) real valued
    • E.g. insurance pricing
  • \(X=\) spreadsheet, \(Y=\) a chosen set of categories
    • E.g. fraud detection

In this course, we will focus on tabular data, i.e. “\(X=\) spreadsheet”.

Supervised Learning: Assumptions

We assume that we are given \(n\) independent copies of random variables \((X,Y)\), \(\mathcal D_n=(X_i, Y_i)_{i=1,\dots,n}\) (training data).

Caution: When is the i.i.d. assumption a problem?

E.g.:

  • Distribution shift: The distribution of \(X\) changes (e.g. over time or space)
  • Stochastic process/time series: Independence assumption is violated
  • The relationship between \(X\) and \(Y\) changes over time/space.

We assume that \(X_i \in \mathbb R^p=\mathcal X\) and \(Y_i \in \mathcal Y\), with

  • \(\mathcal Y = \mathbb R\) (regression) or
  • \(\mathcal Y = \{1,\dots, K\}\) (classification).

The goal is to find a good learning algorithm \(\hat m(\mathcal D_n)=\hat m_n: \mathcal X \mapsto \mathcal Y\).
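These assumptions can be made concrete in a short simulation. A minimal sketch, where the linear signal and the Gaussian noise level are illustrative assumptions, not part of the setup:

```python
import random

# Sketch: n i.i.d. copies (X_i, Y_i) with X_i in R^p and Y_i in R (regression).
# The linear relationship Y = sum(X) + noise is an assumption for illustration.
def make_training_data(n, p, seed=0):
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(p)]   # X_i in R^p
        y = sum(x) + rng.gauss(0, 0.1)            # Y_i in R
        data.append((x, y))
    return data

D_n = make_training_data(n=100, p=3)   # the training data D_n
```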

Supervised Learning: What is a good estimator?

Learning an estimator can be motivated by two distinct tasks:

  • Prediction: given a new observation \(X\), what might the corresponding \(Y\) look like?
    • E.g. insurance pricing, weather forecast, etc.
  • Interpretation: What is the impact/influence of \(X_{11}\) on \(Y\)?
    • E.g. explaining the policy price to an insurance customer, nudging/intervention strategies

Definitions

We assume that \((X,Y)\) lives on a probability space \((\Omega, \mathcal F, \mathcal P)\).

  • A decision function is a deterministic function \(m: \mathcal X \mapsto \mathcal Y\).
  • A loss function is a deterministic function \(L: \mathcal Y \times \mathcal Y \mapsto \mathbb R_{\geq 0}\).
  • The risk of a decision function \(m\) given a loss function \(L\) is \(r(m)=\mathbb E[L(Y,m(X))]\).
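The risk of a decision function can be approximated by Monte Carlo. A sketch for the quadratic loss, where the data-generating process (\(X\sim N(0,1)\), \(Y|X=x \sim N(2x,1)\)) is an assumption for illustration:

```python
import random

# Monte Carlo estimate of the risk r(m) = E[L(Y, m(X))].
# Assumed model for illustration: X ~ N(0, 1), Y | X = x ~ N(2x, 1).
def risk(m, L, n_sim=100_000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sim):
        x = rng.gauss(0, 1)
        y = 2 * x + rng.gauss(0, 1)
        total += L(y, m(x))
    return total / n_sim

L_sq = lambda y1, y2: (y1 - y2) ** 2
r_good = risk(lambda x: 2 * x, L_sq)   # roughly 1: only the noise variance remains
r_bad = risk(lambda x: 0.0, L_sq)      # roughly 5: 4*Var(X) + noise variance
```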


Examples

  • Quadratic loss function: \(L(y_1,y_2)=(y_1-y_2)^2\)
  • Poisson Deviance: \(L(y_1,y_2)=2\left(y_1\log {\frac {y_1}{y_2 }}-y_1+y_2 \right)\)
  • Binary loss function: \(L(y_1,y_2)=\Bbb 1(y_1\neq y_2)\)

Other popular loss functions are the absolute loss, the negative log-likelihood (for any popular distribution), the Huber loss, the hinge loss…
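The three example losses, written out as plain functions (a minimal sketch; the Poisson deviance requires \(y_2>0\), and the term \(y_1\log(y_1/y_2)\) is read as \(0\) when \(y_1=0\)):

```python
import math

def quadratic_loss(y1, y2):
    return (y1 - y2) ** 2

def poisson_deviance(y1, y2):
    # requires y2 > 0; the y1*log(y1/y2) term is read as 0 when y1 = 0
    term = y1 * math.log(y1 / y2) if y1 > 0 else 0.0
    return 2 * (term - y1 + y2)

def binary_loss(y1, y2):
    return float(y1 != y2)
```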

Definition

Given a function space \(\mathcal G \subseteq \{m \mid m: \mathcal X \mapsto \mathcal Y\}\), we say that a decision function \(m^* \in \mathcal G\) is the Bayes rule (regression) or Bayes classifier (classification) if \[ r(m^*) = \inf \{r(m) \mid m \in \mathcal G\} \] We say that the quantity \(\inf \{r(m) \mid m \in \mathcal G\}\) is the Bayes risk. Note that the latter, as the infimum of a set of reals bounded from below, always exists, even if the Bayes rule does not.


Lemma

  • Assume \(Y\in L_2(\Omega, \mathcal F, P)\). Then the decision function that minimizes the risk for the quadratic loss function is the conditional mean: \[m^\ast=\underset{m}{\operatorname{argmin}}\, \mathbb E[\{Y-m(X)\}^2], \qquad m^\ast(x)=\mathbb E[Y|X=x].\]
  • Assume that \(\mathcal Y=\{1,\dots, K\}\). Then the decision function that minimizes the risk for the binary loss function satisfies \[m^\ast=\underset{m}{\operatorname{argmin}}\, \mathbb E[\Bbb 1(Y\neq m(X))]=\underset{m}{\operatorname{argmin}}\, \mathbb P(Y\neq m(X)), \qquad m^\ast(x)=\underset{k=1,\dots,K}{\operatorname{argmax}}\, \mathbb P(Y=k|X=x).\]
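A quick empirical sanity check of the first statement (an illustration, not a proof), under the assumed model \(Y|X=x \sim N(x^2,1)\): the conditional mean \(m^\ast(x)=x^2\) should attain lower empirical quadratic risk than any competitor we try.

```python
import random

# Compare the empirical quadratic risk of the conditional mean m*(x) = x^2
# against a few competitors; the conditional mean should come out smallest.
def empirical_risk(m, n_sim=100_000, seed=1):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sim):
        x = rng.gauss(0, 1)
        y = x ** 2 + rng.gauss(0, 1)
        total += (y - m(x)) ** 2
    return total / n_sim

bayes_risk = empirical_risk(lambda x: x ** 2)   # roughly 1 = Var(Y | X = x)
competitors = [lambda x: x, lambda x: 1.0, lambda x: 2 * x ** 2]
assert all(bayes_risk < empirical_risk(m) for m in competitors)
```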

Supervised Learning: What is a good estimator? Generalization Error

Definition (Conditional risk (of an algorithm))

Denote training data by \(\mathcal D_n=(X_i,Y_i)_{i=1,\dots,n}\). Given an estimator \(\hat m_n\) we call \[ R(\hat m_n)=\mathbb E[L(Y,\hat m_n(X))| \mathcal D_n] \] the conditional risk or conditional generalization error.

Definition (Risk (of an algorithm))

We call \[ r(\hat m_n)=\mathbb E[R(\hat m_n)], \] the (unconditional) risk or generalization error.


Remark

  • The conditional risk is conditional on the training data \(\mathcal D_n\).

    • It evaluates how \(\hat m_n\) performs when trained on the given training data.
  • Since \(\mathcal D_n\) is assumed stochastic, one could imagine that \(\hat m_n\) is generally good at learning the relationship between \(X\) and \(Y,\) but performs badly on the data \(\mathcal D_n\) at hand.

  • The unconditional risk arises by averaging over all possible training data sets.

  • One may be interested in the unconditional risk when discussing theoretical results about the best algorithm, while the conditional risk is probably the more interesting object in concrete applications with a given data set.
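The distinction can be sketched in a small simulation. Everything model-specific below (the model \(Y=3X+\varepsilon\), least squares through the origin, the sample sizes) is an assumption for illustration: the conditional risk fixes one training set, while the unconditional risk averages over many.

```python
import random

def draw(rng, n):
    # n i.i.d. copies of (X, Y) with Y = 3X + standard normal noise
    data = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        data.append((x, 3 * x + rng.gauss(0, 1)))
    return data

def fit(train):
    # least squares through the origin: argmin_b sum_i (Y_i - b*X_i)^2
    return sum(x * y for x, y in train) / sum(x * x for x, _ in train)

def conditional_risk(b, rng, n_test=20_000):
    # Monte Carlo estimate of E[(Y - bX)^2 | D_n]; D_n enters only through b
    return sum((y - b * x) ** 2 for x, y in draw(rng, n_test)) / n_test

rng = random.Random(0)
R_cond = conditional_risk(fit(draw(rng, 20)), rng)     # one fixed training set
r_uncond = sum(conditional_risk(fit(draw(rng, 20)), rng)
               for _ in range(200)) / 200              # averaged over 200 sets
```

The unconditional risk sits slightly above the noise variance 1, reflecting the average estimation error across training sets.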

Supervised Learning: What is a good estimator? Excess risk and convergence rate

Definition (Excess risk)

Assume that the Bayes rule \(m^\ast\) exists. The conditional excess risk of an algorithm is defined as

\[ R_e(\hat{m}_n) = R(\hat{m}_n) - r(m^*). \] The (unconditional) excess risk of an algorithm is defined as \[ r_e(\hat{m}_n) = r(\hat{m}_n) - r(m^*). \]

Supervised Learning: Excess risk decomposition

Usually, a machine learning algorithm can only learn from a fixed function class \[ \mathcal{G} \subset\{m: \mathcal{X} \rightarrow \mathcal{Y} \text { measurable}\}. \] In particular we may have \(m^* \notin \mathcal{G}\), and the excess risk can be decomposed into \[ R_e(\hat{m}_n) = R\left(\hat{m}_n\right)-r\left(m^*\right)=\underbrace{\left[R\left(\hat{m}_n\right)-\inf _{m \in \mathcal{G}} r(m)\right]}_{\text {estimation error }}+\underbrace{\left[\inf _{m \in \mathcal{G}} r(m)-r\left(m^*\right)\right]}_{\text {inductive bias (or approximation error)}} . \]

  • Heuristic: A large \(\mathcal G\) usually means small inductive bias but large estimation error and vice versa.
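The heuristic can be illustrated with two extreme classes: the constants (large inductive bias, tiny estimation error) and 1-nearest-neighbour prediction as a stand-in for a very large class (near-zero inductive bias, large estimation error). The target \(m^\ast(x)=x^2\) and the noise levels below are assumptions for illustration:

```python
import random

def eval_risk(predict, sigma, n_test=20_000, seed=9):
    # Monte Carlo risk under X ~ U(-1, 1), Y = X^2 + N(0, sigma^2)
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_test):
        x = rng.uniform(-1, 1)
        total += (x ** 2 + rng.gauss(0, sigma) - predict(x)) ** 2
    return total / n_test

def experiment(sigma, n=30, seed=7):
    rng = random.Random(seed)
    train = []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        train.append((x, x ** 2 + rng.gauss(0, sigma)))
    const = sum(y for _, y in train) / n      # ERM over the constants
    def one_nn(x):                            # ERM over (almost) all functions
        return min(train, key=lambda p: abs(p[0] - x))[1]
    return eval_risk(lambda _: const, sigma), eval_risk(one_nn, sigma)

r_const_lo, r_nn_lo = experiment(sigma=0.1)   # low noise: the flexible class wins
r_const_hi, r_nn_hi = experiment(sigma=2.0)   # high noise, same n: constants win
```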

Supervised Learning: Empirical risk minimizer/standard learner

Definition (Empirical risk and empirical risk minimizer)

Given training data \(\mathcal D_n\) and a loss function \(L\), we call \[ \hat R_n(m):= \frac{1}{n}\sum_{i=1}^n L(Y_i,m(X_i)) \] the empirical risk. Given an additional function class \(\mathcal G\), \[ \underset{m \in \mathcal G }{\operatorname{argmin}}\,\hat R_n(m)= \underset{m \in \mathcal G }{\operatorname{argmin}}\,\frac{1}{n}\sum_{i=1}^n L(Y_i,m(X_i)) \] is called the empirical risk minimizer (or standard learner).
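For the quadratic loss and the class \(\mathcal G=\{x \mapsto a+bx\}\), the empirical risk minimizer is ordinary least squares, which has a closed form. A sketch, where the data-generating model is an assumption for illustration:

```python
import random

def erm_linear(train):
    # closed-form minimizer of (1/n) * sum_i (Y_i - a - b*X_i)^2
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    b = (sum((x - mx) * (y - my) for x, y in train)
         / sum((x - mx) ** 2 for x, _ in train))
    return my - b * mx, b   # (intercept a, slope b)

rng = random.Random(3)
train = []
for _ in range(200):
    x = rng.gauss(0, 1)
    train.append((x, 1 + 2 * x + rng.gauss(0, 0.5)))
a_hat, b_hat = erm_linear(train)   # close to the true (1, 2)
```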

Supervised Learning: Penalty terms

For larger function classes \(\mathcal G\), the empirical risk minimizer might not be unique and may be too noisy.

In this case one sometimes adds a penalty term \(J_\lambda: \mathcal G \mapsto \mathbb R_{\geq 0}\) that penalizes the complexity of \(m\) and minimizes the penalized empirical risk: \[ \underset{m \in \mathcal G }{\operatorname{argmin}}\,\hat{R}_{n, \lambda}(m), \qquad \hat{R}_{n, \lambda}(m):= \frac{1}{n}\sum_{i=1}^n L(Y_i,m(X_i))+J_\lambda(m). \]
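A minimal concrete case: quadratic loss over \(\mathcal G=\{x \mapsto bx\}\) with the ridge-type penalty \(J_\lambda(m)=\lambda b^2\) (an assumed choice for illustration). Setting the derivative in \(b\) to zero gives a closed form, and a larger \(\lambda\) shrinks the minimizer towards zero:

```python
def ridge_coef(train, lam):
    # argmin_b (1/n) * sum_i (Y_i - b*X_i)^2 + lam * b^2
    n = len(train)
    sxy = sum(x * y for x, y in train) / n
    sxx = sum(x * x for x, _ in train) / n
    return sxy / (sxx + lam)

toy = [(1.0, 2.1), (2.0, 3.9), (-1.0, -2.2), (0.5, 1.0)]   # toy data
b_plain = ridge_coef(toy, lam=0.0)    # the unpenalized ERM
b_shrunk = ridge_coef(toy, lam=1.0)   # pulled towards zero
```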

Supervised Learning: Penalty terms: Strong duality

Under convexity conditions on \(J_\lambda\) and \(\hat R_n\) one can show that, given data \(\mathcal D_n\) and a function space \(\mathcal G\), there exists a constant \(\eta(\mathcal D_n, \lambda)\) such that for \(\mathcal G_\eta=\{m \in \mathcal G \mid J_{\lambda} (m) \leq \eta\}\), we have \[ \underset{m \in \mathcal G }{\operatorname{argmin}}\,\hat{R}_{n, \lambda}(m)=\underset{m \in \mathcal G_\eta }{\operatorname{argmin}}\,\hat{R}_{n}(m). \]

Examples (Penalty terms)

Penalty terms often used are

  • \(J_\lambda(m)= \lambda \int \{m''(x)\}^2 \,\mathrm dx\)
  • \(J_\lambda(m)= \lambda \int |m(x)| \mathrm dx\)
  • \(J_\lambda(m)= \lambda \int (m(x))^2 \mathrm dx\)

Supervised Learning: Probability bounds for the estimation error

Proposition

Let \(\hat m_n\) denote the empirical risk minimizer over \(\mathcal G\) and let \({m}^\ast \in \arg \min _{m \in \mathcal{G}} r(m)\). We have \[ r\left(\hat{m}_n\right)-r({m}^\ast) \leq 2 \sup _{m \in \mathcal{G}}\left|\hat{R}_n(m)-r(m)\right|, \] and, with \(\hat m_{n,\lambda}\) the penalized empirical risk minimizer, for all \(\lambda\), \[ r\left(\hat{m}_{n, \lambda}\right)-r({m}^\ast) \leq 2 \sup _{m \in \mathcal{G}}\left|\hat{R}_{n}(m)-r(m)\right|+J_\lambda({m}^\ast)-J_\lambda\left(\hat{m}_{n,\lambda}\right) . \]
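A sketch of the argument for the first inequality, using only that the empirical risk minimizer \(\hat m_n\) minimizes \(\hat R_n\) over \(\mathcal G\):

\[ r(\hat{m}_n)-r(m^\ast) = \underbrace{\left[r(\hat{m}_n)-\hat R_n(\hat{m}_n)\right]}_{\leq\, \sup_{m\in\mathcal G}|\hat R_n(m)-r(m)|} + \underbrace{\left[\hat R_n(\hat{m}_n)-\hat R_n(m^\ast)\right]}_{\leq\, 0} + \underbrace{\left[\hat R_n(m^\ast)-r(m^\ast)\right]}_{\leq\, \sup_{m\in\mathcal G}|\hat R_n(m)-r(m)|}, \]

where the middle term is non-positive because \(\hat m_n\) minimizes the empirical risk over \(\mathcal G\).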

  • Concept: If, for a specific loss function and function class, one can obtain a uniform bound on \(\sup_{m \in \mathcal G}| \hat R_n(m)-r(m)|\), then the estimation error can be bounded.

  • Often this is only possible in probability:

Supervised Learning: Probably approximately correct

Definition (Probably approximately correct (PAC))

We say \(\hat{m}_n\) is \(\epsilon\)-accurate with probability \(1-\delta\) if \[ P\left(r\left(\hat{m}_n\right)-r\left(m^\ast\right)>\epsilon\right)<\delta. \]
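A toy simulation of this definition (all modelling choices are assumptions for illustration): \(X\sim U(-1,1)\), \(Y=\Bbb 1(X>0)\) with labels flipped with probability \(0.1\), and ERM over the finite class of threshold rules \(m_t(x)=\Bbb 1(x>t)\). For this model \(r(m_t)=0.1+0.4|t|\), so the excess risk of the selected rule is \(0.4|\hat t|\), and the failure probability \(\delta\) can be estimated by repeating the experiment:

```python
import random

THRESHOLDS = [k / 5 for k in range(-4, 5)]   # candidate thresholds -0.8, ..., 0.8

def erm_threshold(train):
    # empirical risk minimizer for the binary loss over the threshold class
    def emp_risk(t):
        return sum((x > t) != y for x, y in train) / len(train)
    return min(THRESHOLDS, key=emp_risk)

def failure_rate(n, trials=200, eps=0.1, seed=4):
    # fraction of training sets where the excess risk 0.4*|t_hat| exceeds eps
    rng = random.Random(seed)
    fails = 0
    for _ in range(trials):
        train = []
        for _ in range(n):
            x = rng.uniform(-1, 1)
            y = (x > 0) != (rng.random() < 0.1)   # true label, flipped w.p. 0.1
            train.append((x, y))
        fails += 0.4 * abs(erm_threshold(train)) > eps
    return fails / trials

delta_hat = failure_rate(n=200)   # small: ERM is eps-accurate with high probability
```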