Definition (Supervised Learning)
Supervised learning is the branch of machine learning that works with labeled data, i.e. data consisting of a set of features \(X\) and a response \(Y\). The goal is to learn a function that maps a given input to an output. For the purpose of these notes, the data consists of \(n\) i.i.d. realisations \(\mathcal D_n=(X_i,Y_i)_{i=1,\dots,n}\), called training data, with \((X_i,Y_i) \overset{D}{=} (X,Y)\).
In this course, we will focus on tabular data, i.e. “\(X=\) spreadsheet”.
Caution: when is the i.i.d. assumption a problem? E.g. for time series (temporal dependence between observations), clustered or repeated measurements on the same units, or when the distribution at deployment differs from the training distribution.
We assume that \(X_i \in \mathbb R^p=\mathcal X\) and \(Y_i \in \mathcal Y\), with \(\mathcal Y = \mathbb R\) in the regression setting and \(\mathcal Y\) a finite set of labels in the classification setting.
The goal is to find a good learning algorithm \(\hat m(\mathcal D_n)=\hat m_n: \mathcal X \mapsto \mathcal Y\).
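As an illustration of this abstract interface, here is a minimal sketch (not from the notes): a learning algorithm is a map that takes the training data \(\mathcal D_n\) and returns a fitted decision function \(\hat m_n: \mathcal X \to \mathcal Y\). The choice of 1-nearest-neighbour regression below is a hypothetical example.

```python
import numpy as np

def fit_1nn(X_train, Y_train):
    """A learning algorithm: maps D_n to a predictor m_hat: R^p -> R.
    Hypothetical example using 1-nearest-neighbour regression."""
    X_train = np.asarray(X_train, dtype=float)
    Y_train = np.asarray(Y_train, dtype=float)

    def m_hat(x):
        # predict with the label of the closest training point
        i = np.argmin(np.linalg.norm(X_train - np.asarray(x, dtype=float), axis=1))
        return Y_train[i]

    return m_hat

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                     # n = 50 observations, p = 2 features
Y = X[:, 0] + rng.normal(scale=0.1, size=50)
m_hat = fit_1nn(X, Y)                            # m_hat = m_hat(D_n)
print(m_hat(X[3]))                               # on a training point, 1-NN returns its own label
```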
Learning an estimator can be motivated by two distinct tasks:
Definitions
We assume that \((X,Y)\) lives on a probability space \((\Omega, \mathcal F, \mathcal P)\).
Examples
Other popular loss functions are the absolute loss, the negative log-likelihood (for any parametric distribution), the Huber loss, the hinge loss, …
Definition
Given a function space \(\mathcal G \subseteq \{m \mid m: \mathcal X \mapsto \mathcal Y\}\), we say that a decision function \(m^* \in \mathcal G\) is a Bayes rule (regression) or Bayes classifier (classification) if \[ r(m^*) = \inf \{r(m) \mid m \in \mathcal G\}. \] We call the quantity \(\inf \{r(m) \mid m \in \mathcal G\}\) the Bayes risk. Note that the latter, as the infimum of a set of real numbers bounded from below, always exists, even if a Bayes rule (a minimizer) does not.
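For the squared loss, the Bayes rule over all measurable functions is the conditional mean \(m^*(x)=\mathbb E[Y\mid X=x]\). The following Monte Carlo sketch illustrates this for an assumed model \(Y=\sin(X)+\varepsilon\) (this model and its noise level are illustrative assumptions, not from the notes): the conditional mean attains the Bayes risk \(\operatorname{Var}(\varepsilon)\), and a perturbed rule has strictly larger risk.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
X = rng.uniform(-3, 3, size=n)
Y = np.sin(X) + rng.normal(scale=0.5, size=n)   # assumed model: m*(x) = sin(x)

def risk(m):
    # Monte Carlo estimate of r(m) = E[(Y - m(X))^2]
    return np.mean((Y - m(X)) ** 2)

r_bayes = risk(np.sin)                           # Bayes rule m*(x) = E[Y|X=x]
r_other = risk(lambda x: np.sin(x) + 0.3)        # perturbed rule
print(r_bayes, r_other)                          # r_bayes ~ Var(eps) = 0.25 < r_other ~ 0.34
```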
Lemma
Definition (Conditional risk (of an algorithm))
Denote training data by \(\mathcal D_n=(X_i,Y_i)_{i=1,\dots,n}\). Given an estimator \(\hat m_n\) we call \[ R(\hat m_n)=\mathbb E[L(Y,\hat m_n(X))| \mathcal D_n] \] the conditional risk or conditional generalization error.
Definition (Risk (of an algorithm))
We call \[ r(\hat m_n)=\mathbb E[R(\hat m_n)], \] the (unconditional) risk or generalization error.
Remark
The conditional risk is conditional on the training data \(\mathcal D_n\).
Since \(\mathcal D_n\) is random, it may happen that \(\hat m_n\) is generally good at learning the relationship between \(X\) and \(Y\), but performs badly on the particular data \(\mathcal D_n\) at hand.
The unconditional risk arises by averaging over all possible training data sets.
One may be interested in the unconditional risk when discussing theoretical results about the quality of an algorithm, while the conditional risk is usually the more relevant object in concrete applications with a given data set.
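The distinction can be made concrete by simulation. In the sketch below (the linear model and the slope estimator are illustrative assumptions), each draw of \(\mathcal D_n\) yields a different fitted function and hence a different conditional risk \(R(\hat m_n)\); averaging these over many training sets approximates the unconditional risk \(r(\hat m_n)\).

```python
import numpy as np

rng = np.random.default_rng(2)

def conditional_risk(n=20, n_test=100_000):
    # Draw one training set D_n from the assumed model Y = 2*X + eps,
    # fit a least-squares slope through the origin, and estimate R(m_hat | D_n)
    X = rng.normal(size=n)
    Y = 2 * X + rng.normal(size=n)
    beta_hat = (X @ Y) / (X @ X)
    X_test = rng.normal(size=n_test)          # fresh data, independent of D_n
    Y_test = 2 * X_test + rng.normal(size=n_test)
    return np.mean((Y_test - beta_hat * X_test) ** 2)

cond_risks = np.array([conditional_risk() for _ in range(200)])
print(cond_risks.std())   # the conditional risk varies with the training set ...
print(cond_risks.mean())  # ... the unconditional risk averages over it (here ~ 1 + 1/(n-2))
```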
Definition (Excess risk)
Assume that the Bayes rule \(m^\ast\) exists. The conditional excess risk of an algorithm is
\[ R_e(\hat{m}_n) = R(\hat{m}_n) - r(m^*). \] The (unconditional) excess risk of an algorithm is defined as \[ r_e(\hat{m}_n) = r(\hat{m}_n) - r(m^*). \]
Usually, a machine learning algorithm can only learn functions from a fixed function class \[ \mathcal{G} \subset\{m: \mathcal{X} \rightarrow \mathcal{Y} \text { measurable}\}. \] In particular, we may have \(m^* \notin \mathcal{G}\), and the excess risk can be decomposed as \[ R_e(\hat{m}_n) = R\left(\hat{m}_n\right)-r\left(m^*\right)=\underbrace{\left[R\left(\hat{m}_n\right)-\inf _{m \in \mathcal{G}} r(m)\right]}_{\text {estimation error }}+\underbrace{\left[\inf _{m \in \mathcal{G}} r(m)-r\left(m^*\right)\right]}_{\text {inductive bias (or approximation error)}} . \]
Definition (Empirical risk and empirical risk minimizer)
Given training data \(\mathcal D_n\) and a loss function \(L\), we call \[ \hat R_n(m):= \frac{1}{n}\sum_{i=1}^n L(Y_i,m(X_i)) \] the empirical risk. Given additionally a function class \(\mathcal G\), \[ \underset{m \in \mathcal G }{\operatorname{argmin}}\,\hat R_n(m)= \underset{m \in \mathcal G }{\operatorname{argmin}}\,\frac{1}{n}\sum_{i=1}^n L(Y_i,m(X_i)) \] is called the empirical risk minimizer (or standard learner).
For larger function classes \(\mathcal G\) the empirical risk minimizer might not be unique and may be too noisy (it overfits the training data).
In this case one sometimes adds a penalty term \(J_\lambda: \mathcal G \mapsto \mathbb R_{\geq 0}\) that penalizes the complexity of \(m\) and minimizes the penalized empirical risk \(\hat{R}_{n,\lambda}(m) := \hat R_n(m) + J_\lambda(m)\): \[ \hat m_{n,\lambda} := \underset{m \in \mathcal G }{\operatorname{argmin}}\,\hat{R}_{n, \lambda}(m)= \underset{m \in \mathcal G }{\operatorname{argmin}}\left\{\hat R_n(m)+J_\lambda(m)\right\}. \]
Under convexity conditions on \(J_\lambda\) and \(\hat R_n\), one can show that, given data \(\mathcal D_n\) and a function space \(\mathcal G\), there exists a constant \(\eta(\mathcal D_n, \lambda)\) such that, for \(\mathcal G_\eta := \{m \in \mathcal G \mid J_{\lambda} (m) \leq \eta\}\), we have \[ \underset{m \in \mathcal G }{\operatorname{argmin}}\,\hat{R}_{n, \lambda}(m)=\underset{m \in \mathcal G_\eta }{\operatorname{argmin}}\,\hat{R}_{n}(m). \]
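For the squared loss and the linear class \(\mathcal G=\{x \mapsto x^\top\beta\}\), both minimizers have closed forms: the empirical risk minimizer is ordinary least squares, and a ridge-type penalty \(J_\lambda(\beta)=\lambda\|\beta\|_2^2\) (an illustrative choice of penalty, not fixed by the notes) yields the ridge estimator. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.0, 0.5, 0.0])
Y = X @ beta_true + rng.normal(size=n)

# ERM: argmin_beta (1/n) sum_i (Y_i - X_i @ beta)^2  -> normal equations
beta_erm = np.linalg.solve(X.T @ X, X.T @ Y)

# Penalized ERM with J_lambda(beta) = lam * ||beta||_2^2:
# stationarity gives (X'X/n + lam*I) beta = X'Y/n
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ Y / n)

print(np.linalg.norm(beta_erm), np.linalg.norm(beta_ridge))
# the penalty shrinks the minimizer: ||beta_ridge|| < ||beta_erm||
```

Here the penalized minimizer is exactly the minimizer of the unpenalized empirical risk over the constrained class \(\mathcal G_\eta=\{\beta : \|\beta\|_2^2 \le \eta\}\) for a data-dependent \(\eta\), matching the statement above.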
Examples (Penalty terms)
Penalty terms often used are, e.g. for linear models \(m_\beta(x)=x^\top\beta\), the ridge penalty \(J_\lambda(m_\beta)=\lambda\|\beta\|_2^2\) and the lasso penalty \(J_\lambda(m_\beta)=\lambda\|\beta\|_1\).
Proposition
Let \({m}^\ast \in \arg \min _{m \in \mathcal{G}} r(m)\). We have \[ r\left(\hat{m}_n\right)-r({m}^\ast) \leq 2 \sup _{m \in \mathcal{G}}\left|\hat{R}_n(m)-r(m)\right|, \] and for all \(\lambda \in \Lambda\), \[ r\left(\hat{m}_{n, \lambda}\right)-r({m}^\ast) \leq 2 \sup _{m \in \mathcal{G}}\left|\hat{R}_{n, \lambda}(m)-r(m)\right|+J_\lambda({m}^\ast)-J_\lambda\left(\hat{m}_n\right) . \]
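The first inequality follows from the standard three-term decomposition, where \(\hat m_n\) denotes the empirical risk minimizer over \(\mathcal G\), so that \(\hat R_n(\hat m_n)\leq \hat R_n(m^\ast)\):
\[
r(\hat m_n)-r(m^\ast)
=\underbrace{\left[r(\hat m_n)-\hat R_n(\hat m_n)\right]}_{\leq\, \sup_{m\in\mathcal G}|\hat R_n(m)-r(m)|}
+\underbrace{\left[\hat R_n(\hat m_n)-\hat R_n(m^\ast)\right]}_{\leq\, 0}
+\underbrace{\left[\hat R_n(m^\ast)-r(m^\ast)\right]}_{\leq\, \sup_{m\in\mathcal G}|\hat R_n(m)-r(m)|}.
\]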
Concept: if, for a specific loss function and function class, one can obtain a uniform bound on \(\sup_{m \in \mathcal G}| \hat R_n(m)-r(m)|\), then the estimation error, and hence the excess risk over \(\mathcal G\), can be bounded.
Often this is only possible in probability:
Definition (Probably approximately correct (PAC))
We say \(\hat{m}_n\) is \(\epsilon\)-accurate with probability \(1-\delta\) if \[ P\left(R\left(\hat{m}_n\right)-r\left(m^\ast\right)>\epsilon\right)<\delta. \] Note that the conditional risk \(R(\hat m_n)\) is random through \(\mathcal D_n\), so the probability is taken over the training data.
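The quantity \(\delta\) can be estimated by simulation. In the sketch below (the setup is an illustrative assumption: \(X\) is uninformative, \(L\) is the squared loss, so the Bayes rule over constant functions is \(m^*(x)=\mathbb E[Y]=\mu\), the ERM is the sample mean, and the conditional excess risk is \((\bar Y - \mu)^2\)), we estimate \(P(R(\hat m_n)-r(m^\ast)>\epsilon)\) by repeatedly drawing training sets.

```python
import numpy as np

rng = np.random.default_rng(4)
n, mu, eps = 50, 0.0, 0.05
reps = 10_000

# Conditional excess risk of the sample mean under squared loss: (Ybar - mu)^2
excess = np.array([
    (rng.normal(loc=mu, size=n).mean() - mu) ** 2 for _ in range(reps)
])
delta_hat = np.mean(excess > eps)
print(delta_hat)  # estimated P(excess risk > eps); m_hat_n is eps-accurate w.p. 1 - delta_hat
```

Here \(\bar Y \sim \mathcal N(\mu, 1/n)\), so the exact \(\delta\) is \(2\,(1-\Phi(\sqrt{n\epsilon}))\), which the simulation approximates.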