When deciding which method to choose for a given task, one would like to pick the method with the smallest generalization error.
However, most machine learning methods depend on hyperparameters, so one first needs to decide which hyperparameters are best for the given task.
Sometimes, we would also like to compare the generalization errors of optimally tuned machine learning methods on our data.
Definition (training and test set)
One often randomly divides the given data into training data and test data:
Often \(n_1/n \in [0.8,0.95]\)
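As a sketch, the split can be implemented with a random permutation of the indices (the helper name `train_test_split` is our own here, not a reference to any library):

```python
import numpy as np

def train_test_split(X, y, train_frac=0.8, seed=0):
    """Randomly split (X, y) into a training set of size n_1 and a test set."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))    # random permutation of the indices
    n1 = int(train_frac * len(y))     # training set size n_1
    train, test = perm[:n1], perm[n1:]
    return X[train], y[train], X[test], y[test]

# toy data: 10 points, n_1/n = 0.8
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_tr, y_tr, X_te, y_te = train_test_split(X, y, train_frac=0.8)
```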
Definition (training error and test error)
The empirical risk on the training data is called the training error; the empirical risk on the test data is called the test error.
Definition (validation set)
To tune the hyperparameters of an algorithm, one often randomly divides the given training data further into training data (yes, also called training data) and validation data:
where \(\tau\) is a randomly picked permutation of \(\{1,\dots,n_1\}\), \(n_{11}\) the training set size, and \(n_{12}\) the validation set size; \(n_{11}+n_{12}=n_1\).
One can now compare different methods via the following simple algorithm:
Remarks
The procedure above is stochastic and exhibits both bias and variance, both for selecting the optimal hyperparameters and for selecting the optimal method.
The variance can be reduced by repeating steps 1-4 several times. The most popular method for doing so is nested cross-validation.
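The procedure can be sketched as follows, with ridge regression standing in for a generic method \(\hat m_{n,\lambda}\) (all names and the candidate grid are illustrative assumptions, not part of the definition above):

```python
import numpy as np

def ridge_fit(X, y, lam):
    # ridge regression as a stand-in for a generic method m_hat_{n, lambda}
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def emp_risk(w, X, y):
    # squared-error empirical risk
    return np.mean((X @ w - y) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=50)

# split the training data into a training part and a validation part
perm = rng.permutation(50)
tr, val = perm[:40], perm[40:]

# fit each candidate on the training part, evaluate on the validation part,
# and keep the hyperparameter with the smallest validation error
candidates = [0.0, 0.1, 1.0, 10.0]
val_risks = [emp_risk(ridge_fit(X[tr], y[tr], lam), X[val], y[val])
             for lam in candidates]
best_lam = candidates[int(np.argmin(val_risks))]
```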
We want to estimate the generalization error of a method \(\hat m_{n,\lambda}\) that depends on a fixed hyperparameter \(\lambda\).
Definition (K-fold Cross-validation)
Given data \(\mathcal D_n\), K-fold cross-validation proceeds as follows:
Comments and further reading
A not-so-trivial (and partly open) question: what is the cross-validation risk actually estimating?
While often impractical due to the computational cost, the noise can be reduced by performing K-fold cross-validation multiple times and averaging the obtained results.
The bias is minimal for large \(K\), since the estimate is based on training the algorithm on \(n-(n/K)\) data points.
For \(K=n\), K-fold cross-validation is also known as leave-one-out cross-validation. It is, however, often impractical because of the computational cost.
\(K=5\) and \(K=10\) are common choices in practice.
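The definition above can be sketched directly, again with ridge regression as an illustrative stand-in for \(\hat m_{n,\lambda}\) (all names are our own):

```python
import numpy as np

def ridge_fit(X, y, lam):
    # ridge regression as a stand-in for m_hat_{n, lambda}
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def kfold_cv(X, y, lam, K=5, seed=0):
    """Average held-out empirical risk over K disjoint folds S_1, ..., S_K;
    for each fold the model is trained on the remaining K-1 folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    risks = []
    for k in range(K):
        train = np.concatenate([folds[l] for l in range(K) if l != k])
        w = ridge_fit(X[train], y[train], lam)
        risks.append(np.mean((X[folds[k]] @ w - y[folds[k]]) ** 2))
    return np.mean(risks)

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
y = X @ np.array([1.0, -1.0]) + 0.1 * rng.normal(size=60)
cv5 = kfold_cv(X, y, lam=0.1, K=5)
```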
When deciding on an optimal hyperparameter, we would like to pick \[\underset{\lambda \in \Lambda}{\operatorname{argmin}} CV(\hat m_{n,\lambda}).\] But the hyperparameter space \(\Lambda\) is often multi-dimensional and partly continuous, so it is infeasible to try out all parameters.
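In practice one therefore restricts the minimization to a finite candidate set, e.g. via grid search (random search works analogously with randomly drawn candidates). A minimal sketch, where `cv_risk` is a hypothetical stand-in for the criterion \(CV(\hat m_{n,\lambda})\):

```python
import numpy as np

def grid_search(candidates, cv_risk):
    """Minimize the CV criterion over a finite grid of candidates."""
    risks = [cv_risk(lam) for lam in candidates]
    return candidates[int(np.argmin(risks))]

# hypothetical risk curve with its minimum at lam = 1.0
best = grid_search([0.01, 0.1, 1.0, 10.0], lambda lam: np.log10(lam) ** 2)
```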
Common practices are
On the previous slides we discussed how cross-validation can be used to pick an optimal parameter for a given algorithm and data set. But how do we choose between different algorithm classes indexed by hyperparameters?
One popular approach is nested cross-validation, comprising an inner loop for hyperparameter selection (tuning) and an outer loop for the comparison between algorithm classes.
Assume that we want to compare \(J\) algorithms \(\hat m_{j}\), \(j=1,\dots,J\), each depending on some hyperparameters.
Definition (Nested \(K_1-K_2\) Cross-validation)
Given data \(\mathcal D_n\):
Divide the data into \(K_1\) disjoint sets \(S_1,\dots,S_{K_1}\) of equal size. Define \(S_{-k}=\cup_{l \neq k} S_l\).
For \(k=1,\dots, K_1\), run \(K_2\)-fold cross-validation on \(S_{-k}\) for all \(J\) methods, returning optimal hyperparameters \(\hat \lambda(j,k)\), \(j=1,\dots,J\) (measured via the cross-validated empirical risk).
Calculate the cross-validated empirical risk: \(CV(\hat m_{j})=K_1^{-1}\sum_{k=1}^{K_1}|S_{k}|^{-1} \sum_{i \in S_{k}} L(Y_i, \hat m_{j,\hat \lambda(j,k)}(S_{-k})(X_i))\).
Pick the method with the smallest risk (and possibly retune its hyperparameters on the full data before the final fit).
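The whole procedure can be sketched as follows. To keep the example short, the \(J\) "methods" are all ridge regression and differ only in their hyperparameter grids; all names are illustrative assumptions, not part of the definition:

```python
import numpy as np

def ridge_fit(X, y, lam):
    # ridge regression stands in for each method m_hat_j
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(w, X, y):
    return np.mean((X @ w - y) ** 2)

def inner_cv(X, y, lam, K, seed=0):
    # K_2-fold CV risk of one hyperparameter value, computed on S_{-k}
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), K)
    risks = []
    for k in range(K):
        tr = np.concatenate([folds[l] for l in range(K) if l != k])
        risks.append(mse(ridge_fit(X[tr], y[tr], lam), X[folds[k]], y[folds[k]]))
    return np.mean(risks)

def nested_cv(X, y, grids, K1=3, K2=3, seed=0):
    """Outer K_1-fold loop compares methods; inner K_2-fold loop selects
    lambda_hat(j, k) on each S_{-k}. Returns one CV risk per method."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), K1)
    cv = []
    for grid in grids:                  # one candidate grid per "method" j
        outer = []
        for k in range(K1):
            tr = np.concatenate([folds[l] for l in range(K1) if l != k])
            lam_hat = min(grid, key=lambda lam: inner_cv(X[tr], y[tr], lam, K2))
            w = ridge_fit(X[tr], y[tr], lam_hat)   # refit on all of S_{-k}
            outer.append(mse(w, X[folds[k]], y[folds[k]]))
        cv.append(np.mean(outer))
    return cv

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 2))
y = X @ np.array([1.0, -1.0]) + 0.1 * rng.normal(size=60)
cv_risks = nested_cv(X, y, grids=[[0.01, 0.1], [10.0, 100.0]])
best_j = int(np.argmin(cv_risks))
```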