When deciding which method to choose for a given task, one would like to pick the method with the smallest generalization error.
However, most machine learning methods depend on hyperparameters, so one first needs to decide which hyperparameters are best for the given task.
Sometimes, we would also like to compare the generalization errors of optimally tuned machine learning methods on our data.
Definition (training and test set)
One often randomly divides the given data into training data and test data:
Often \(n_1/n \in [0.8,0.95]\)
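As a sketch, the split can be implemented with a random permutation of the indices (the helper name `train_test_split` is our own here, not a reference to any library):

```python
import numpy as np

def train_test_split(X, y, train_frac=0.8, seed=0):
    """Randomly split (X, y) into a training set of size n_1 and a test set."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))    # random permutation of the indices
    n1 = int(train_frac * len(y))     # training set size n_1
    train, test = perm[:n1], perm[n1:]
    return X[train], y[train], X[test], y[test]

# toy data: 10 points, n_1/n = 0.8
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_tr, y_tr, X_te, y_te = train_test_split(X, y, train_frac=0.8)
```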
Definition (training error and test error)
The empirical risk on the training data is called the training error; the empirical risk on the test data is called the test error.
Definition (validation set)
To tune the hyperparameters of an algorithm, one often randomly divides the given training data further into training data (yes, also called training data) and validation data:
where \(\tau\) is a randomly picked permutation of \(\{1,\dots,n_1\}\), \(n_{11}\) the training set size, and \(n_{12}\) the validation set size; \(n_{11}+n_{12}=n_1\).
One can now compare different methods via the following simple algorithm:
Remarks
The procedure above is stochastic and exhibits both bias and variance, both for selecting the optimal hyperparameters and for selecting the optimal method.
The variance can be reduced by repeating steps 1-4 several times. The most popular method for doing so is nested cross-validation.
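The procedure can be sketched as follows, with ridge regression standing in for a generic method \(\hat m_{n,\lambda}\) (all names and the candidate grid are illustrative assumptions, not part of the definition above):

```python
import numpy as np

def ridge_fit(X, y, lam):
    # ridge regression as a stand-in for a generic method m_hat_{n, lambda}
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def emp_risk(w, X, y):
    # squared-error empirical risk
    return np.mean((X @ w - y) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=50)

# split the training data into a training part and a validation part
perm = rng.permutation(50)
tr, val = perm[:40], perm[40:]

# fit each candidate on the training part, evaluate on the validation part,
# and keep the hyperparameter with the smallest validation error
candidates = [0.0, 0.1, 1.0, 10.0]
val_risks = [emp_risk(ridge_fit(X[tr], y[tr], lam), X[val], y[val])
             for lam in candidates]
best_lam = candidates[int(np.argmin(val_risks))]
```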
We want to estimate the generalization error of a method \(\hat m_{n,\lambda}\) that depends on a fixed hyperparameter \(\lambda\).
Definition (K-fold Cross-validation)
Given data \(\mathcal D_n\), K-fold cross-validation proceeds as follows:
Comments and further reading
A not-so-trivial (and partly open) question: what is the cross-validation risk actually estimating?
While often impractical due to the computational cost, the noise can be reduced by performing K-fold cross-validation multiple times and averaging the obtained results.
The bias is minimal for large \(K\), since the estimate is based on training the algorithm on \(n-(n/K)\) data points.
For \(K=n\), K-fold cross-validation is also known as leave-one-out cross-validation. It is, however, often impractical because of the computational cost.
\(K=5\) and \(K=10\) are common choices in practice.
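The definition above can be sketched directly, again with ridge regression as an illustrative stand-in for \(\hat m_{n,\lambda}\) (all names are our own):

```python
import numpy as np

def ridge_fit(X, y, lam):
    # ridge regression as a stand-in for m_hat_{n, lambda}
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def kfold_cv(X, y, lam, K=5, seed=0):
    """Average held-out empirical risk over K disjoint folds S_1, ..., S_K;
    for each fold the model is trained on the remaining K-1 folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    risks = []
    for k in range(K):
        train = np.concatenate([folds[l] for l in range(K) if l != k])
        w = ridge_fit(X[train], y[train], lam)
        risks.append(np.mean((X[folds[k]] @ w - y[folds[k]]) ** 2))
    return np.mean(risks)

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
y = X @ np.array([1.0, -1.0]) + 0.1 * rng.normal(size=60)
cv5 = kfold_cv(X, y, lam=0.1, K=5)
```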
When deciding on an optimal hyperparameter, we would like to pick \[\underset{\lambda \in \Lambda}{\operatorname{argmin}} CV(\hat m_{n,\lambda}).\] But the hyperparameter space \(\Lambda\) is often multi-dimensional and partly continuous, so it is infeasible to try out all parameters.
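In practice one therefore restricts the minimization to a finite candidate set, e.g. via grid search (random search works analogously with randomly drawn candidates). A minimal sketch, where `cv_risk` is a hypothetical stand-in for the criterion \(CV(\hat m_{n,\lambda})\):

```python
import numpy as np

def grid_search(candidates, cv_risk):
    """Minimize the CV criterion over a finite grid of candidates."""
    risks = [cv_risk(lam) for lam in candidates]
    return candidates[int(np.argmin(risks))]

# hypothetical risk curve with its minimum at lam = 1.0
best = grid_search([0.01, 0.1, 1.0, 10.0], lambda lam: np.log10(lam) ** 2)
```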
Common practices are
On the previous slides we discussed how cross-validation can be used to pick an optimal parameter for a given algorithm and data set. But how do we choose between different algorithm classes indexed by hyperparameters?
One popular approach is nested cross-validation, comprising an inner loop for hyperparameter selection (tuning) and an outer loop for the comparison between algorithm classes.
Assume that we want to compare \(J\) algorithms \(\hat m_{j}\), \(j=1,\dots,J\), each depending on some hyperparameters.
Definition (Nested \(K_1-K_2\) Cross-validation)
Given data \(\mathcal D_n\):
Divide the data into \(K_1\) disjoint sets \(S_1,\dots,S_{K_1}\) of equal size. Define \(S_{-k}=\cup_{l \neq k} S_l\).
For \(k=1,\dots, K_1\), run \(K_2\)-fold cross-validation on \(S_{-k}\) for all \(J\) methods, returning optimal hyperparameters \(\hat \lambda(j,k)\), \(j=1,\dots,J\) (measured via the cross-validated empirical risk).
Calculate the cross-validated empirical risk: \(CV(\hat m_{j})=K_1^{-1}\sum_{k=1}^{K_1}|S_{k}|^{-1} \sum_{i \in S_{k}} L(Y_i, \hat m_{j,\hat \lambda(j,k)}(S_{-k})(X_i))\).
Pick the method with the smallest risk (and possibly retune its hyperparameters on the full data before the final fit).
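The whole procedure can be sketched as follows. To keep the example short, the \(J\) "methods" are all ridge regression and differ only in their hyperparameter grids; all names are illustrative assumptions, not part of the definition:

```python
import numpy as np

def ridge_fit(X, y, lam):
    # ridge regression stands in for each method m_hat_j
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(w, X, y):
    return np.mean((X @ w - y) ** 2)

def inner_cv(X, y, lam, K, seed=0):
    # K_2-fold CV risk of one hyperparameter value, computed on S_{-k}
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), K)
    risks = []
    for k in range(K):
        tr = np.concatenate([folds[l] for l in range(K) if l != k])
        risks.append(mse(ridge_fit(X[tr], y[tr], lam), X[folds[k]], y[folds[k]]))
    return np.mean(risks)

def nested_cv(X, y, grids, K1=3, K2=3, seed=0):
    """Outer K_1-fold loop compares methods; inner K_2-fold loop selects
    lambda_hat(j, k) on each S_{-k}. Returns one CV risk per method."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), K1)
    cv = []
    for grid in grids:                  # one candidate grid per "method" j
        outer = []
        for k in range(K1):
            tr = np.concatenate([folds[l] for l in range(K1) if l != k])
            lam_hat = min(grid, key=lambda lam: inner_cv(X[tr], y[tr], lam, K2))
            w = ridge_fit(X[tr], y[tr], lam_hat)   # refit on all of S_{-k}
            outer.append(mse(w, X[folds[k]], y[folds[k]]))
        cv.append(np.mean(outer))
    return cv

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 2))
y = X @ np.array([1.0, -1.0]) + 0.1 * rng.normal(size=60)
cv_risks = nested_cv(X, y, grids=[[0.01, 0.1], [10.0, 100.0]])
best_j = int(np.argmin(cv_risks))
```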