Interpretable Machine Learning

Assignment I

Author

Munir Eberhardt Hiabu

Published

March 19, 2024

Part 1 (Training, Validating and Testing)

We will work with the “credit-g” dataset. The dataset classifies people described by a set of attributes as good or bad credit risks. See https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data for more details. We will fetch the data from https://www.openml.org/d/31. We will rely on the mlr3 environment (https://mlr3.mlr-org.com/environment). As classifier, we will use logistic regression with elastic net.

Load necessary packages.

library(mlr3)
library(mlr3learners)
library(mlr3tuning)
library(mlr3mbo)
library(glmnet)
library(OpenML)
library(mlr3pipelines)

### If parallelization is wanted, also:
library(future)
future::plan("multisession") 
(a) (Default fitting)

Fetch the data and create a task.

credit_data = getOMLDataSet(data.id = 31)
task = as_task_classif(credit_data$data, target = "class") 
  • Split your data into a training and test set.

  • Build a graph where

    • first: Dummy encode variables via po("encode"),
    • second: Standardize via po("scale"),
    • third: Logistic Regression with default settings is applied.
  • Train your model on the training set and evaluate it on the test set (a minimal sketch follows this list).

(b) (Cross Validation)

In this part we want to train a logistic regression with elastic net. The tuning parameters are \(\texttt{alpha}\) and \(\texttt{s}\) (Here: \(\texttt{s}\) has the same function as the penalty parameter \(\texttt{lambda}\)).

  • Build a new graph

    • First and second steps as before in part (a)
    • Third: Use learner classif.glmnet with tunable parameters:
    s = to_tune(0, 1)
    alpha = to_tune(0, 1)

    Note: Tuning \(s\) manually via cross-validation is not optimal with respect to computational efficiency and a more efficient solution is implemented directly in the classif.glmnet learner. In this exercise, for learning purposes, we will do cross-validation manually.

  • Tune your hyperparameters via the tune function with:

    • tuner: random search
    • resampling: 5-fold cross validation
    • measure: classification error
    • terminator: 50 evaluations
  • What is the CV error of the best configuration?

  • What is the test error of the best configuration?

    • It can be helpful to use
    graph_learner_elastic_net$param_set$values = instance$result_learner_param_vals
    graph_learner_elastic_net$param_set$values$classif.glmnet.lambda = graph_learner_elastic_net$param_set$values$classif.glmnet.s

    Here graph_learner_elastic_net is the name of my graph and instance is the name of my tuned instance.

  • Print the beta values via

graph_learner_elastic_net$model$classif.glmnet$model$beta
  • You may want to try out different tuning configurations
    • e.g. tuner: "mbo" or other measures.
(c) (Nested Cross Validation)
  • Try out nested cross-validation by
    • defining your graph as an auto_tuner with
      • tuner: random search,
      • resampling: 5-fold cross validation,
      • measure: classification error,
      • terminator: 50 evaluations.
    • Use resample to run 5-fold cross validation.
    • Print out the nested cross validation error.

Part 2 (Frisch–Waugh–Lovell theorem)

In this part we will study the Frisch–Waugh–Lovell theorem. The implied algorithm, with machine learning replacing linear regression, was introduced a few years ago as Double Machine Learning and has since gained considerable attention and popularity. Here, we want to verify the result via a coding exercise.

Load necessary packages.

library(mlr3)
library(mlr3learners)
library(OpenML)
library(mlr3pipelines)


Fetch and edit data.

bike_data = getOMLDataSet(data.id = 42713)
bike_data$data <- bike_data$data[, -c(7, 13, 14)]  ## remove 'casual' and 'registered' (their sum equals 'count'); also remove 'workingday' due to collinearity

### convert dates to factors 
bike_data$data$year <- factor(bike_data$data$year)
bike_data$data$month <- factor(bike_data$data$month)
bike_data$data$hour <- factor(bike_data$data$hour)
bike_data$data$weekday <- factor(bike_data$data$weekday)
  (a) Run simple least squares linear regression with response count and the single predictor windspeed. Report the coefficient you get.

  (b) Run least squares linear regression with response count and predictors being all remaining variables. Don't forget to create a graph to dummy-encode the factor variables. Report the coefficient you get for windspeed.

  (c) Do the following steps (a minimal sketch of steps (a)-(d) follows this list):

  • Run least squares linear regression with response count and predictors being all remaining variables except windspeed. Calculate the residuals and call that variable count_residuals.
  • Run least squares linear regression with response windspeed and predictors being all remaining variables except count. Calculate the residuals and call that variable windspeed_residuals.
  • Run simple least squares linear regression with response count_residuals and predictor windspeed_residuals.
  • Report the regression coefficient you get.

  (d) Verify that the coefficients in steps (b) and (c) are the same.

  (e) Replace the simple linear regression in the second-to-last step of part (c) with an auto-tuned k-nearest neighbors learner. Visualize the fit (by plotting windspeed_residuals against the observed and predicted count_residuals) and compare it to the previous simple linear regression fit. Discuss the result.

Part 3 (Tree based models)

Load necessary packages.

library(mlr3)
library(mlr3learners)
library(mlr3tuning)
library(OpenML)
library(mlr3pipelines)
library(future)
future::plan("multisession") 

Fetch the data. This is the same data as in Part 1.

# load credit-g data and define task
credit_data = getOMLDataSet(data.id = 31)
task = as_task_classif(credit_data$data, target = "class") 
(a)

Use the learner classif.rpart with predict_type = "prob" and train it on the task. Visualize the learned tree via

# extract the trained rpart model from the graph and plot the tree
full_tree_trained <- full_tree$model$classif.rpart$model
plot(full_tree_trained, compress = TRUE, margin = 0.1)
text(full_tree_trained, use.n = TRUE, cex = 0.8)

Here full_tree is the graph you trained.

(b)

We now aim to find a penalty parameter \(\alpha\) that results in a pruned tree with strong predictive power. To this end, we define a tree learner that runs the weakest-link pruning algorithm and thereafter 5-fold cross-validation to compare the performance of the resulting subtrees.

# define a CART learner with built-in 5-fold cross-validation
my_cart_learner_cv = lrn("classif.rpart", xval = 5, predict_type = "prob")

You can run the following command on the rpart object in order to see the CV result. Hint: If unsure how to extract the rpart object from your trained graph, check how this was done in part (a) above.

# inspect the cross-validation results of the trained rpart object
rpart::plotcp(cart_trained_cv)
rpart::printcp(cart_trained_cv)
(c)

Pick an \(\alpha\) that is large enough while still giving a low error. In the rpart package vignette, the following advice is given:

A plot of \(\alpha\) versus risk often has an initial sharp drop followed by a relatively flat plateau and then a slow rise. The choice of \(\alpha\) among those models on the plateau can be essentially random. To avoid this, both an estimate of the risk and its standard error are computed during the cross-validation. Any risk within one standard error of the achieved minimum is marked as being equivalent to the minimum (i.e. considered to be part of the flat plateau). Then the simplest model, among all those “tied” on the plateau, is chosen.

Train and then visualize the tree with the chosen \(\alpha\). (The relevant parameter is called cp).

(d)

Using the benchmark function, compare the predictive performance of the following five algorithms (a minimal sketch of the benchmark setup follows the list):

  • A baseline model that uses no features (classif.featureless)
  • A non-pruned CART tree
  • A pruned CART tree with \(\alpha\) as chosen in part (c).
  • An auto-tuned xgboost (classif.xgboost). You could for example tune parameters in the following way:
    • eta = to_tune(0, 0.5),
    • nrounds = to_tune(10, 5000),
    • max_depth = to_tune(1, 10).
  • An auto-tuned random forest (classif.ranger). You could for example tune parameters in the following way:
    • mtry.ratio = to_tune(0.1, 1),
    • min.node.size = to_tune(1, 50).

You may want to look at more than just the classification error. For example, you can run:

# aggregate several performance measures over the benchmark result
res$aggregate(list(msr("classif.ce"),
                   msr("classif.acc"),
                   msr("classif.auc"),
                   msr("classif.fpr"),
                   msr("classif.fnr")))

Here res is the calculated benchmark object.

Remark

Note that we are not comparing how an optimally pruned decision tree fares against other algorithms that are optimally tuned. While this would be possible (one would just need to decide how to choose the optimal \(\alpha\) explicitly and define an auto-tuner accordingly), the purpose of this task is a different one. While we expect a single decision tree to perform worse than tree ensembles, we would like to know how large the performance loss is if we choose to employ an interpretable decision tree. Here it is also essential that the \(\alpha\) chosen in (c) is large enough to lead to a sufficiently small tree.

(e)

The German Credit dataset comes with a cost matrix (see https://www.openml.org/search?type=data&id=31&sort=runs&status=active):

                  Good (predicted)   Bad (predicted)
Good (actual)             0                  1
Bad (actual)              5                  0

Use classif.costs(costs = mycosts) to define a measure with the given costs. Here, mycosts is the transpose of the cost matrix above. Use the calculated benchmark object from (d) to see how the algorithms compare under this new measure.

(f)

If time allows, you can re-run part (d) with the auto-tuned learners optimized via the measure defined in part (e) and see how much the results change.