Interpretable Machine Learning

Assignment II

Author

Munir Eberhardt Hiabu

Published

March 26, 2024

In this week’s assignment we compare different ways to encode categorical variables. A nice overview of encoding strategies for categorical variables, together with a benchmark on various data sets, can be found in this article (Pargent et al. 2022). The authors find that a strong alternative to target encoding is glmm encoding, where each categorical variable is encoded by the predictions of a linear mixed model with a simple random intercept, fit separately per variable. The heuristic behind glmm encoding is that the conditional mean used in target encoding is pulled towards the unconditional mean as a form of regularization; the pull is stronger for categories with low frequency.
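The shrinkage heuristic can be illustrated with a small base-R sketch. Note that the smoothing weight m below is a hypothetical constant chosen for illustration; a linear mixed model's random-intercept predictions perform an analogous, data-driven shrinkage.

```r
# Toy illustration of the shrinkage behind glmm encoding: category means
# are pulled towards the grand mean, and the pull is stronger for rare
# levels. The smoothing weight m is a hypothetical constant chosen for
# illustration only.
shrunken_means = function(y, x, m = 10) {
  grand = mean(y)                  # unconditional mean
  n_k   = tapply(y, x, length)     # category frequencies
  mu_k  = tapply(y, x, mean)       # conditional (per-category) means
  (n_k * mu_k + m * grand) / (n_k + m)
}

set.seed(1)
x = factor(c(rep("frequent", 50), rep("rare", 2)))
y = c(rnorm(50, mean = 0), rnorm(2, mean = 3))
enc = shrunken_means(y, x)
# the rare level's encoding sits much closer to the grand mean than its
# raw conditional mean, while the frequent level is barely shrunken
```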

Part 1 (Comparison of five encoding strategies on the credit-g data set)

  • Load the relevant packages, load the credit-g data, and create the corresponding task.
library(mlr3)
library(mlr3learners)
library(mlr3tuning)
library(OpenML)
library(mlr3pipelines)
library(future)
future::plan("multisession") 

# load credit-g data and define task
credit_data = getOMLDataSet(data.id = 31)
task = as_task_classif(credit_data$data, target = "class") 
  • We will use a Random Forest from the ranger package as the learner, with the following settings.
lrn("classif.ranger",
    mtry.ratio = to_tune(0.1, 1),
    min.node.size = to_tune(1, 50),
    predict_type = "prob"
    )
  • To compare different encoding strategies, define five different graphs:
    1. dummy encoding %>>% Random Forest learner
    2. target encoding %>>% Random Forest learner (use po("encodeimpact"))
    3. Random Forest learner where target encoding is done within the ranger package (respect.unordered.factors = "order")
    4. Random Forest learner where target encoding is done within the ranger package and before every split (respect.unordered.factors = "partition")
    5. glmm encoding %>>% Random Forest learner (use po("encodelmer"))
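Assuming the packages and the ranger learner settings from above, the five graphs might be sketched as follows (a sketch of the setup, not a full solution):

```r
library(mlr3)
library(mlr3learners)
library(mlr3tuning)
library(mlr3pipelines)

# Random Forest learner with the tuning ranges from above
lrn_rf = lrn("classif.ranger",
  mtry.ratio    = to_tune(0.1, 1),
  min.node.size = to_tune(1, 50),
  predict_type  = "prob")

# 1. dummy (treatment) encoding
g1 = po("encode", method = "treatment") %>>% lrn_rf
# 2. target encoding
g2 = po("encodeimpact") %>>% lrn_rf
# 3. ranger-internal encoding: levels ordered by the response once
g3 = lrn_rf$clone(deep = TRUE)
g3$param_set$values$respect.unordered.factors = "order"
# 4. ranger-internal encoding before every split
g4 = lrn_rf$clone(deep = TRUE)
g4$param_set$values$respect.unordered.factors = "partition"
# 5. glmm encoding
g5 = po("encodelmer") %>>% lrn_rf
```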
  • Run a nested cross-validation for each graph where
    • the inner CV (hyperparameter tuning) runs 5-fold CV with random search and 50 evaluations,
    • the outer CV runs 3-fold CV.
  • Measure the computational time it takes to run the nested cross-validation for each of the five graphs.
  • Compare the different graphs/encoding strategies in terms of both predictive performance and computational time.
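One way to set up the nested resampling for a single graph, sketched with auto_tuner from mlr3tuning (the AUC measure is an assumption, since the assignment does not fix a measure; repeat analogously for each of the five graphs and the task defined above):

```r
library(mlr3)
library(mlr3learners)
library(mlr3tuning)
library(mlr3pipelines)

# example graph: target encoding followed by the Random Forest learner
graph = po("encodeimpact") %>>%
  lrn("classif.ranger",
      mtry.ratio    = to_tune(0.1, 1),
      min.node.size = to_tune(1, 50),
      predict_type  = "prob")

# inner CV: 5-fold, random search, 50 evaluations
at = auto_tuner(
  tuner      = tnr("random_search"),
  learner    = as_learner(graph),
  resampling = rsmp("cv", folds = 5),
  measure    = msr("classif.auc"),
  term_evals = 50
)

# outer CV: 3-fold; also record the wall-clock time
runtime = system.time(
  rr <- resample(task, at, rsmp("cv", folds = 3))
)
rr$aggregate(msr("classif.auc"))
runtime["elapsed"]
```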

Part 2 (Come up with your own encoding strategy)

  • Come up with your own encoding strategy. You could, for example, encode some categorical features manually as integers, using your expert opinion for the ordering. For more inspiration, you may also take a look at Chapter 9.2 of the mlr3 book: https://mlr3book.mlr-org.com/chapters/chapter9/preprocessing.html. Compare the performance of your strategy to the previous five.
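As one illustration of a manual strategy: savings_status in credit-g describes monetary bands with a natural order. The level names below are assumptions about the credit-g coding and should be checked against levels(credit_data$data$savings_status):

```r
library(mlr3)
library(mlr3learners)
library(mlr3tuning)
library(mlr3pipelines)

# map the ordered savings bands to integers
# (the level names are assumptions about the data set's coding)
po_manual = po("colapply",
  applicator = function(x) as.integer(factor(x, levels = c(
    "<100", "100<=X<500", "500<=X<1000", ">=1000", "no known savings"))),
  affect_columns = selector_name("savings_status")
)

# combine with dummy encoding for the remaining factors and the learner
g6 = po_manual %>>% po("encode", method = "treatment") %>>%
  lrn("classif.ranger",
      mtry.ratio    = to_tune(0.1, 1),
      min.node.size = to_tune(1, 50),
      predict_type  = "prob")
```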

References

Pargent, Florian, Florian Pfisterer, Janek Thomas, and Bernd Bischl. 2022. “Regularized Target Encoding Outperforms Traditional Methods in Supervised Machine Learning with High Cardinality Features.” Computational Statistics 37 (5): 2671–92.