1 Background: Curse of dimensionality

Click here to view an interactive example of how the number of observations in a neighbourhood decreases when going from one to two to three dimensions.

Nonparametric models suffer from the curse of dimensionality in high-dimensional data spaces, i.e., estimation performance deteriorates exponentially in the number of variables because the distance between data points grows. Dimensionality reduction techniques (e.g. PCA or versions of neural nets) tackle this problem by assuming that the intrinsic dimension of the data is low, i.e., that most of the signal can be reconstructed from a transformation of a small subset of the initial variables. However, in many applications interpretation of the initial variables is important, and this is not directly possible once the variables have been transformed. Feature selection approaches (e.g. LASSO or random forest) assume that only a subset of the variables is relevant, without transforming the variables. However, even without transformation, interpretation remains difficult because it is hard to visualise a multivariate function of more than two variables.
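The thinning-out of neighbourhoods can be checked numerically. The following minimal sketch assumes uniformly distributed observations on the unit cube and a cubic neighbourhood of side length 0.2 around the centre; the function name, sample size and width are illustrative choices, not taken from the work described here. The expected share of observations in the neighbourhood is 0.2^d, so it shrinks geometrically with the dimension d.

```python
import numpy as np

def fraction_in_neighbourhood(n, d, width=0.2, seed=0):
    """Share of n uniform points in [0, 1]^d inside a cube of side `width` centred at 0.5."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(size=(n, d))
    # A point is a neighbour only if it is close to the centre in every coordinate.
    inside = np.all(np.abs(x - 0.5) <= width / 2, axis=1)
    return inside.mean()

# Compare the simulated share with the theoretical value width**d.
fractions = {d: fraction_in_neighbourhood(100_000, d) for d in (1, 2, 3)}
for d, frac in fractions.items():
    print(f"d={d}: observed share {frac:.4f}, expected {0.2 ** d:.4f}")
```

Keeping the number of neighbours fixed would therefore require either exponentially more data or a neighbourhood that is no longer local.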

2 Structured models

An alternative to the approaches above is to introduce structure that stabilises the system and thereby possibly enhances predictive performance. Introducing structure has the additional advantage that it makes it possible to visualise, interpret, extrapolate and forecast the properties of the underlying data. My research agenda is to exploit the idea of structured models to provide enhanced predictions and/or quantifications of risk as well as interpretation and visualisation of predictors and risk factors. My fundamental research in mathematical statistics and statistical machine learning is motivated and driven by high-impact applications. Many of those applications fall within survival analysis, so within the general framework of structured models and nonparametric statistics I often work in survival analysis and counting process theory settings, which accommodate missing data arising through truncation and censoring. Below are two selected recent examples.

2.1 Example 1: Nonparametric proportional hazard model

Hiabu M, Mammen E, Martinez-Miranda MD and Nielsen JP (2020+): Smooth backfitting of proportional hazards with multiplicative components. Journal of the American Statistical Association (Theory & Methods) [arxiv link] [doi] [github]

Smooth backfitting has proven to have a number of theoretical and practical advantages in structured regression. By projecting the data down onto the structured space of interest, smooth backfitting provides a direct link between data and estimator. This article introduces the ideas of smooth backfitting to survival analysis in a proportional hazard model, where we assume an underlying conditional hazard with multiplicative components. We develop asymptotic theory for the estimator. In a comprehensive simulation study, we show that our smooth backfitting estimator successfully circumvents the curse of dimensionality and outperforms existing estimators. This is especially the case in difficult situations such as a high number of covariates and/or high correlation between the covariates, where other estimators tend to break down. We use the smooth backfitter in a practical application where we extend recent advances in in-sample forecasting methodology by allowing more information to be incorporated, while still obeying the structural requirements of in-sample forecasting.

Simulation study with 200 simulations: Our proposed estimator (SBF) outperforms an alternative estimator. The left panel is undersmoothed (b_k = 0.2). The right panel shows that additional smoothing (b_k = 0.3) makes the Lin, He and Huang (2016) estimator too flat.
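
As a sketch of the model class described in the abstract above, with notation that is assumed here rather than taken from the paper, a conditional hazard with multiplicative components for covariates \(z=(z_1,\dots,z_d)\) can be written as
\(\alpha(t\mid z)=\alpha_0(t)\prod_{k=1}^d \alpha_k(z_k),\)
where \(\alpha_0\) is a baseline hazard in time and each \(\alpha_k\) is an unspecified function of a single covariate. The Cox model is the parametric special case \(\alpha_k(z_k)=\exp(\beta_k z_k)\). Since the components are only determined up to multiplicative constants, some normalisation is needed in practice.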

2.2 Example 2: Multiple time-scales influencing risk factors

Hiabu M, Nielsen JP and Scheike T (2020+): Non-Smooth Backfitting for Excess Risk Additive Regression Model for Survival. Biometrika [arxiv link] [doi] [github]

We consider an extension of Aalen’s additive regression model allowing covariates to have effects that vary on two different time-scales. The two time-scales considered are equal up to a constant that varies for each individual, such as follow-up time and age in medical studies or calendar time and age in longitudinal studies. The model was introduced in Scheike (2001) where it was solved via smoothing techniques. We present a new backfitting algorithm estimating the structured model without having to use smoothing. Estimators of the cumulative regression functions on the two time-scales are suggested by solving local estimating equations jointly on the two time-scales. We provide large sample properties and simultaneous confidence bands. The model is applied to data on myocardial infarction providing a separation of the two effects stemming from time since diagnosis and age.

Data application: Diabetes shows no interaction with duration (time since heart attack, i.e. myocardial infarction) but does interact with the patient's age: the older the patient, the greater the effect of diabetes on mortality. The two other risk factors, congestive heart failure (chf) and ventricular fibrillation (vf), are mainly important in the first 100 days after a heart attack.
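
To make the two time-scale structure concrete, here is a sketch in assumed notation (the symbols below are illustrative and may differ from those used in the paper). Aalen's additive model specifies the hazard of individual \(i\) with covariates \(X_i(t)\) as
\(\lambda_i(t)=X_i(t)^\top\beta(t).\)
An extension with effects on two time-scales can then be written as
\(\lambda_i(t)=X_i(t)^\top\beta(t)+X_i(t)^\top\gamma(t+a_i),\)
where \(a_i\) is the individual-specific constant linking the two scales, for example age at diagnosis, so that \(t\) is duration and \(t+a_i\) is age. The cumulative regression functions mentioned in the abstract would then be \(B(t)=\int_0^t\beta(s)\,ds\) and the corresponding integral of \(\gamma\) on the second time-scale.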

3 Interpretable Machine Learning

3.1 Example 1: Random Planted Forest

Hiabu M, Mammen E, Meyer J: Random Planted Forest: a directly interpretable tree ensemble [arxiv link] [github]

Key Features:

  • We consider a regression problem and assume that the regression function can be well approximated by lower order terms in a functional ANOVA expansion
    \(m(x)=m_0+\sum_{k=1}^d m_k(x_{k})+\sum_{k\lt l}m_{kl}(x_{k},x_{l}) + \cdots.\)
  • The maximal order of interaction can be specified. If an order less than or equal to two is chosen, the model is directly and fully interpretable: the one-dimensional functions can be plotted as curves and the two-dimensional functions as heatmaps.

  • While fully flexible, the algorithm follows a structured path by growing a family of trees simultaneously along a functional ANOVA expansion. Below is an illustration of a family of planted trees. Higher-order trees are descendants of lower-order trees. Trees grow simultaneously, and the height of the edges indicates the order in which splits occurred.

  • A first simulation study in our paper shows very promising results. The random planted forest seems able to detect both jumps in the regression function and interactions between predictors. In sparse settings in particular, the random planted forest combined accuracy and flexibility in a way that the competing methods did not match. Below is an illustration of how the random planted forest shows excellent performance in the detection of jumps.
    [Figure image1: random planted forest performance in the detection of jumps]
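
The functional ANOVA expansion in the first bullet can be illustrated on a known function. The sketch below is not the random planted forest algorithm. It only decomposes a hypothetical two-variable test function, tabulated on a grid with uniform weights, into a constant \(m_0\), two main effects \(m_1, m_2\) and an interaction \(m_{12}\). The function name and the test function are assumptions made for illustration.

```python
import numpy as np

def anova_decompose_2d(values):
    """Split a function tabulated on a 2-d grid into m0, m1, m2 and m12 (uniform weights)."""
    m0 = values.mean()                      # overall level
    m1 = values.mean(axis=1) - m0           # main effect of x1 (averaged over x2)
    m2 = values.mean(axis=0) - m0           # main effect of x2 (averaged over x1)
    m12 = values - m0 - m1[:, None] - m2[None, :]   # remaining interaction
    return m0, m1, m2, m12

x1 = np.linspace(0, 1, 50)
x2 = np.linspace(0, 1, 50)
g1, g2 = np.meshgrid(x1, x2, indexing="ij")
# Hypothetical test function: a smooth term, a jump and an interaction.
values = np.sin(2 * np.pi * g1) + (g2 > 0.5) + 2 * g1 * g2

m0, m1, m2, m12 = anova_decompose_2d(values)
reconstruction = m0 + m1[:, None] + m2[None, :] + m12
```

In this form `m1` and `m2` can be plotted as curves and `m12` as a heatmap, which is the kind of display the second bullet refers to for models of order at most two.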