\(\begin{aligned} \lambda(\boldsymbol{x})= & 0.3 \times\left(1+0.1 I_{\left\{{\color{#5692e4}x_1}=\text { male }\right\}}\right) \\ & \times\left(1+\frac{1}{\sqrt{{\color{#D9421A}x_2}-17}}\right) \\ & \times\left(1+0.3 I_{\left\{18 \leq {\color{#D9421A}x_2}<35\right\}I_{\left\{{\color{#009151}x_4}=\text { yes }\right\}}}-0.3 I_{\left\{45 \leq {\color{#D9421A}x_2}<65\right\}I_{\left\{{\color{#009151}x_4}=\text { yes }\right\}}}\right) \end{aligned}\)
Toy example II
\[\hat m_n(x_1,x_2)=x_1+x_2 + 2x_1x_2.\]
The interventional SHAP value for the first feature is \[ \phi_1(x_1,x_2)=x_1-\mathbb E[X_1]+x_1x_2 -\mathbb E[X_1X_2]+x_1\mathbb E[X_2]-x_2\mathbb E[X_1]. \] If the features are standardized, i.e., \(X_1\) and \(X_2\) have mean zero and variance one, then
\[\phi_1(x_1,x_2)=x_1+x_1x_2 -\textrm{corr}(X_1,X_2).\]
Partial dependence plots seem to ignore interaction effects
Interventional SHAP values merge interaction effects with main effects, which can thereby cancel each other out
Conclusion: Both SHAP values and partial dependence plots have problems with interactions…
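The cancellation can be made concrete in toy example II. With standardized features and \(\textrm{corr}(X_1,X_2)=0\), the interventional SHAP value from above is \(\phi_1(x_1,x_2)=x_1+x_1x_2\), so

\[ \phi_1(x_1,-1)=x_1+x_1\cdot(-1)=0 \quad \text{for every } x_1. \]

At every point with \(x_2=-1\), the interaction contribution \(x_1x_2=-x_1\) exactly cancels the nonzero main effect \(x_1\), and the SHAP value reports no contribution from feature 1 at all.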
Functional decompositions
Assume a data set with \(p\) features. Also assume that we can approximate the regression function \(m\) by a \(q\)-th-order functional decomposition:
\[m(x) \approx m_0+\sum_{k=1}^p m_k(x_{k}) + \sum_{k_1<k_2} m_{k_1k_2}(x_{k_1},x_{k_2}) + \cdots +\sum_{k_1<\cdots <k_q} m_{k_1,\dots,k_q} (x_{k_1},\dots,x_{k_q}).\]
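For example, for \(p=3\) and \(q=2\) this reads

\[m(x) \approx m_0+m_1(x_1)+m_2(x_2)+m_3(x_3)+m_{12}(x_1,x_2)+m_{13}(x_1,x_3)+m_{23}(x_2,x_3),\]

i.e., the single term involving all three features at once is dropped.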
| Model | Convergence rate | Comparable sample size for \(p= 6\) |
|---|---|---|
| Full model | \(O_p(n^{-2/(p+4)})\) = \(O_p(n^{-1/5})\) | 1 000 000 |
| Interaction (q) | \(O_p(n^{-2/(q+4)})\) | 1 000 -- 1 000 000 |
| Interaction (q=2) | \(O_p(n^{-1/3})\) | 4 000 |
| Additive (q=1) | \(O_p(n^{-2/5})\) | 1 000 |
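The "comparable sample size" column can be read as a back-of-the-envelope calculation from the stated rates: a \(q\)-th-order model with \(n_q\) observations matches the full model with \(n\) observations when

\[ n_q^{-2/(q+4)} = n^{-2/(p+4)} \quad\Longleftrightarrow\quad n_q = n^{(q+4)/(p+4)}. \]

With \(p=6\) and \(n=10^6\), the additive model (\(q=1\)) needs \(n_1 = 10^{6\cdot 5/10}=10^{3}=1\,000\) observations, and the pairwise-interaction model (\(q=2\)) needs \(n_2 = 10^{6\cdot 6/10}=10^{3.6}\approx 4\,000\).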
Generalized ANOVA (Stone 1994) is probably the most widely considered identification for functional decompositions. We consider \(\mathcal X=\mathbb R^{p}\) (so that the number of covariates \(p\) is not confused with the density).
Common choices for \(w\) are
Marginal Identification
Definition [Marginal identification]
Consider a regression function \(m: \mathbb R^p \rightarrow \mathbb R\) with functional decomposition \[ m(x)=\sum_{S\subseteq I_p} m_S(x_S). \] If for every \(S\subseteq I_p\), \[\sum_{T: T \cap S \neq \emptyset} \int m_T(x_T) p_{S}(x_{S}) \mathrm dx_S=0,\] we say that the functional decomposition satisfies the marginal identification.
Theorem (Existence and uniqueness of marginal identification)
Given a function \(m\), there exists exactly one set of components \(\{m_S \mid S \subseteq I_p\}\) that fulfills the marginal identification. The components are given by \[ m_S (x_S)= \sum_{V\subseteq S} (-1)^{|S\setminus V|}\xi_V(x_V), \] where \(\xi_V\) denotes the partial dependence function of the feature set \(V\). In particular, \(\xi_\emptyset=\mathbb E[m(X)]\).
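For \(p=2\), the inclusion-exclusion formula spells out as

\[\begin{aligned} m_\emptyset &= \xi_\emptyset, \\ m_1(x_1) &= \xi_1(x_1)-\xi_\emptyset, \\ m_{12}(x_1,x_2) &= \xi_{12}(x_1,x_2)-\xi_1(x_1)-\xi_2(x_2)+\xi_\emptyset, \end{aligned}\]

so each component subtracts from the partial dependence function everything already explained by lower-order terms.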
Corollary [Marginal identification \(\leftrightarrow\) PDP ]
If \(\hat m_n=\sum_S \hat m_S\) satisfies the marginal identification, then \[\xi_S=\sum_{U \subseteq S} \hat m_U.\] In particular if \(S\) is only one feature, i.e., \(S=\{k\}\), we have \[\xi_k(x_k)= \hat m_0 + \hat m_k(x_k).\]
Theorem [Marginal identification \(\leftrightarrow\) Interventional SHAP ]
If \(\hat m_n=\sum_S \hat m_S\) satisfies the marginal identification, then
\[ \phi_k(x)= \hat m_k(x_k)+ \frac 1 2 \sum_{j\neq k} \hat m_{kj}(x_k,x_j) + \cdots + \frac 1 p \hat m_{1,\dots,p}(x_1,\dots,x_p). \]
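As a quick numerical sanity check of this theorem (a minimal base-R sketch, not part of the original analysis): take toy example II, \(m(x)=x_1+x_2+2x_1x_2\), with independent standardized features, so that the marginal-identification components are \(\hat m_1(x_1)=x_1\) and \(\hat m_{12}(x_1,x_2)=2x_1x_2\).

```r
# Sketch: check phi_1 = m_1 + (1/2) m_12 for m(x) = x1 + x2 + 2*x1*x2
# with independent standardized features (so corr(X1, X2) = 0).
set.seed(1)
n  <- 1e5
X1 <- rnorm(n)
X2 <- rnorm(n)
m  <- function(x1, x2) x1 + x2 + 2 * x1 * x2

x1 <- 0.7; x2 <- -1.2                       # evaluation point

# Interventional SHAP for feature 1, averaged over the two orderings
v_empty <- mean(m(X1, X2))                  # E[m(X1, X2)]
v_1     <- mean(m(x1, X2))                  # E[m(x1, X2)]
v_2     <- mean(m(X1, x2))                  # E[m(X1, x2)]
v_12    <- m(x1, x2)
phi1 <- 0.5 * (v_1 - v_empty) + 0.5 * (v_12 - v_2)

# Via the decomposition: m_1(x1) = x1, m_12(x1, x2) = 2*x1*x2
phi1_decomp <- x1 + 0.5 * (2 * x1 * x2)

c(shap = phi1, decomposition = phi1_decomp)  # both close to -0.14
```

Up to Monte Carlo error, the directly computed interventional SHAP value agrees with \(\hat m_1(x_1)+\frac12\hat m_{12}(x_1,x_2)\).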
We now fit and interpret the data via a functional decomposition with marginal identification.
```r
library(xgboost)
library(glex)

# Fit a Poisson boosting model on the four features
xg <- xgboost(
  data = cbind(x1, x2, x3, x4), label = y,
  params = list(max_depth = 4, eta = .01, objective = "count:poisson"),
  nrounds = 100, verbose = 0
)

# Calculate the functional decomposition with marginal identification
res <- glex(xg, x)
```
```
res$m
                 x1           x2           x4         x1:x2        x1:x4
              <num>        <num>        <num>         <num>        <num>
     1: -0.01153930  0.254763806 -0.003807825  0.0115393039  0.001131369
     2: -0.01153930 -0.070856660  0.003796980 -0.0030981568 -0.001124274
     3: -0.01153930 -0.073284602  0.003796980 -0.0006702144 -0.001124274
     4: -0.01153930 -0.009837431  0.003796980 -0.0020662462 -0.001124274
     5:  0.01153167 -0.069309124  0.003796980  0.0015543309  0.001129098
    ---
499996:  0.01153167 -0.009837431  0.003796980  0.0020742189  0.001129098
499997: -0.01153930  0.064109258 -0.003807825  0.0022360787  0.001131369
499998: -0.01153930  0.075669850 -0.003807825  0.0022360787  0.001131369
499999: -0.01153930 -0.075727647  0.003796980 -0.0031179563 -0.001124274
500000:  0.01153167 -0.075727647 -0.003807825  0.0031216284 -0.001129654
               x2:x4      x1:x2:x4
               <num>         <num>
     1:  0.067772348 -1.131369e-03
     2:  0.052166467 -2.518432e-03
     3:  0.049741495 -9.346023e-05
     4: -0.005752768 -4.512167e-04
     5:  0.053718925  9.701917e-04
    ---
499996: -0.005752768  4.503037e-04
499997:  0.051318240 -2.826504e-03
499998:  0.062867555 -2.826504e-03
499999:  0.047290680 -2.538208e-03
500000: -0.047238057 -2.530277e-03
```
```r
library(glex)
# Model fitting
library(xgboost)
# Visualization
library(ggplot2)
library(patchwork)
theme_set(theme_minimal(base_size = 13))

vi_xgb <- glex_vi(res)
p_vi <- autoplot(vi_xgb, threshold = 0) +
  labs(title = NULL, tag = "XGBoost-explanation")
p_vi +
  plot_annotation(title = "Variable importance scores by term") &
  theme(plot.tag.position = "bottomleft")
```
```r
# Main effect of x1 (gender): one observation per level
# (which.max on a logical returns the first TRUE)
barplot(
  as.numeric(c(res$m$x1[which.max(x1 == 0)], res$m$x1[which.max(x1 == 1)])),
  names.arg = c("female", "male")
)

# Main effect of x4 (sports car)
barplot(
  as.numeric(c(res$m$x4[which.max(x4 == 1)], res$m$x4[which.max(x4 == 0)])),
  names.arg = c("sportscar: yes", "sportscar: no")
)

# Interaction effect of age (x2) and sports car (x4)
idx0 <- which(x4 == 0); idx0 <- idx0[order(x2[idx0])]
idx1 <- which(x4 == 1); idx1 <- idx1[order(x2[idx1])]
plot(as.numeric(x2[idx0]), as.numeric(res$m$`x2:x4`[idx0]),
     type = "l", ylim = c(-0.12, 0.12), lwd = 4, xlab = "x2", ylab = "m24")
lines(as.numeric(x2[idx1]), as.numeric(res$m$`x2:x4`[idx1]),
      col = "blue", lwd = 4)
legend("topright", legend = c("no sports car", "sports car"),
       col = c("black", "blue"), cex = 1, lwd = 5)
```

\[\hat m_n(x_1,x_2)=x_1+x_2+2x_1x_2.\] If \(X_1\) and \(X_2\) each have mean zero and variance one, then
Functional decomposition with marginal identification: \[ \begin{eqnarray} \hat m_0&=&2\,\textrm{corr}(X_1,X_2) \\ \hat m_1(x_1)&=& x_1 -2\,\textrm{corr}(X_1,X_2) \\ \hat m_2(x_2)&=&x_2 - 2\,\textrm{corr}(X_1,X_2) \\ \hat m_{12}(x_1,x_2)&=& 2x_1x_2 + 2\,\textrm{corr}(X_1,X_2). \end{eqnarray} \]
Partial dependence plot of \(x_1\) is \(\hat m_0 + \hat m_1(x_1)=x_1\)
Interventional SHAP of \(x_1\) is \(\hat m_1(x_1)+ 0.5 \times \hat m_{12}(x_1,x_2) = x_1+x_1x_2-\textrm{corr}(X_1,X_2)\)