Lecture 10: Functional decomposition

Munir Hiabu

Local and Global Explanations

  • In this lecture we will connect the concept of local and global explanations via functional decompositions

Partial dependence plots Toy example I

https://detralytics.com/wp-content/uploads/2020/03/Faqctuary_2020-02_Features-with-flat-partial-dependence-plots.pdf

  • \(X_1=\) Gender: policyholder’s gender (female or male);
  • \(X_2=\) Age: policyholder’s age (integer values from 18 to 65 );
  • \(X_3=\) Split: whether the policyholder splits the annual premium or not (yes or no);
  • \(X_4=\) Sport: whether the policyholder’s car is a sports car or not (yes or no).

\(\begin{aligned} \lambda(\boldsymbol{x})= & 0.3 \times\left(1+0.1\, I_{\left\{{\color{#5692e4}x_1}=\text { male }\right\}}\right) \\ & \times\left(1+\frac{1}{\sqrt{{\color{#D9421A}x_2}-17}}\right) \\ & \times\left(1+0.3\, I_{\left\{18 \leq {\color{#D9421A}x_2}<35\right\}}\, I_{\left\{{\color{#009151}x_4}=\text { yes }\right\}}-0.3\, I_{\left\{45 \leq {\color{#D9421A}x_2}<65\right\}}\, I_{\left\{{\color{#009151}x_4}=\text { yes }\right\}}\right) \end{aligned}\)
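The claim frequency above can be simulated directly; a minimal sketch (the sample size, the covariate distributions, and the 0/1 codings for gender and sports car are assumptions, not from the original paper):

```r
set.seed(1)
n <- 1000
x1 <- rbinom(n, 1, 0.5)                  # 1 = male (assumed coding)
x2 <- sample(18:65, n, replace = TRUE)   # age
x3 <- rbinom(n, 1, 0.5)                  # premium split (no effect on lambda)
x4 <- rbinom(n, 1, 0.2)                  # 1 = sports car (assumed coding)

lambda <- 0.3 * (1 + 0.1 * (x1 == 1)) *
  (1 + 1 / sqrt(x2 - 17)) *
  (1 + 0.3 * (x2 >= 18 & x2 < 35) * (x4 == 1) -
       0.3 * (x2 >= 45 & x2 < 65) * (x4 == 1))
y <- rpois(n, lambda)                    # Poisson claim counts
```

Note that \(x_4\) raises the frequency for young drivers and lowers it for older drivers, which is exactly the interaction a partial dependence plot will average away.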

Partial dependence plots Toy example I

Random forest explanations

  • Problem in Toy Example I: partial dependence plots indicate that \(X_4\) (sports car) has no effect. This happens because \(X_4\) affects the outcome only via an interaction with \(X_2\) (age).

Shapley values Toy example II


\[\hat m_n(x_1,x_2)=x_1+x_2 + 2x_1x_2.\]

The interventional SHAP value for the first feature is \[ \phi_1(x_1,x_2)=x_1-\mathbb E[X_1]+x_1x_2 -\mathbb E[X_1X_2]+x_1\mathbb E[X_2]-x_2\mathbb E[X_1]. \] If the features are standardized, i.e., \(X_1\) and \(X_2\) have mean zero and variance one, then

\[\phi_1(x_1,x_2)=x_1+x_1x_2 -\textrm{corr}(X_1,X_2).\]

  • Assume \(\textrm{corr}(X_1,X_2)=0.3\)
    • individual with \(x_1=1\) and \(x_2=-0.7\) would see a SHAP value of 0 for the first feature: \[\phi_1(1,-0.7) = 0.\]
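The cancellation can be checked numerically; a small sketch using the closed-form SHAP value derived above:

```r
# Interventional SHAP value for feature 1 (standardized features):
# phi_1(x1, x2) = x1 + x1*x2 - corr(X1, X2)
phi1 <- function(x1, x2, rho) x1 + x1 * x2 - rho

phi1(1, -0.7, 0.3)  # main and interaction contributions cancel (zero up to rounding)
```

So a feature that clearly matters for this individual is assigned a SHAP value of zero.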

Conclusion of example I+II

  • Partial dependence plots seem to ignore interaction effects

  • Interventional SHAP values merge interaction effects and main effects, which can thereby cancel each other out

  • Conclusion: Both SHAP values and partial dependence plots have problems with interactions…



Functional decompositions

Functional decompositions

Assume a data set with \(p\) features, and assume that the regression function \(m\) can be approximated by a \(q\)-th order functional decomposition:

\[m(x) \approx m_0+\sum_{k=1}^p m_k(x_{k}) + \sum_{k_1<k_2} m_{k_1k_2}(x_{k_1},x_{k_2}) + \cdots +\sum_{k_1<\cdots <k_q} m_{k_1,\dots,k_q} (x_{k_1},\dots,x_{k_q}).\]

  • conjectured optimal rates of convergence under the assumption that \(m\) has two continuous partial derivatives:
| Model | Rate of convergence | Comparable sample size (\(p=6\)) |
|---|---|---|
| Full model | \(O_p(n^{-2/(p+4)}) = O_p(n^{-1/5})\) for \(p=6\) | 1 000 000 |
| Interaction order \(q\) | \(O_p(n^{-2/(q+4)})\) | 1 000 – 1 000 000 |
| Interaction (\(q=2\)) | \(O_p(n^{-1/3})\) | 4 000 |
| Additive (\(q=1\)) | \(O_p(n^{-2/5})\) | 1 000 |
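The "comparable sample size" column equates the \(q\)-th order rate at sample size \(n_q\) with the full-model rate at \(n=10^6\): solving \(n_q^{-2/(q+4)} = (10^6)^{-1/5}\) gives \(n_q = (10^6)^{(q+4)/10}\). A quick check:

```r
# Sample size n_q at which the q-th order rate n_q^{-2/(q+4)} matches
# the full-model rate (1e6)^{-1/5} for p = 6: n_q = (1e6)^((q + 4) / 10)
n_equiv <- function(q) 1e6^((q + 4) / 10)

n_equiv(1)  # additive model: 1 000
n_equiv(2)  # second-order interactions: about 3 981, i.e. roughly 4 000
```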

Functional decompositions Generalized ANOVA

Generalized ANOVA (Stone 1994) is probably the most widely used identification of the functional decomposition. We consider \(\mathcal X=\mathbb R^{p}\) (taking care not to confuse the number of covariates \(p\) with the density \(p(x)\)).

  • Generalized ANOVA: For every \(S\subseteq \{1,\dots,p\}\)
    and \(k\in S\), \[\int m_{S}\left(x_{S}\right) \left(\int w(x)\, \mathrm dx_{-S}\right) \mathrm d x_{k}=0\]



Common choices for \(w\) are

  • \(w \equiv 1\)
  • \(w(x)=p(x),\ \) \(p(x)=\) density of \(X\).
  • \(w(x)=\prod p_j(x_j),\ \) \(p_j(x_j)=\) density of \(X_j.\)
  • Generalized ANOVA, however, has drawbacks for interpretability
    • \(w=1\) ignores the distribution of \(X\)
    • \(w(x)=p(x)\) uses the correlation structure (analogous to observational SHAP)
    • \(w(x)=\prod p_j(x_j)\) assumes that features are independent



Marginal Identification

Functional decompositions Marginal Identification

Definition [Marginal identification]

Consider a regression function \(m: \mathbb R^p \rightarrow \mathbb R\) with functional decomposition \[ m(x)=\sum_{S\subseteq I_p} m_S(x_S). \] If for every \(S\subseteq I_p\), \[\sum_{T: T \cap S \neq \emptyset} \int m_T(x_T) p_{S}(x_{S}) \mathrm dx_S=0,\] we say that the functional decomposition satisfies the marginal identification.

Theorem (Existence and uniqueness of marginal identification)

Given a function \(m\), there exists exactly one set of components \(\{m_S \mid S \subseteq I_p\}\) that fulfills the marginal identification. The components are given by \[ m_S (x_S)= \sum_{V\subseteq S} (-1)^{|S\setminus V|}\xi_V(x_V), \] where \(\xi_S\) denotes the partial dependence plot of a set \(S\). In particular, \(\xi_\emptyset=\mathbb E[m(X)]\).
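The inclusion-exclusion formula can be verified by Monte Carlo for the model of Toy example II; a sketch with correlated standard normal features (the sampling scheme and \(\rho = 0.3\) are assumptions for illustration):

```r
# Monte Carlo check of m_S(x_S) = sum over V subset of S of
# (-1)^{|S \ V|} xi_V(x_V), for m(x1, x2) = x1 + x2 + 2*x1*x2
set.seed(1)
rho <- 0.3
N <- 1e6
X1 <- rnorm(N)
X2 <- rho * X1 + sqrt(1 - rho^2) * rnorm(N)  # corr(X1, X2) = rho
m <- function(x1, x2) x1 + x2 + 2 * x1 * x2

x1 <- 1; x2 <- -0.7
xi0  <- mean(m(X1, X2))   # xi_emptyset = E[m(X)], approx. 2*rho
xi1  <- mean(m(x1, X2))   # partial dependence of feature 1 at x1
xi2  <- mean(m(X1, x2))   # partial dependence of feature 2 at x2
xi12 <- m(x1, x2)

m12 <- xi12 - xi1 - xi2 + xi0   # approx. 2*x1*x2 + 2*rho = -0.8
```

The interaction component recovered by inclusion-exclusion matches the closed-form value \(2x_1x_2 + 2\,\textrm{corr}(X_1,X_2)\) up to Monte Carlo error.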

Functional decompositions Marginal Identification

Corollary [Marginal identification \(\leftrightarrow\) PDP ]

If \(\hat m_n=\sum_S \hat m_S\) satisfies the marginal identification, then \[\xi_S=\sum_{U \subseteq S} \hat m_U.\] In particular, if \(S\) is only one feature, i.e., \(S=\{k\}\), we have \[\xi_k(x_k)= \hat m_0 + \hat m_k(x_k).\]

Functional decompositions Marginal Identification: Interventional SHAP values

Theorem [Marginal identification \(\leftrightarrow\) Interventional SHAP ]

If \(\hat m_n=\sum_S \hat m_S\) satisfies the marginal identification, then

  • Interventional SHAP values are weighted averages of the components
    • interaction component is equally split between involved features:

\[ \phi_k(x)= \hat m_k(x_k)+ \frac 1 2 \sum_{j\neq k} \hat m_{kj}(x_k,x_j) + \cdots + \frac 1 p \hat m_{1,\dots,p}(x_1,\dots,x_p). \]
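For \(p=2\) the formula reduces to \(\phi_1(x) = \hat m_1(x_1) + \tfrac{1}{2}\hat m_{12}(x_1,x_2)\). A small sketch checking this against the closed-form interventional SHAP value of Toy example II, using the marginal-identification components for standardized features (\(\rho = \textrm{corr}(X_1,X_2) = 0.3\) assumed):

```r
rho <- 0.3  # assumed correlation of the standardized features

# Marginal-identification components of m(x1, x2) = x1 + x2 + 2*x1*x2
m1  <- function(x1)     x1 - 2 * rho
m12 <- function(x1, x2) 2 * x1 * x2 + 2 * rho

# SHAP as weighted average of components (interaction split equally)
phi1 <- function(x1, x2) m1(x1) + m12(x1, x2) / 2

# Closed-form interventional SHAP from Toy example II
phi1_direct <- function(x1, x2) x1 + x1 * x2 - rho

phi1(1, -0.7) - phi1_direct(1, -0.7)  # zero up to rounding
```

The two routes agree for every \((x_1,x_2)\): splitting the interaction component in half reproduces the interventional SHAP value exactly.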

Functional decompositions Partial Dependence Plot: Toy example I (revisited)

We now fit and interpret the data via a functional decomposition with marginal identification.

library(xgboost)
library(glex)
x <- cbind(x1, x2, x3, x4)  # feature matrix
xg <- xgboost(data = x, label = y,
              params = list(max_depth = 4, eta = 0.01, objective = "count:poisson"),
              nrounds = 100, verbose = 0)
res <- glex(xg, x)  # calculates functional decomposition with marginal identification
res$m
                 x1           x2           x4         x1:x2        x1:x4
              <num>        <num>        <num>         <num>        <num>
     1: -0.01153930  0.254763806 -0.003807825  0.0115393039  0.001131369
     2: -0.01153930 -0.070856660  0.003796980 -0.0030981568 -0.001124274
     3: -0.01153930 -0.073284602  0.003796980 -0.0006702144 -0.001124274
     4: -0.01153930 -0.009837431  0.003796980 -0.0020662462 -0.001124274
     5:  0.01153167 -0.069309124  0.003796980  0.0015543309  0.001129098
    ---                                                                 
499996:  0.01153167 -0.009837431  0.003796980  0.0020742189  0.001129098
499997: -0.01153930  0.064109258 -0.003807825  0.0022360787  0.001131369
499998: -0.01153930  0.075669850 -0.003807825  0.0022360787  0.001131369
499999: -0.01153930 -0.075727647  0.003796980 -0.0031179563 -0.001124274
500000:  0.01153167 -0.075727647 -0.003807825  0.0031216284 -0.001129654
               x2:x4      x1:x2:x4
               <num>         <num>
     1:  0.067772348 -1.131369e-03
     2:  0.052166467 -2.518432e-03
     3:  0.049741495 -9.346023e-05
     4: -0.005752768 -4.512167e-04
     5:  0.053718925  9.701917e-04
    ---                           
499996: -0.005752768  4.503037e-04
499997:  0.051318240 -2.826504e-03
499998:  0.062867555 -2.826504e-03
499999:  0.047290680 -2.538208e-03
500000: -0.047238057 -2.530277e-03

Functional decompositions Partial Dependence Plot: Toy example I (revisited)

library(glex)
# Model fitting
library(xgboost)
# Visualization
library(ggplot2)
library(patchwork)
theme_set(theme_minimal(base_size = 13))
vi_xgb <- glex_vi(res)

p_vi <- autoplot(vi_xgb, threshold = 0) + 
  labs(title = NULL, tag = "XGBoost-explanation")

p_vi+
  plot_annotation(title = "Variable importance scores by term") & 
  theme(plot.tag.position = "bottomleft")

Functional decompositions Partial Dependence Plot: Toy example I (revisited)

barplot(c(res$m$x1[which.max(x1 == 0)], res$m$x1[which.max(x1 == 1)]),
        names.arg = c("female", "male"))

barplot(c(res$m$x4[which.max(x4 == 1)], res$m$x4[which.max(x4 == 0)]),
        names.arg = c("sports car: yes", "sports car: no"))

plot(as.numeric(x2[order(x2)]), as.numeric(res$m$x2[order(x2)]),type="l", xlab="x2", ylab="m2")

plot(as.numeric(x2[which(x4==0)][order(x2[which(x4==0)])]),as.numeric(res$m$`x2:x4`[which(x4==0)][order(x2[which(x4==0)])]), type="l" ,ylim=c(-0.12,0.12), lwd=4, xlab="x2", ylab="m24")
lines(as.numeric(x2[which(x4==1)][order(x2[which(x4==1)])]),as.numeric(res$m$`x2:x4`[which(x4==1)][order(x2[which(x4==1)])]), type="l", col="blue", lwd=4)
legend("topright", legend=c("no sports car", "sports car"),
       col=c("black", "blue"), cex=1, lwd=5)

Functional decompositions Interventional SHAP: Toy example II (revisited)

\[\hat m_n(x_1,x_2)=x_1+x_2+2x_1x_2\] If \(X_1, X_2\) each have mean zero and variance one, then

  • Functional decomposition with marginal identification: \[ \begin{eqnarray} \hat m_0&=&2\,\textrm{corr}(X_1,X_2) \\ \hat m_1(x_1)&=& x_1 -2\,\textrm{corr}(X_1,X_2) \\ \hat m_2(x_2)&=&x_2 - 2\,\textrm{corr}(X_1,X_2) \\ \hat m_{12}(x_1,x_2)&=& 2x_1x_2 + 2\,\textrm{corr}(X_1,X_2). \end{eqnarray} \]

  • Partial dependence plot of \(x_1\) is \(\hat m_0 + \hat m_1(x_1)\)

  • Interventional SHAP of \(x_1\) is \(\hat m_1(x_1)+ 0.5 \times \hat m_{12}(x_1,x_2)\)
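As a sanity check, the four components sum back to the fitted model \(x_1+x_2+2x_1x_2\) of Toy example II (a quick sketch, \(\rho = \textrm{corr}(X_1,X_2) = 0.3\) assumed):

```r
rho <- 0.3
mhat <- function(x1, x2) x1 + x2 + 2 * x1 * x2
decomp <- function(x1, x2) {
  (2 * rho) +               # m0
  (x1 - 2 * rho) +          # m1(x1)
  (x2 - 2 * rho) +          # m2(x2)
  (2 * x1 * x2 + 2 * rho)   # m12(x1, x2)
}
decomp(1, -0.7) - mhat(1, -0.7)  # zero up to rounding
```

The correlation terms cancel in the sum, so the decomposition reproduces \(\hat m_n\) exactly while still distributing the model across interpretable components.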

Stone, Charles J. 1994. “The Use of Polynomial Splines and Their Tensor Products in Multivariate Function Estimation.” The Annals of Statistics 22 (1): 118–71.