| Quality | Linear Models | Machine Learning |
|---|---|---|
| interpretable | ✔ | ✘ |
| interactions automatically | ✘ | ✔ |
| variable selection/sparsity | ✔ | ✔ |
| non-linearity | ✘ | ✔ |
Fix a value \(x_0 \in \mathbb R^p\).
A local approximation at \(x_0\) of the function \(\hat m_n\) is given by
\[\begin{align}\label{eq:shap} \hat m_n\left(x_0\right)=\phi_{0}+\sum_{k=1}^{p} \phi_k(x_0), \end{align} \]
where \(\phi_0,\phi_1(x_0),\dots,\phi_p(x_0)\) are constants.
Note: the right-hand side is not identified.
Local explanations add constraints such that \(\phi_k(x_0)\) is uniquely identified and best reflects the local contribution of feature \(k\) to \(\hat m_n\left(x_0\right)\).
Definition [Additive Value functions]
Shapley Axioms
Given a value function \(v_{x_0}\) with \(v_{x_0}(\{1,\dots,p\})=\hat m_n(x_0)\), the four Shapley axioms are
Theorem [Existence and uniqueness of Shapley values (Shapley et al. 1953)]
Given a value function \(v_{x_0}\), there exist unique constants \(\phi_0,\phi_1(x_0),\dots,\phi_p(x_0)\) such that the four Shapley axioms are satisfied. They are given by \[ \begin{align} \phi_{k}(x_0)&=\frac{1}{p !} \sum_{\pi \in \Pi_p} \Delta_{v_{x_0}}\left(k, \{\pi(1),\dots,\pi(\pi^{-1}(k)-1)\}\right)\\ &=\frac 1 {p!}\sum_{S: S \subseteq \{1,\dots,p\} \setminus\{k\}} {|S| !(p -|S|-1) !}\Delta_{v_{x_0}}(k, S), \end{align} \] where \(\Delta_{v_{x_0}}(k, S)=v_{x_0}(S \cup k)-v_{x_0}(S)\) and \(\Pi_p\) is the set of permutations of \(\{1,\dots,p\}\).
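For small \(p\), the subset form of the theorem can be evaluated directly. A minimal sketch in Python (the function name and the value-function interface are ours, not from any particular library):

```python
from itertools import combinations
from math import factorial

def shapley_values(v, p):
    """Exact Shapley values for a value function v over features {0, ..., p-1}.

    v maps a frozenset of feature indices to a real number. Uses the subset
    form: phi_k = sum_S |S|! (p-|S|-1)! / p! * (v(S ∪ {k}) - v(S)).
    """
    phi = [0.0] * p
    for k in range(p):
        others = [j for j in range(p) if j != k]
        for size in range(p):
            for S in combinations(others, size):
                S = frozenset(S)
                weight = factorial(len(S)) * factorial(p - len(S) - 1) / factorial(p)
                phi[k] += weight * (v(S | {k}) - v(S))
    return phi
```

For an additive value function such as \(v(S)=|S|\), each feature receives exactly its own summand, which is a quick sanity check of the implementation.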
Note that we will slightly abuse notation by ignoring ordering in the input of the functions below.
Lundberg and Lee (2017) proposed to use Shapley values for model explanation with the value function \[ v_x(S)=\mathbb E[ \hat m_n(X_S,X_{-S})\mid X_S=x_S] = \int \hat m_n(x_1,\dots,x_p)\, p_{X_{-S}|X_S}(x_{-S}|x_S)\, \mathrm dx_{-S}, \] and called the method SHAP (SHapley Additive exPlanations).
To simplify calculations, Lundberg and Lee (2017) proposed to calculate instead \[ v_x(S)= \mathbb E[\hat m_n(x_S,X_{-S})]= \int \hat m_n(x_1,\dots,x_p)\, p_{X_{-S}}(x_{-S})\, \mathrm dx_{-S}. \]
Janzing, Minorics, and Blöbaum (2020) argue that the latter value function should actually be the preferred one. Chen et al. (2020) coin it interventional SHAP (and the original SHAP above observational SHAP).
On the next slide, we give an example of why interventional SHAP values might be preferred to observational SHAP values.
Example [Observational SHAP vs Interventional SHAP (Janzing, Minorics, and Blöbaum 2020)]
Assume \[\hat m_n(x_1,x_2)=x_1\] Let \(X_1\), \(X_2\) be binary with
\[ p(x_1,x_2)= \begin{cases} \frac 1 2 & \text{if} \ x_1=x_2 \\ 0 & \text{else} \end{cases} \]
For observational SHAP, at a point \(x=(x_1,x_2)\) with \(x_1=x_2\), we have \[ v_x(\emptyset)=\mathbb E[X_1]=\tfrac12,\quad v_x(\{1\})=\mathbb E[X_1\mid X_1=x_1]=x_1,\quad v_x(\{2\})=\mathbb E[X_1\mid X_2=x_2]=x_1,\quad v_x(\{1,2\})=x_1. \]
Hence, \[ \Delta_{v_x}(1,\emptyset)=\Delta_{v_x}(2,\emptyset)=x_1-\tfrac12,\qquad \Delta_{v_x}(1,\{2\})=\Delta_{v_x}(2,\{1\})=0. \]
Such that \[ \phi_1(x)=\phi_2(x)=\tfrac12\left(x_1-\tfrac12\right): \] both features receive the same credit, although \(\hat m_n\) does not depend on \(x_2\) at all.
For interventional SHAP, we have \[ v_x(\emptyset)=\mathbb E[\hat m_n(X_1,X_2)]=\tfrac12,\quad v_x(\{1\})=\mathbb E[\hat m_n(x_1,X_2)]=x_1,\quad v_x(\{2\})=\mathbb E[\hat m_n(X_1,x_2)]=\tfrac12,\quad v_x(\{1,2\})=x_1. \]
Hence, \[ \Delta_{v_x}(1,\emptyset)=\Delta_{v_x}(1,\{2\})=x_1-\tfrac12,\qquad \Delta_{v_x}(2,\emptyset)=\Delta_{v_x}(2,\{1\})=0. \]
Such that \[ \phi_1(x)=x_1-\tfrac12,\qquad \phi_2(x)=0: \] the full credit goes to the only feature the model actually uses.
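The example can be checked numerically. A small sketch (the function names and the choice of explanation point \(x=(1,1)\) are ours):

```python
# model and joint distribution from the example
def m(x1, x2):
    return x1

support = [((0, 0), 0.5), ((1, 1), 0.5)]  # p(x1, x2) = 1/2 if x1 == x2, else 0

x = (1, 1)  # explain the prediction at this point

def v_obs(S):
    # observational: E[m(X) | X_S = x_S]
    mass = [(vals, pr) for vals, pr in support if all(vals[j] == x[j] for j in S)]
    Z = sum(pr for _, pr in mass)
    return sum(pr * m(*vals) for vals, pr in mass) / Z

def v_int(S):
    # interventional: E[m(x_S, X_{-S})], with X_{-S} drawn from its marginal
    total = 0.0
    for vals, pr in support:
        z = tuple(x[j] if j in S else vals[j] for j in range(2))
        total += pr * m(*z)
    return total

def shapley_2(v):
    # exact Shapley values for p = 2 features (0-based feature indices)
    phi1 = 0.5 * ((v({0}) - v(set())) + (v({0, 1}) - v({1})))
    phi2 = 0.5 * ((v({1}) - v(set())) + (v({0, 1}) - v({0})))
    return phi1, phi2

print(shapley_2(v_obs))  # (0.25, 0.25): x2 gets credit via its correlation with x1
print(shapley_2(v_int))  # (0.5, 0.0): all credit goes to x1
```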
Discussion Point:
How do we calculate/estimate Shapley values?
Remember, Shapley values are: \[ \begin{align} \phi_{k}(x_0)&=\frac{1}{p !} \sum_{\pi \in \Pi_p} \Delta_{v_{x_0}}\left(k, \{\pi(1),\dots,\pi(\pi^{-1}(k)-1)\}\right)\\ &=\frac 1 {p!}\sum_{S: S \subseteq \{1,\dots,p\} \setminus\{k\}} {|S| !(p -|S|-1) !}\Delta_{v_{x_0}}(k, S), \end{align} \]
To calculate Shapley values exactly, one would (a) need to know the value function and (b) need to evaluate \(p!\) or (in the second form) \(2^p\) summands. For large \(p\), this may be infeasible.
Definition (Estimate Shapley values via permutation sampling)
The Shapley value \(\phi_k(x_0)\) can be approximated by sampling permutations \(\pi \in \Pi_p\) uniformly at random and averaging the marginal contributions \(\Delta_{v_{x_0}}(k,\cdot)\) of feature \(k\) over the sampled permutations.
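A sketch of this Monte Carlo estimator (function and parameter names are ours):

```python
import random

def shapley_permutation_estimate(v, p, k, n_samples=1000, seed=0):
    """Monte Carlo estimate of phi_k: average the marginal contribution of
    feature k over uniformly sampled permutations of {0, ..., p-1}."""
    rng = random.Random(seed)
    features = list(range(p))
    total = 0.0
    for _ in range(n_samples):
        perm = features[:]
        rng.shuffle(perm)
        pos = perm.index(k)
        S = frozenset(perm[:pos])  # predecessors of k in this permutation
        total += v(S | {k}) - v(S)
    return total / n_samples
```

Each sampled permutation requires only two evaluations of the value function, so the cost scales with the number of samples rather than with \(p!\).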
Alternatively, Shapley values can be characterized as the solution of a weighted least squares problem: \[ (\phi_k(x_0))_{k=0,\dots,p} = \arg\min_{(\phi_k)_{k=0,\dots,p}} \sum_{S:\, \emptyset \subsetneq S \subsetneq \{1,\dots,p\}} \mu(S) \left(v_{x_0}(S)- \left [\phi_0 + \sum_{k \in S} \phi_k\right]\right)^2, \] \[ \mu(S) =\frac{p-1} {\binom{p}{|S|}\, |S|\, (p-|S|)}, \] where \(\mu(S)\) would be infinite for \(S=\emptyset\) and \(S=\{1,\dots,p\}\); these two subsets are instead handled as constraints (\(\phi_0=v_{x_0}(\emptyset)\) and efficiency).
In practice, this is often solved by reformulating the minimization as a least squares linear regression problem: \[ \begin{aligned} \operatorname{argmin}_{\phi \in \mathbb R^{p}} \sum_{i=1}^m\left((\nu(S_i) - \nu(\emptyset) )-M_i\phi \right)^2, \end{aligned} \] subject to \(\sum_{k=1}^p \phi_k = \nu(I_p) - \nu(\emptyset)\), where \(S_1,\dots,S_m\) are the (possibly sampled) subsets entering the sum and \(M_i\) is the \(i\)th row of the \(m \times p\) matrix \(M\) that binary-encodes \(S_i\): \(M_{ij}= \mathbb 1(j\in S_i)\), \(j=1,\dots,p\).
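For small \(p\), this constrained regression can be solved exactly by enumerating all proper non-empty subsets. A sketch (the function name and the elimination of the efficiency constraint by substitution are our choices):

```python
from itertools import combinations
from math import comb

import numpy as np

def kernel_shap_exact(v, p):
    """Recover Shapley values from the weighted least squares problem over
    all proper non-empty subsets, with phi_0 = v(empty set) and the
    efficiency constraint eliminated by substitution."""
    subsets = [S for size in range(1, p) for S in combinations(range(p), size)]
    M = np.array([[1.0 if j in S else 0.0 for j in range(p)] for S in subsets])
    w = np.array([(p - 1) / (comb(p, len(S)) * len(S) * (p - len(S)))
                  for S in subsets])
    y = np.array([v(frozenset(S)) for S in subsets]) - v(frozenset())
    total = v(frozenset(range(p))) - v(frozenset())  # = sum of phi_1..phi_p
    # substitute phi_p = total - (phi_1 + ... + phi_{p-1}) into the objective
    A = M[:, :-1] - M[:, -1:]
    b = y - M[:, -1] * total
    sw = np.sqrt(w)[:, None]
    theta, *_ = np.linalg.lstsq(sw * A, sw[:, 0] * b, rcond=None)
    return np.append(theta, total - theta.sum())
```

For an additive value function the residuals vanish at the true Shapley values, so the regression recovers them exactly.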
Lundberg and Lee (2017) propose a fast method to estimate interventional SHAP that is specific to tree-based algorithms. The algorithm, called TreeSHAP, makes use of the binary tree structure, which induces a partitioning of the feature space.
Unfortunately, TreeSHAP does not correspond to an empirical version of interventional SHAP. In fact, two trees representing the same estimator can have different TreeSHAP values if their paths differ. In other words, TreeSHAP is path-specific, which is an undesirable property.
An alternative to standard TreeSHAP is TreeSHAP with background data. The price to pay is a much slower algorithm.
It calculates interventional SHAP exactly, in the sense that the estimated value function equals the empirical version in which the expectation is replaced by the empirical distribution of the background data \((\tilde X_j)_{j=1,\dots,J}\): \[ \frac{1}{J} \sum_{j=1}^J \hat{m}_n\left(x_S, \tilde X_{j,-S}\right). \]
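This empirical value function can be evaluated by brute force for any fitted model, not only trees; TreeSHAP with background data computes the same quantity while exploiting the tree structure. A minimal sketch (function and argument names are ours):

```python
import numpy as np

def v_hat(m_hat, x, background, S):
    """Empirical interventional value function: average the fitted model over
    the background sample, with the features in S fixed to their values in x."""
    Z = np.array(background, dtype=float)  # copy of the background data
    for j in S:
        Z[:, j] = x[j]                     # intervene on the features in S
    return float(np.mean([m_hat(row) for row in Z]))
```

Plugging this \(v\) into any exact or sampled Shapley computation yields the interventional SHAP values; the cost is one model evaluation per background row and per subset, which is why the tree-specific algorithm matters in practice.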