Lecture 8: Local explanations

Munir Hiabu

Interpretability

  • In this lecture we shift the perspective and we wish to understand what a predictor \(\hat m_n(x)\) is actually doing.

Interpretability

Interpretability in this lecture:

  • Regression: \(Y_i=m(X_i)+\varepsilon_i\)
    • \(X\qquad \rightarrow \qquad \widehat m (X) \qquad \rightarrow \qquad m (X)\) or \(Y|X\)
  • For now interpretability is understanding the relationship between \(X\) and \(\widehat m (X)\)
    • This is different to understanding the relationship between \(X\) and \(Y\).
| Quality                     | Linear Models | Machine Learning |
|-----------------------------|---------------|------------------|
| interpretable               | ✓             | ✗                |
| interactions automatically  | ✗             | ✓                |
| variable selection/sparsity | ✗             | ✓                |
| non-linearity               | ✗             | ✓                |

  • Current machine learning algorithms are often highly flexible and can deal with interactions, non-linearity, sparsity and variable selection; but they are not interpretable.

Interpretability Local explanations

  • Fix a value \(x_0 \in \mathbb R^p\).

  • A local approximation at \(x_0\) of the function \(\hat m_n\) is given by

\[\begin{align}\label{eq:shap} \hat m_n\left(x_0\right)=\phi_{0}+\sum_{k=1}^{p} \phi_k(x_0), \end{align} \]

  • For fixed \(x_0\), \(\phi_0,\phi_1(x_0),\dots,\phi_p(x_0)\) are constants.

  • Note: the right-hand side is not identified.

  • Local explanations add constraints such that \(\phi_k(x_0)\) is uniquely identified and best reflects the local contribution of feature \(k\) to \(\hat m_n\left(x_0\right)\).

Interpretability Local explanations: Shapley values

Definition [Additive Value functions]

  • A value function \(v\) assigns a real value \(v(S)\) to each subset \(S \subseteq \{1,\dots p\}\).

Shapley Axioms

Given a value function \(v_{x_0}\) with \(v_{x_0}(\{1,\dots,p\})=\hat m_n(x_0)\), the four Shapley axioms are

  1. Efficiency: \(\hat m_n\left(x_0\right)=\phi_{0}+\sum_{k=1}^{p} \phi_k(x_0)\), with \(\phi_{0}=v_{x_0}(\emptyset)\).
  2. Symmetry: Fix any \(k,l \in \{1,\dots,p\}, k\neq l\). If \(v_{x_0}(S\cup k)=v_{x_0}(S\cup l)\) for all \(S \subseteq \{1,\dots p\}\setminus \{k,l\}\), then \(\phi_k(x_0)=\phi_l(x_0).\)
  3. Dummy: For \(k=1,\dots,p\): If \(v_{x_0}(S\cup k)=v_{x_0}(S)\) for all \(S \subseteq \{1,\dots p\}\setminus \{k\}\), then \(\phi_k(x_0)=0.\)
  4. Linearity: For \(k=1,\dots,p\): If \(\hat m_n(x_0)=m^1(x_0)+m^2(x_0)\), then \(\phi_k(x_0)=\phi^1_k(x_0)+\phi^2_k(x_0)\), where \(\phi^l\) is the explanation corresponding to the function \(m^l\).

Interpretability Local explanations: Shapley values

Theorem [Existence and uniqueness of Shapley values (Shapley 1953)]

Given a value function \(v_{x_0}\), there exist unique constants \(\phi_0,\phi_1(x_0),\dots,\phi_p(x_0)\) such that the four Shapley axioms are satisfied. They are given by \[ \begin{align} \phi_{k}(x_0)&=\frac{1}{p !} \sum_{\pi \in \Pi_p} \Delta_{v_{x_0}}\left(k, \{\pi(1),\dots,\pi(\pi^{-1}(k)-1)\}\right)\\ &=\frac 1 {p!}\sum_{S: S \subseteq \{1,\dots,p\} \setminus\{k\}} {|S| !(p -|S|-1) !}\Delta_{v_{x_0}}(k, S), \end{align} \] where \(\Delta_{v_{x_0}}(k, S)=v_{x_0}(S \cup k)-v_{x_0}(S)\) and \(\Pi_p\) is the set of permutations of \(\{1,\dots,p\}\).
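For small \(p\), the subset formula can be evaluated directly. The following self-contained sketch (function names are illustrative, not from any library) computes exact Shapley values for a toy value function:

```python
from itertools import combinations
from math import factorial

def shapley_values(v, p):
    """Exact Shapley values via the subset formula:
    phi_k = (1/p!) * sum_S |S|! (p-|S|-1)! * (v(S ∪ {k}) - v(S))."""
    players = range(1, p + 1)
    phi = {}
    for k in players:
        others = [j for j in players if j != k]
        total = 0.0
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                S = frozenset(S)
                weight = factorial(len(S)) * factorial(p - len(S) - 1)
                total += weight * (v(S | {k}) - v(S))
        phi[k] = total / factorial(p)
    return phi

# toy value function on p = 2 players: additive worths plus a synergy term
def v(S):
    return sum(S) + (1.0 if S == frozenset({1, 2}) else 0.0)

phi = shapley_values(v, 2)
print(phi)                       # {1: 1.5, 2: 2.5}
# efficiency axiom: the contributions sum to v(full set) - v(empty set)
assert abs(sum(phi.values()) - (v(frozenset({1, 2})) - v(frozenset()))) < 1e-12
```

The double loop over all subsets makes the \(2^p\) cost mentioned below explicit.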

Interpretability Local explanations: SHAP values

  • Note that we will slightly abuse notation by ignoring ordering in the input of the functions below.

  • Lundberg and Lee (2017) proposed to use Shapley values with value function \[ v_x(S)=\mathbb E[ \hat m_n(X_S,X_{-S})| X_S=x_S] = \int \hat m_n(x_1,\dots,x_p) p_{X_{-S}|X_S}(x_{-S}|x_S) \mathrm dx_{-S} \] for model explanation, and called it SHAP (SHapley Additive exPlanations).

  • To simplify calculations, Lundberg and Lee (2017) proposed to calculate \[ v_x(S)= \mathbb E[\hat m_n(x_S,X_{-S})]= \int \hat m_n(x_1,\dots,x_p) p_{X_{-S}}(x_{-S}) \mathrm dx_{-S} \]
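As a quick sanity check on this value function (a standard calculation, not from the slides): for a fitted linear model \(\hat m_n(x)=\beta_0+\sum_{k=1}^p \beta_k x_k\), the interventional value function is

\[ v_x(S)=\mathbb E[\hat m_n(x_S,X_{-S})]=\beta_0+\sum_{j\in S}\beta_j x_j+\sum_{j\notin S}\beta_j \mathbb E[X_j], \]

so \(\Delta_{v_x}(k,S)=\beta_k\left(x_k-\mathbb E[X_k]\right)\) for every \(S\), and hence \(\phi_k(x)=\beta_k\left(x_k-\mathbb E[X_k]\right)\), regardless of the dependence between the features.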

  • Janzing, Minorics, and Blöbaum (2020) argue that the latter value function should actually be the preferred one. Chen et al. (2020) coined it interventional SHAP (and the original SHAP above observational SHAP).

  • In the next slide, we give an example of why interventional SHAP values might be preferred over observational SHAP values.

Interpretability Local explanations: SHAP values

Example [Observational SHAP vs Interventional SHAP(Janzing, Minorics, and Blöbaum 2020)]

Assume \[\hat m_n(x_1,x_2)=x_1.\] Let \(X_1\), \(X_2\) be binary with

\[ p(x_1,x_2)= \begin{cases} \frac 1 2 & \text{if} \ x_1=x_2 \\ 0 & \text{else} \end{cases} \]

For observational SHAP, we have

  • \(v_x (\emptyset)=\mathbb E[\hat m_n(X_1,X_2) ]=0.5\)
  • \(v_x (\{1\})=\mathbb E[\hat m_n(X_1,X_2) |X_1=x_1 ]=x_1\)
  • \(v_x (\{2\})=\mathbb E[\hat m_n(X_1,X_2) |X_2=x_2 ]=x_2\)
  • \(v_x (\{1,2\})=\hat m_n(x_1,x_2) =x_1\)

Hence,

  • \(\Delta(2,\emptyset)= v_x (\{2\})-v_x (\emptyset)=x_1-0.5\)
  • \(\Delta(2,\{1\})= v_x (\{1,2\})-v_x (\{1\})=x_1-x_1=0\)

Such that

  • \(\phi_2= \frac 1 2 (x_1-0.5)\neq 0\)


For interventional SHAP, we have

  • \(v_x (\emptyset)=\mathbb E[\hat m_n(X_1,X_2) ]=0.5\)
  • \(v_x (\{1\})=\mathbb E[\hat m_n(x_1,X_2) ]=x_1\)
  • \(v_x (\{2\})=\mathbb E[\hat m_n(X_1,x_2) ]=0.5\)
  • \(v_x (\{1,2\})=\hat m_n(x_1,x_2) =x_1\)

Hence,

  • \(\Delta(2,\emptyset)= v_x (\{2\})-v_x (\emptyset)=0\)
  • \(\Delta(2,\{1\})= v_x (\{1,2\})-v_x (\{1\})=0\)

Such that

  • \(\phi_2= 0\)
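The two calculations above can be verified numerically. A minimal sketch (0-based feature indices, so feature 2 is index 1; all names are illustrative):

```python
# Distribution: (X1, X2) is (0,0) or (1,1), each with probability 1/2.
support = [((0, 0), 0.5), ((1, 1), 0.5)]
m = lambda x1, x2: x1              # the fitted model ignores x2

x = (1, 1)                         # explain the prediction at this point

def v_obs(S):
    """Observational value: E[m(X) | X_S = x_S]."""
    mass = [(pt, pr) for pt, pr in support if all(pt[j] == x[j] for j in S)]
    tot = sum(pr for _, pr in mass)
    return sum(pr * m(*pt) for pt, pr in mass) / tot

def v_int(S):
    """Interventional value: E[m(x_S, X_{-S})]."""
    return sum(pr * m(*(x[j] if j in S else pt[j] for j in (0, 1)))
               for pt, pr in support)

def phi_2(v):
    # p = 2, so phi_2 = ((v({2}) - v(∅)) + (v({1,2}) - v({1}))) / 2
    return ((v({1}) - v(set())) + (v({0, 1}) - v({0}))) / 2

print(phi_2(v_obs))   # 0.25  (= (x1 - 0.5)/2 at x1 = 1)
print(phi_2(v_int))   # 0.0
```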

Interpretability Local explanations: SHAP values

  • We conclude that if one uses observational SHAP for model explanation, then the contribution of a feature is a combination of its own contribution and the contribution of features it is correlated with.

Discussion Point:

  • If \(\mathbb P(X_1=X_2)=1\), then \(\tilde m_n(x)=x_2\) is almost surely as good at estimating the response as \(\hat m_n(x)=x_1\).
  • In particular the Bayes rule is not unique.
  • In that regard, it could make sense to attribute the same importance to \(x_1\) and \(x_2\).
  • An argument against assigning \(x_2\) any importance is that if \(\phi\) is to explain the behaviour of \(\hat m_n(x)=x_1\), then it is not directly affected by \(x_2\).
  • We will later see that observational SHAP has one further issue: One may need to extrapolate in order to calculate it.

Interpretability Local explanations: SHAP values: Estimation

  • How do we calculate/estimate Shapley values?

  • Remember, Shapley values are: \[ \begin{align} \phi_{k}(x_0)&=\frac{1}{p !} \sum_{\pi \in \Pi_p} \Delta_{v_{x_0}}\left(k, \{\pi(1),\dots,\pi(\pi^{-1}(k)-1)\}\right)\\ &=\frac 1 {p!}\sum_{S: S \subseteq \{1,\dots,p\} \setminus\{k\}} {|S| !(p -|S|-1) !}\Delta_{v_{x_0}}(k, S), \end{align} \]

  • To calculate Shapley values exactly, one would (a) need to know the value function and (b) need to evaluate \(p!\) (first equation) or \(2^p\) (second equation) summands. For large \(p\) this may be infeasible.

Interpretability Local explanations: SHAP values: Estimation

Definition (Estimate Shapley values via permutation sampling)

The Shapley value \(\phi_k\) can be approximated by the following algorithm:

  • For \(r=1,\dots,R\):
    • Sample a permutation \(\pi\) of \(\{1,\dots,p\}\)
    • Approximate \(\Delta_{v_{x_0}}\left(k, \{\pi(1),\dots,\pi(\pi^{-1}(k)-1)\}\right)\) (call the result \(\Delta^{(r)}\))
      • I.e. approximate \(v(\{\pi(1),\dots,\pi(\pi^{-1}(k)-1)\}\cup \{k\})\) and \(v(\{\pi(1),\dots,\pi(\pi^{-1}(k)-1)\})\)
        • \(\Delta^{(r)}\) is the difference
      • If \(v(S)=\mathbb E[ \hat m_n(X_S,X_{-S})| X_S=x_S]\) (observational SHAP):
        • It is not clear what to do. One would need to estimate the density \(p_{X_{-S}|X_S}\) first,
          • which is a high-dimensional problem…
      • If \(v(S)=\mathbb E[ \hat m_n(x_S,X_{-S})]\) (interventional SHAP):
        • Draw \(m\) individuals from \(\{1,\dots,n\}\), say \(I\subseteq \{1,\dots,n\}\) with \(|I|=m\),
        • \(v(S)\approx \frac 1 m \sum_{i \in I} \hat m_n(x_S,X_{i,-S})\)
  • \(\hat \phi_k\) is the average of all \(\Delta^{(r)}\), \(r=1,\dots,R\)
  • Problem: many samples are needed for an accurate approximation. Hence, the method is slow.
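The interventional branch of the algorithm can be sketched in a few lines (a simplified version drawing one background individual per permutation; all names are illustrative):

```python
import random

def shap_permutation(model, x, X_bg, k, R=2000, seed=0):
    """Monte-Carlo estimate of phi_k for interventional SHAP:
    average Delta(k, {features preceding k}) over random permutations."""
    rng = random.Random(seed)
    p, n = len(x), len(X_bg)
    total = 0.0
    for _ in range(R):
        pi = rng.sample(range(p), p)             # random permutation of features
        S = set(pi[:pi.index(k)])                # features preceding k
        z = X_bg[rng.randrange(n)]               # one background individual
        with_k = [x[j] if (j in S or j == k) else z[j] for j in range(p)]
        without = [x[j] if j in S else z[j] for j in range(p)]
        total += model(with_k) - model(without)  # Delta^(r)
    return total / R

# sanity check on an additive model, where phi_k(x) = beta_k * (x_k - E[X_k])
model = lambda z: z[0] + 2 * z[1]
bg = [(0.0, 0.0), (1.0, 1.0)]                    # background data, mean (0.5, 0.5)
print(shap_permutation(model, (1.0, 1.0), bg, k=1))   # ≈ 2 * (1 - 0.5) = 1.0
```

The slow convergence mentioned above is visible here: the estimate fluctuates at rate \(1/\sqrt{R}\).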

Interpretability Local explanations: SHAP values: Estimation: Kernel SHAP

  • Charnes et al. (1988) showed that Shapley values can be re-written as the solution of the following constrained minimisation problem:

\[ (\phi_k(x_0))_{k=0,\dots,p} = \arg\min_{(\phi_k)_{k=0,\dots,p}} \sum_{S: S\subseteq\{1,\dots,p\}} \mu(S) \left(v_{x_0}(S)- \left [\phi_0 + \sum_{k \in S} \phi_k\right]\right)^2, \] \[ \mu(S) =\frac{p-1} {\binom{p}{|S|} |S| (p-|S|)}. \]

  • Note that \(\mu(\emptyset)=\mu(\{1,\dots,p\})=\infty\) and so the minimisation is not well defined.
    • The infinite weight practically enforces \(\phi_0=v_{x_0}(\emptyset)\), \(\sum_{k=0}^p \phi_k=v_{x_0}(\{1,\dots,p\})\)
    • One can hence exclude the two sets \(\emptyset\) and \(\{1,\dots,p\}\) from the minimisation and add the two constraints \(\phi_0=v_{x_0}(\emptyset)\) and \(\sum_{k=0}^p \phi_k =v_{x_0}(\{1,\dots,p\})\) to the minimisation.
  • This is a quadratic programming problem (good), but estimating \(v_{x_0}(S)\) for all \(2^p\) subsets may be computationally infeasible.

Interpretability Local explanations: SHAP values: Estimation: Kernel SHAP

  • Kernel SHAP was introduced in Lundberg and Lee (2017); Covert and Lee (2021) give a detailed explanation of how it is implemented.
    • Instead of estimating \(v_{x_0}(S)\) for all \(2^p\) subsets they only evaluate \(m\) subsets.
    • The \(m\) subsets are drawn from the set of all subsets of \(\{1,\dots,p\}\) minus the empty set and the full set, where the probability of drawing a set \(S\) is proportional to \(\mu(S)\).
  • Hence the final minimisation reads \[\begin{align} \min_{(\phi_k)_{k=1,\dots,p}} \frac 1 m \sum_{i=1}^m \left(v_{x_0}(S_i)- \left [v_{x_0}(\emptyset)+ \sum_{k \in S_i} \phi_k\right]\right)^2, \\ \text{subject to} \sum_{k=1}^p \phi_k =v_{x_0}(\{1,\dots,p\})-v_{x_0}(\emptyset), \end{align} \] where \(S_i\) is the subset from draw \(i\).

Interpretability Local explanations: SHAP values: Estimation: Kernel SHAP

In practice, this is often solved by reformulating the minimisation as a least-squares linear regression problem: \[ \begin{aligned} \textrm{argmin}_{\phi \in \mathbb R^{p}} \sum_{i=1}^m\left((v_{x_0}(S_i) - v_{x_0}(\emptyset) )-M_i\phi \right)^2, \end{aligned} \] subject to \(\sum_{k=1}^p \phi_k = v_{x_0}(\{1,\dots,p\}) - v_{x_0}(\emptyset)\), where \(M_i\) is the \(i\)th row of the matrix \(M\) with \(m\) rows and \(p\) columns that binary-encodes \(S_i\): \(M_{ij}= \mathbb 1(j\in S_i)\), \(j=1,\dots,p\).
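For \(p=2\) the constrained least-squares step can be solved by hand, which makes the mechanics transparent. A self-contained sketch (illustrative names; for \(p=2\) the only proper non-empty subsets are \(\{1\}\) and \(\{2\}\), which carry equal weight \(\mu\), so no sampling or reweighting is needed):

```python
def kernel_shap_p2(v):
    """Kernel SHAP least-squares step for p = 2: eliminate the efficiency
    constraint via phi_2 = c - phi_1 and solve the remaining one-dimensional
    least-squares problem in closed form."""
    c = v[frozenset({1, 2})] - v[frozenset()]      # total to distribute
    rows = [(frozenset({1}), (1, 0)), (frozenset({2}), (0, 1))]
    num = den = 0.0
    for S, (m1, m2) in rows:
        a = m1 - m2                                # coefficient of phi_1
        b = v[S] - v[frozenset()] - m2 * c         # target after substitution
        num += a * b
        den += a * a
    phi1 = num / den
    return phi1, c - phi1

# toy game: v(∅)=0, v({1})=1, v({2})=2, v({1,2})=4
vals = {frozenset(): 0.0, frozenset({1}): 1.0,
        frozenset({2}): 2.0, frozenset({1, 2}): 4.0}
print(kernel_shap_p2(vals))   # (1.5, 2.5) — matches the exact Shapley values
```

For \(p=2\) the least-squares solution coincides with the exact formula \(\phi_1=\tfrac12[(v(\{1\})-v(\emptyset))+(v(\{1,2\})-v(\{2\}))]\).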

  • The value function used in Kernel SHAP is interventional SHAP, i.e., \(v_{x_0}(S)=\mathbb E[\hat m_n(x_S,X_{-S})]\).

Interpretability Local explanations: SHAP values: TreeSHAP

  • Lundberg and Lee (2017) propose a method to estimate interventional SHAP that is fast.

  • It is specific for tree based algorithms.

  • The algorithm is called TreeSHAP and it makes use of the binary tree structure that leads to a partitioning of the feature space.

    • For calculating \(\mathbb E[\hat m_n(x_S,X_{-S})]\), TreeSHAP recursively follows the decision path for \(x_0\) if the split feature is in \(S\), and takes the weighted average of both branches if the split feature is not in \(S\).
    • For efficient calculation, TreeSHAP does not go down the tree for every \(S\) separately but only once and while doing so keeps track of all possible \(S\).
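The recursion in the first bullet can be sketched directly (a naive, per-subset version; the actual TreeSHAP implementation instead tracks all subsets in a single pass, as the second bullet notes; class and function names are illustrative):

```python
class Node:
    """Binary tree node: a leaf carries a prediction `value`; an internal node
    splits on feature `feat` at threshold `thr`; `n` is the number of training
    individuals in the node (the r_B weights)."""
    def __init__(self, feat=None, thr=None, left=None, right=None,
                 value=None, n=1):
        self.feat, self.thr = feat, thr
        self.left, self.right = left, right
        self.value, self.n = value, n

def expect(node, x, S):
    """Approximate E[m(x_S, X_{-S})]: follow x's branch when the split feature
    is in S, otherwise average both branches weighted by node counts."""
    if node.value is not None:                     # leaf
        return node.value
    if node.feat in S:
        child = node.left if x[node.feat] <= node.thr else node.right
        return expect(child, x, S)
    w = node.left.n / (node.left.n + node.right.n)
    return w * expect(node.left, x, S) + (1 - w) * expect(node.right, x, S)

# toy stump splitting on feature 0, with 3 training points left and 1 right
tree = Node(feat=0, thr=0.5,
            left=Node(value=10.0, n=3), right=Node(value=20.0, n=1))
print(expect(tree, (0.9,), {0}))     # 20.0: split feature in S, follow x
print(expect(tree, (0.9,), set()))   # 12.5 = 0.75 * 10 + 0.25 * 20
```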

Interpretability Local explanations: SHAP values: TreeSHAP

  • \(r_B\) is the number of individuals in node \(B\)

Interpretability Local explanations: SHAP values: TreeSHAP

  • Unfortunately, TreeSHAP does not correspond to an empirical version of interventional SHAP.

  • In fact, two different trees representing the same estimator can have different TreeSHAP values if their paths differ. In other words, TreeSHAP is path-specific, which is an undesirable property.

Interpretability Local explanations: SHAP values: TreeSHAP with background data

  • An alternative to the standard TreeSHAP is TreeSHAP with background data.

  • The price to pay is a much slower algorithm.

  • It calculates interventional SHAP exactly in the sense that the estimated value function equals the empirical version, where the expected value is replaced by the empirical average over the background data \((\tilde X_j)_{j=1,\dots,J}\): \[ v_{x_0}(S) \approx \frac{1}{J} \sum_{j=1}^J \hat{m}_n\left(x_S, \tilde X_{j,-S}\right) \]
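This empirical value function is straightforward to write down (a minimal sketch with illustrative names, not the optimized tree-traversal implementation):

```python
def v_background(model, x, S, background):
    """Empirical interventional value function: features in S fixed to x,
    the remaining features taken from each background individual in turn."""
    p = len(x)
    preds = [model([x[j] if j in S else bx[j] for j in range(p)])
             for bx in background]
    return sum(preds) / len(preds)         # average over J background points

model = lambda z: z[0] + 2 * z[1]
bg = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]  # background data, J = 3
print(v_background(model, (1.0, 1.0), {0}, bg))   # 3.0 = mean of 1, 3, 5
```

The slowness is visible here: each value-function evaluation costs \(J\) model calls, for every subset \(S\) needed.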


References

Charnes, A, B Golany, M Keane, and J Rousseau. 1988. “Extremal Principle Solutions of Games in Characteristic Function Form: Core, Chebychev and Shapley Value Generalizations.” Econometrics of Planning and Efficiency, 123–33.
Chen, Hugh, Joseph D Janizek, Scott Lundberg, and Su-In Lee. 2020. “True to the Model or True to the Data?” arXiv Preprint arXiv:2006.16234.
Covert, Ian, and Su-In Lee. 2021. “Improving KernelSHAP: Practical Shapley Value Estimation Using Linear Regression.” In International Conference on Artificial Intelligence and Statistics, 3457–65. PMLR.
Janzing, Dominik, Lenon Minorics, and Patrick Blöbaum. 2020. “Feature Relevance Quantification in Explainable AI: A Causal Problem.” In International Conference on Artificial Intelligence and Statistics, 2907–16. PMLR.
Lundberg, Scott M, and Su-In Lee. 2017. “A Unified Approach to Interpreting Model Predictions.” Advances in Neural Information Processing Systems 30.
Shapley, Lloyd S. 1953. “A Value for n-Person Games.” In Contributions to the Theory of Games, Volume II, 307–17. Princeton University Press.