Interpretable Machine Learning: LIME + ELI5 + SHAP + InterpretML Python Code

In my previous article I discussed the theory behind interpretable machine learning frameworks: their role and scope in machine learning, why they matter, their advantages and disadvantages, and how they explain to a human which features contribute positively or negatively to a prediction. As an ML practitioner, you should know what your model predicts and why. Interpretable machine learning has great scope in the healthcare and finance domains, because models there have a direct impact on people, so your model's decisions should be transparent and easy to explain to any non-technical person.

In this article we will get hands-on with these interpretability frameworks in Python. I have selected the Heart Disease UCI dataset for the examples.

First, install the required packages for LIME, SHAP, ELI5, and Microsoft's InterpretML framework.

pip install lime

pip install shap

pip install eli5

pip install interpret

Let's import the necessary libraries and packages to begin in a Jupyter notebook, load the dataset, and check whether any columns contain null values using the missingno package, which gives a nice bar chart of missing values per column.

import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
%matplotlib inline
import warnings; warnings.simplefilter('ignore')

# Load the dataset
df = pd.read_csv('heart.csv', index_col=False, low_memory=False)
df.head()

# Bar chart of non-null counts per column (quick missing-value check)
msno.bar(df)

Now, using groupby, we will find how many positive vs. negative cases are in the dataset.
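For example, a minimal sketch of that check (either call below gives the same counts):

# Count rows per target class (in this dataset 1 = disease present, 0 = absent)
df.groupby('target').size()

# Equivalent one-liner
df['target'].value_counts()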

Next we separate the input features as X and the target as Y, then split the data into training and test sets. If the split outputs come back as NumPy arrays, we convert them back to pandas DataFrames so that the column names are available to the explainers (if they are already DataFrames, this is a no-op).

X=df.drop('target',axis=1)
Y=df['target']

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, shuffle=True, random_state=2658)

# Converting back to DataFrame (keeps column names for the explainers)
x_test = pd.DataFrame(x_test, columns=X.columns)
x_train = pd.DataFrame(x_train, columns=X.columns)

It's time to train our model on x_train. I am using a Logistic Regression model for this tutorial; you can choose any classification model, but keep in mind that not every interpretability framework works equally well with every model type. Let's also get the confusion matrix and the precision and recall scores.

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression 

logreg = LogisticRegression(random_state = 2658)
logreg.fit(x_train, y_train) 
y_pred = logreg.predict(x_test)
y_pred_prob = logreg.predict_proba(x_test)

print(confusion_matrix(y_test, y_pred))
print('Accuracy:',accuracy_score(y_test, y_pred))
print('Precision:',precision_score(y_test, y_pred))
print('Recall:',recall_score(y_test, y_pred))

Next, make a table showing the model's predicted probabilities and predicted class against the actual target, and pick one row position each for a true positive, true negative, false positive, and false negative to study the interpretability output. We can see from the table that row 0 is a true positive case, where the model predicts 1 and the actual target is also 1; the other cases are chosen the same way.
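One way to build that table, as a minimal sketch using the variables defined above (the results DataFrame and its column names are my own naming):

# Predicted probabilities, predicted class and actual target side by side
results = pd.DataFrame({'prob_class_0': y_pred_prob[:, 0],
                        'prob_class_1': y_pred_prob[:, 1],
                        'predicted': y_pred,
                        'actual': y_test.values})
results.head(10)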

# Row positions in x_test chosen for each case
tp = 0   # true positive
fp = 8   # false positive
tn = 5   # true negative
fn = 6   # false negative

The interpretability frameworks need the feature names and the target class names to produce easy-to-read visualizations.

class_names = ['No', 'Yes']
feature_names = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']

LIME Python Code

LIME is a powerful model explainer with the capability to explain any linear or complex model. It can be used to explain tabular, text, and image data very efficiently.

We need to define an explainer. Since we are working with tabular data, LIME provides LimeTabularExplainer, which requires the training data, feature names, and class names. Once the explainer is defined, we pass it the data point we want explained, our model's prediction function, and the number of features to include in the explanation.

#Import LIME 
import lime
import lime.lime_tabular

explainer = lime.lime_tabular.LimeTabularExplainer(x_train.values,feature_names=feature_names,class_names=class_names,discretize_continuous=True)

exp_tpc1 = explainer.explain_instance(x_test.iloc[tp].values, logreg.predict_proba, num_features=6, top_labels=1)

# Get the explanation as a list of (feature, weight) pairs
exp_tpc1.as_list()

# Show the output in the Jupyter notebook
print("Actual: ",y_test.values[tp],"Predicted :",y_pred[tp])
print(y_pred_prob[tp])
exp_tpc1.show_in_notebook(show_table=True,show_all=False)
[('ca <= 0.00', 0.2214574317700435),
 ('sex <= 0.00', 0.21638689024802996),
 ('1.00 < cp <= 2.00', 0.19886337550634117),
 ('oldpeak <= 0.00', 0.19192152314869035),
 ('thal <= 2.00', 0.1487189254943567),
 ('1.00 < slope <= 2.00', 0.14801854748980278)]

Next, let's take the true negative case and see how LIME explains the feature contributions.

exp_tnc1 = explainer.explain_instance(x_test.iloc[tn].values, logreg.predict_proba, num_features=6, top_labels=1)
print("Actual: ",y_test.values[tn],"Predicted :",y_pred[tn])
print(y_pred_prob[tn])
exp_tnc1.show_in_notebook(show_table=True,show_all=False)

ELI5 Python Code

As we saw in the last article, ELI5 is also a good explainer, capable of showing which features contribute to a model's prediction. It gives a nice tabular output where we can see which features affect the prediction positively or negatively.

import eli5
print("Actual: ",y_test.values[tp],"Predicted :",y_pred[tp])
eli5.show_prediction(logreg,x_test.iloc[tp],feature_names=feature_names,target_names=class_names)
print("Actual: ",y_test.values[tn],"Predicted :",y_pred[tn])
eli5.show_prediction(logreg,x_test.iloc[tn],feature_names=feature_names,target_names=class_names) 
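ELI5 can also display the model's global weights (one coefficient per feature for logistic regression). This extra call is my addition to the original walkthrough:

# Global view: which features push predictions towards 'Yes' or 'No' overall
eli5.show_weights(logreg, feature_names=feature_names, target_names=class_names)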

SHAP

SHAP was built to overcome LIME's limitations; it uses a surrogate model together with Shapley values. The Shapley value fairly distributes the prediction among the features: it is the average marginal contribution of each player (feature) over all permutations of the players (a formula is sketched after the list below). SHAP provides four types of explainer:

TreeExplainer: Tree SHAP is a fast and exact method to estimate SHAP values for tree models and ensembles of trees, under several different possible assumptions about feature dependence.

GradientExplainer: An extension of the integrated gradients method, a feature attribution method designed for differentiable models, based on an extension of Shapley values to infinite player games (Aumann-Shapley values).

DeepExplainer: An enhanced version of the DeepLIFT algorithm (Deep SHAP) where, similarly to Kernel SHAP, we approximate the conditional expectations of SHAP values using a selection of background samples.

KernelExplainer: Kernel SHAP is a method that uses a special weighted linear regression to compute the importance of each feature. The computed importance values are Shapley values from game theory and also coefficients from a local linear regression.
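As a rough sketch of the Shapley idea mentioned above (my own summary, not part of the original article): the contribution $\phi_i$ of feature $i$ is its marginal contribution to the prediction, averaged over every subset $S$ of the remaining features,

$$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!}\,\bigl[v(S \cup \{i\}) - v(S)\bigr]$$

where $N$ is the set of all features and $v(S)$ is the model's output when only the features in $S$ are known.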

For our tabular dataset we will use the KernelExplainer to explain the features. SHAP uses JavaScript for its interactive plots, so after importing SHAP we need to call shap.initjs() to show the graphs in the Jupyter notebook; then shap.force_plot gives us the visualization. Note: SHAP can take a long time to produce the output, depending on the dataset size. Now let's visualize the true positive case and see how well SHAP explains the feature contributions.

#Import the SHAP
import shap
shap.initjs()

# Define SHAP Explainer
explainer = shap.KernelExplainer(logreg.predict_proba,x_train)

shap_value = explainer.shap_values(x_test.iloc[tp,:])

#Print the graph
print("Actual: ",y_test.values[tp],"Predicted :",y_pred[tp])
shap.force_plot(explainer.expected_value[1],shap_value[1],x_test.iloc[tp,:])

Now let's take the true negative case.

shap_value3 = explainer.shap_values(x_test.iloc[tn,:])
print("Actual: ",y_test.values[tn],"Predicted :",y_pred[tn])
shap.force_plot(explainer.expected_value[1],shap_value3[1],x_test.iloc[tn,:])

Now compute SHAP values for the complete x_test set to get a graph of the overall picture: where and when each feature's weight goes high or low.

import warnings; warnings.simplefilter('ignore')
shap_value_all = explainer.shap_values(x_test)

shap.force_plot(explainer.expected_value[0],shap_value_all[0],x_test)

Let's get the feature importance bar graph to look at the average impact of each feature on the model output.

shap.summary_plot(shap_value_all,x_test)

 

InterpretML: Designed and developed by Microsoft, this framework has very nice, interactive visualizations. It is easy to set up and uses Plotly, scikit-learn, LIME, SHAP, SALib, treeinterpreter, joblib, and other packages for training interpretable (glassbox) machine learning models and explaining blackbox models. Let's begin by importing the library and showing local explanations for the first 21 test rows.

# Import the library
from interpret import show
from interpret.data import ClassHistogram

# Define a glassbox model: InterpretML's interpretable LogisticRegression wrapper
from interpret.glassbox import LogisticRegression
seed = 1
ebm = LogisticRegression(random_state=seed)
ebm.fit(x_train, y_train)

# Show local explanations for the first 21 test rows
ebm_local = ebm.explain_local(x_test[:21], y_test[:21], name='EBM')
show(ebm_local)

 

Now let's look at the ROC curve.

from interpret.perf import ROC

ebm_perf = ROC(ebm.predict_proba).explain_perf(x_test, y_test, name='EBM')
show(ebm_perf)

Finally, a class histogram of the training data shows how each feature is distributed across the two classes.

hist = ClassHistogram().explain_data(x_train, y_train, name='Train Data')
show(hist)
