Hi, I played around with some simulated GAM (generalized additive model) data with correlated features and found that EBM sometimes arrives at wrong conclusions as to the risk-driver mappings. Small changes to the data may "fix" the issue, thus I cannot say that I see this erroneous fit consistently. Looks as if the mapping of single risk-drivers is something in between the actual mapping and the marginal conditional distribution (something like Prob(Y=1/X1=x))....?

Code Example

import numpy as np import pandas as pd from interpret.glassbox import ExplainableBoostingClassifier from scipy.stats import multivariate_normal from interpret import show from sklearn.metrics import roc_auc_score as auc from pygam import LogisticGAM, s import matplotlib.pyplot as plt

##construct data np.random.seed(123) n=100000 rho=0.9 df=pd.DataFrame(multivariate_normal(mean=[0,0,0],cov=[[1,rho,rho],[rho,1,rho],[rho,rho,1]]).rvs(size=n),columns=['x1','x2','x3']) num=['x1','x2','x3'] cat=[] df['prob_true']=1/(1+np.exp(2-df.x1-2*np.sin(np.pi+df.x2))) df['tar']=np.random.binomial(n=1,p=df.prob_true) target='tar' auc(df.tar,df.prob_true)

df_test=df.sample(frac=0.2) df=df.drop(df_test.index)

##EBM clf = ExplainableBoostingClassifier(n_jobs=-1, interactions=0, random_state=0) clf.fit(df[cat+num],df[target]) df_test['pred_ebm']=clf.predict_proba(df_test[cat+num])[:,1] auc(df_test[target],df_test.pred_ebm) ##AUC is less than expected

##X1 is not a line, X3 is not flat ebm_global = clf.explain_global() show(ebm_global)

##compare to PyGAM results gam=LogisticGAM(s(0)+s(1)+s(2)) gam.fit(df[cat+num],df[target]) df_test['pred_pygam']=gam.predict_proba(pd.concat([np.clip(df_test[c],df[c].min(),df[c].max()) for c in cat]+[df_test[num]],axis=1)) auc(df_test[target],df_test.pred_pygam)

for i, term in enumerate(gam.terms): if term.isintercept: continue

  XX = gam.generate_X_grid(term=i)
  pdep, confi = gam.partial_dependence(term=i, X=XX, width=0.95)

  plt.figure()
  plt.plot(XX[:, term.feature], pdep)
  plt.plot(XX[:, term.feature], confi, c='r', ls='--')
  plt.title(repr(term))
  plt.show()

##change data slightly (delete pi) np.random.seed(123) n=100000 rho=0.9 df=pd.DataFrame(multivariate_normal(mean=[0,0,0],cov=[[1,rho,rho],[rho,1,rho],[rho,rho,1]]).rvs(size=n),columns=['x1','x2','x3']) num=['x1','x2','x3'] cat=[] df['prob_true']=1/(1+np.exp(2-df.x1-2*np.sin(df.x2))) # deleted np.pi df['tar']=np.random.binomial(n=1,p=df.prob_true) target='tar' auc(df.tar,df.prob_true)

df_test=df.sample(frac=0.2) df=df.drop(df_test.index)

##EBM clf = ExplainableBoostingClassifier(n_jobs=-1, interactions=0, random_state=0) clf.fit(df[cat+num],df[target]) df_test['pred_ebm']=clf.predict_proba(df_test[cat+num])[:,1] auc(df_test[target],df_test.pred_ebm)

##X1 is a line, X2 is not as sinusoidal as expected, X3 still not flat ebm_global = clf.explain_global() show(ebm_global)

May 10 '21 11:05 ruedika

Hi @ruedika,

Thanks for the insightful question, and for taking the time to provide code! Unfortunately, no GAM fitting method can universally approximate all possible generator functions, and EBMs are no exception. As you've noticed, for functions where the generator components are very simple/smooth (e.g. linear or sinusoidal), tree based optimizations have a harder time than splines with sufficiently small sample size. However, in practice we've found that real data tends to be far messier, with occasional large step function like changes at significant thresholds (e.g. zero vs. non-zero is often highly significant in many datasets). In both benchmarks we've run and findings from the literature, we find that tree based methods tend to represent complex datasets better and outperform splines.

EBMs also have differences to other tree based GAM learning methods. Here's both EBM and a depth=1 LightGBM on your dataset (with rho=0.7 to highlight the differences a bit better):

While x1 and x2 are (arguably) better represented in EBMs, LightGBM is able to completely ignore the redundant x3. This is an active design choice we make in EBMs, which does come with some tradeoffs. For example, here's a slight modification of your problem where I added a new feature "x4" that is perfectly correlated with "x2", and used equivalently in the label generation:

np.random.seed(123)
n=100000
rho=0.7
df=pd.DataFrame(multivariate_normal(mean=[0,0,0],cov=[[1,rho,rho],[rho,1,rho],[rho,rho,1]]).rvs(size=n),columns=['x1','x2','x3'])
df['x4'] = df['x2']
num=['x1','x2','x3', 'x4']
cat=[]
df['prob_true']=1/(1+np.exp(2-df.x1-np.sin(np.pi+df.x2)-np.sin(np.pi+df.x4)))
df['tar']=np.random.binomial(n=1,p=df.prob_true)

And the corresponding learned functions of both models:

EBM in this case learns almost identical functions for x2 and x4, whereas "greedy" tree based methods will arbitrarily choose one of the two features to put all of the probability mass on. We find that the property of proportionally allocating credit under high correlation (as opposed to putting as much mass into as few features as possible), helps users reason about the model and find bugs in their data (e.g. redundant copies of features that may cause problems later). Of course, we could still do better when it comes to regularizing redundant features out, and we're actively exploring research in this direction.

If you're curious about looking into this further, many of these topics are explored in greater detail by Kingsley Chang and his co-authors in this paper: https://arxiv.org/pdf/2006.06466.pdf . In it, they rigorously benchmark and explore how well different GAM fitting methods (including splines, XGBoost based trees, EBMs, and variants of linear models) learn a variety of additive generator functions.

Hope this helps, and happy to continue the discussion further! -InterpretML Team

May 12 '21 21:05 interpret-ml

Hi, Thanks for taking the time and for your comprehensive answer! But I think the issue I've mentioned is still one of convergence. To speed up convergence in the above example I've played with the learning rates. First to set up the experiment again:

import numpy as np
import pandas as pd
from interpret.glassbox import ExplainableBoostingClassifier
from scipy.stats import multivariate_normal
from interpret import show
from sklearn.metrics import roc_auc_score as auc
from pygam import LogisticGAM, s
import matplotlib.pyplot as plt

np.random.seed(123)
n=100000
rho=0.9
df=pd.DataFrame(multivariate_normal(mean=[0,0,0],cov=[[1,rho,rho],[rho,1,rho],[rho,rho,1]]).rvs(size=n),columns=['x1','x2','x3'])
num=['x1','x2','x3']
cat=[]
df['prob_true']=1/(1+np.exp(2-df.x1-2*np.sin(np.pi+df.x2)))
df['tar']=np.random.binomial(n=1,p=df.prob_true)
target='tar'
auc(df.tar,df.prob_true)

df_test=df.sample(frac=0.2)
df=df.drop(df_test.index)

Note that the AUC of a good model should be around 0.69 (model predictions mimicking the true underlying probabilities). The EBM set up as in my first post got 0.64. So I've fitted EBM with changing learning rates:

auc_list=[]
for lr in np.arange(0.01,0.2,0.01):
    clf = ExplainableBoostingClassifier(n_jobs=-1, interactions=0, random_state=0, learning_rate=lr)
    clf.fit(df[cat+num],df[target])
    df_test['pred_ebm']=clf.predict_proba(df_test[cat+num])[:,1]
    auc_list.append(auc(df_test[target],df_test.pred_ebm))

plt.plot(np.arange(0.01,0.2,0.01),auc_list)
plt.xlabel('learning rate')
plt.ylabel('AUC')

If you plot this (sorry, I can't upload) you'll see that starting from a learning rate of 0.05 EBM actually get's the right shapes. Which you can also verify directly:

clf = ExplainableBoostingClassifier(n_jobs=-1, interactions=0, random_state=0, learning_rate=0.1)
clf.fit(df[cat+num],df[target])

ebm_global = clf.explain_global()
show(ebm_global)

Overall I think that the round robin has convergence issues for this pathological dataset. One way to amend this, as you've done above, is to lower correlation. How to deal with this within the algorithm, I don't know. Maybe default learning rate schedules (instead of lr=0.01) could fix some cases?

Thanks & also happy continue discussion!

May 21 '21 05:05 ruedika

Inconsistent estimation under correlation

Code Example