imodels
imodels copied to clipboard
Unable to replicate results on `diabetes` data from paper
Hi,
I was trying to replicate some of the Random Forest results from the paper, specifically Figure 3(D) for the diabetes
dataset - but I am unable to see the gap in AUC, as presented in the paper. Its probably me doing something silly :) - appreciate some help!
To simplify identifying a good max_depth
for a Random Forest object, I'm using this class- this allows me to use scikit's GridSearchCV
:
import utils as data_utils
import numpy as np, pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.base import BaseEstimator
from imodels import HSTreeClassifierCV
from matplotlib import pyplot as plt
import seaborn as sns; sns.set()
class HSRF(BaseEstimator):
def __init__(self, reg_param_space, num_trees, max_depth):
self.reg_param_space = reg_param_space
self.num_trees = num_trees
self.max_depth = max_depth
self.HSTCV = None
self.classes_ = None
def fit(self, X, y):
base_clf = RandomForestClassifier(n_estimators=self.num_trees, max_depth=self.max_depth)
clf = HSTreeClassifierCV(base_clf, reg_param_list=self.reg_param_space)
clf.fit(X, y)
self.HSTCV = clf
# this is needed for scikit's scorer to work
self.classes_ = clf.estimator_.classes_
return clf
def predict(self, X):
return self.HSTCV.predict(X)
def predict_proba(self, X):
return self.HSTCV.predict_proba(X)`
And here's my code - the X
and y
values passed in are from the diabetes
dataset:
def run_mwe(X, y):
reg_param_space = [0.1, 1.0, 10.0, 25.0, 50.0, 100.0] # these are from the paper, Section 4.2
num_trees_space = np.arange(2, 21, 2)
max_depth_space = np.arange(1, 12, 2)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, train_size=0.7)
num_folds = 3
df = pd.DataFrame(columns=['method', 'score', 'num_trees', 'cv_max_depth'])
# RF experiments first
for nt in num_trees_space:
base_clf = RandomForestClassifier(class_weight='balanced', n_estimators=nt)
clf = GridSearchCV(base_clf, cv=num_folds, param_grid={'max_depth': max_depth_space}, refit=True,
scoring='roc_auc')
clf.fit(X_train, y_train)
pr = clf.best_estimator_.predict_proba(X_test)[:, 1]
score = roc_auc_score(y_test, pr)
df = df.append({'method': 'RF', 'score': score, 'num_trees': nt, 'cv_max_depth': clf.best_params_['max_depth']},
ignore_index=True)
# HS Tree experiments next
for nt in num_trees_space:
clf = GridSearchCV(HSRF(reg_param_space=reg_param_space, num_trees=nt, max_depth=1),
param_grid={'max_depth': max_depth_space}, cv=num_folds, scoring='roc_auc', verbose=4,
refit=True)
clf.fit(X_train, y_train)
pr = clf.best_estimator_.predict_proba(X_test)[:, 1]
score = roc_auc_score(y_test, pr)
df = df.append(
{'method': 'HSRF', 'score': score, 'num_trees': nt, 'cv_max_depth': clf.best_params_['max_depth']},
ignore_index=True)
return df
When I plot the columns score
against num_trees
in df
, I see something like this:
Thanks for your interest @abhishek-ghose! Sorry you've had a hard time replicating the result...not sure I see any blatant error with your code.
@aagarwal1996 - could you point Abhishek here to the exact code we used to run this experiment in the paper?
Thank you for your response @csinva, and thank you for reviewing the snippet! Since my initial post, I have investigated a bit more. Made these changes:
- The RF-only experiments use
class_weight='balanced'
in theirRandomForestClassifier
objects, but I forgot to set this parameter in the HSRF class. Not sure how much impact this has, but to match up configurations, now the HSRF class also sets this parameter. - The
HSTreeClassifierCV
class uses a default number of folds=3; in my later experiments I set this to 5 for more rigor, and also to have more train data available per fold. - Finally, I re-ran experiments over a bunch of datasets -
diabetes, covtype.binary, ionosphere, adult, phishing, haberman
- 10 times.
The train:test splits were 70%:30%, (stratified by label) and I limited the total datasize to ~3000 points where there was more data available, e.g., covtype.binary
.
This is what the plots look like now - note the titles, they have the dataset name , number of trials, number of folds use to select the best max_depth
for a given number of trees, and the number of folds for picking the regularization param, referred to as reg_param:
I tried to quantify predictability of HS being better by reporting p-values
of a Wilcoxon test (lower is better, i.e., that HS leads to improvements is definitive), where for a dataset, the pairing is based on the number of trees. I also report closeness by calculating the RMSE between the RF and HSRF AUC values (higher is better when the Wilcoxon p
is small, i.e., HSRF leads to large improvements). For either test, the mean AUC score across trials is used. Here are those numbers:
dataset | p | RMSE | |
---|---|---|---|
0 | diabetes | 0.193359 | 0.0225293 |
1 | covtype.binary | 0.193359 | 0.0252946 |
2 | ionosphere | 0.695312 | 0.0392162 |
3 | adult | 0.0371094 | 0.00912343 |
4 | phishing | 0.921875 | 0.00572506 |
5 | haberman | 0.0195312 | 0.10074 |
I have shared some code here (works for the diabetes
dataset as-is): https://colab.research.google.com/drive/1UbR77KcWCvp5QyIokVbW3LvsnTz4MBHI?usp=sharing
H @csinva @aagarwal1996 - any pointers you can provide? Thanks!
Hi @abhishek-ghose thanks for checking in again. I'll provide some code that can reproduce our experiments shortly. One quick thing I noticed is that you seem to be tuning over max_depth for HS as well, whereas we used the default parameters for RF and only tune over the reg parameter
Thanks @aagarwal1996 - I tried out some experiments with not setting the max_depth
parameter for RFs and I now see a gap (but this leads to some questions/observations, I have mentioned them towards the end of the comment). I performed the following experiments:
- I repeated the experiments where I am performing a CV over
max_depth
for both RF and HSRF. As before, results from 10 trials are presented, the number of folds for findingmax_depth
and the regularization param are both set to 5. - Two new sets of experiments, where I am using the default
max_depth
setting for scikit's RF, which isNone
. I am calling these RFn and HSRFn (note the suffixn
, formax_depth=None
). As per scikit, this is what the setting means:If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
.
This is what those results look like.
For each dataset, I also calculated the p-value
of the paired two-sided Wilcoxon signed-rank test. The pairing is over the number of trees in the RF(n)/HSRF(n) models, and average of trial scores are used. I performed the tests separately for RF-HSRF and RFn-HSRFn. Lower values are better, i.e., they indicate a significant difference between the AUC scores produced by the two techniques. Here they are:
dataset | method1 | method2 | p |
---|---|---|---|
diabetes | RF | HSRF | 0.322266 |
diabetes | RFn | HSRFn | 0.00195312 |
covtype.binary | RF | HSRF | 0.0371094 |
covtype.binary | RFn | HSRFn | 0.00195312 |
ionosphere | RF | HSRF | 0.275391 |
ionosphere | RFn | HSRFn | 0.275391 |
adult | RF | HSRF | 0.00976562 |
adult | RFn | HSRFn | 0.00195312 |
phishing | RF | HSRF | 0.193359 |
phishing | RFn | HSRFn | 0.00195312 |
haberman | RF | HSRF | 0.375 |
haberman | RFn | HSRFn | 0.00195312 |
Now we see a gap; this is between RFn and HSRFn. But if I learn the max_depth
as well (RF vs HSRF), this gap significantly diminishes, e.g., diabetes
. Given that the setting of max_depth=None
grows out the individual trees as far out as possible, potentially overfitting them, would it be correct to infer that hierarchical shrinkage is an alternative way to prevent overfitting - and other ways, like controlling the max_depth
, are competitive?