
Unable to replicate results on `diabetes` data from paper

abhishek-ghose opened this issue · 6 comments

Hi,

I was trying to replicate some of the Random Forest results from the paper, specifically Figure 3(D) for the diabetes dataset, but I am unable to see the gap in AUC presented in the paper. It's probably me doing something silly :) - I'd appreciate some help!

To simplify identifying a good max_depth for a Random Forest, I'm using the class below - it allows me to use scikit's GridSearchCV:

import utils as data_utils
import numpy as np, pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.base import BaseEstimator
from imodels import HSTreeClassifierCV
from matplotlib import pyplot as plt
import seaborn as sns; sns.set()

class HSRF(BaseEstimator):
    def __init__(self, reg_param_space, num_trees, max_depth):
        self.reg_param_space = reg_param_space
        self.num_trees = num_trees
        self.max_depth = max_depth
        self.HSTCV = None
        self.classes_ = None

    def fit(self, X, y):
        base_clf = RandomForestClassifier(n_estimators=self.num_trees, max_depth=self.max_depth)
        clf = HSTreeClassifierCV(base_clf, reg_param_list=self.reg_param_space)
        clf.fit(X, y)
        self.HSTCV = clf
        # this is needed for scikit's scorer to work
        self.classes_ = clf.estimator_.classes_
        return self  # return self (not the inner clf), per the scikit estimator convention

    def predict(self, X):
        return self.HSTCV.predict(X)

    def predict_proba(self, X):
        return self.HSTCV.predict_proba(X)

And here's my code - the X and y values passed in are from the diabetes dataset:

def run_mwe(X, y):
    reg_param_space = [0.1, 1.0, 10.0, 25.0, 50.0, 100.0]  # these are from the paper, Section 4.2
    num_trees_space = np.arange(2, 21, 2)
    max_depth_space = np.arange(1, 12, 2)

    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, train_size=0.7)
    num_folds = 3
    rows = []  # accumulate result rows; DataFrame.append is deprecated/removed in recent pandas

    # RF experiments first
    for nt in num_trees_space:
        base_clf = RandomForestClassifier(class_weight='balanced', n_estimators=nt)
        clf = GridSearchCV(base_clf, cv=num_folds, param_grid={'max_depth': max_depth_space}, refit=True,
                           scoring='roc_auc')
        clf.fit(X_train, y_train)
        pr = clf.best_estimator_.predict_proba(X_test)[:, 1]
        score = roc_auc_score(y_test, pr)
        rows.append({'method': 'RF', 'score': score, 'num_trees': nt,
                     'cv_max_depth': clf.best_params_['max_depth']})

    # HS Tree experiments next
    for nt in num_trees_space:
        clf = GridSearchCV(HSRF(reg_param_space=reg_param_space, num_trees=nt, max_depth=1),
                            param_grid={'max_depth': max_depth_space}, cv=num_folds, scoring='roc_auc', verbose=4,
                            refit=True)
        clf.fit(X_train, y_train)
        pr = clf.best_estimator_.predict_proba(X_test)[:, 1]
        score = roc_auc_score(y_test, pr)
        rows.append({'method': 'HSRF', 'score': score, 'num_trees': nt,
                     'cv_max_depth': clf.best_params_['max_depth']})

    return pd.DataFrame(rows, columns=['method', 'score', 'num_trees', 'cv_max_depth'])

When I plot score against num_trees from the returned df, I see something like this:

[plot: test AUC vs. number of trees for RF and HSRF on diabetes, with no clear gap between the two]

abhishek-ghose · Aug 20 '22

Thanks for your interest @abhishek-ghose! Sorry you've had a hard time replicating the result...not sure I see any blatant error with your code.

@aagarwal1996 - could you point Abhishek here to the exact code we used to run this experiment in the paper?

csinva · Aug 21 '22

Thank you for your response @csinva, and thank you for reviewing the snippet! Since my initial post, I have investigated a bit more and made these changes:

  1. The RF-only experiments use class_weight='balanced' in their RandomForestClassifier objects, but I forgot to set this parameter in the HSRF class. I'm not sure how much impact this has, but to match configurations, the HSRF class now also sets it.
  2. The HSTreeClassifierCV class uses 3 CV folds by default; in my later experiments I set this to 5, for more rigor and to have more training data available per fold.
  3. Finally, I re-ran the experiments over a bunch of datasets - diabetes, covtype.binary, ionosphere, adult, phishing, haberman - 10 times each.

The train:test splits were 70%:30% (stratified by label), and I capped the total dataset size at ~3000 points where more data was available, e.g., covtype.binary.
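To make changes 1 and 2 concrete, here is a sketch of the updated fit() in the HSRF wrapper (assuming HSTreeClassifierCV exposes the number of folds via a cv keyword, as the version I'm using does):

    def fit(self, X, y):
        # change 1: balance class weights, matching the RF-only experiments
        base_clf = RandomForestClassifier(n_estimators=self.num_trees,
                                          max_depth=self.max_depth,
                                          class_weight='balanced')
        # change 2: 5 CV folds (instead of the default 3) for picking reg_param
        clf = HSTreeClassifierCV(base_clf, reg_param_list=self.reg_param_space, cv=5)
        clf.fit(X, y)
        self.HSTCV = clf
        self.classes_ = clf.estimator_.classes_
        return self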

This is what the plots look like now - note the titles: they carry the dataset name, the number of trials, the number of folds used to select the best max_depth for a given number of trees, and the number of folds for picking the regularization parameter, referred to as reg_param:

[per-dataset plots of mean test AUC vs. number of trees for RF and HSRF]

I tried to quantify how consistently HS is better by reporting p-values of a paired Wilcoxon signed-rank test, where, for a dataset, the pairing is based on the number of trees (lower is better: it means the improvement from HS is more definitive). I also report the size of the gap via the RMSE between the RF and HSRF AUC values (when the Wilcoxon p is small, higher RMSE means HSRF leads to larger improvements). For both statistics, the mean AUC score across trials is used. Here are those numbers:

dataset          p           RMSE
diabetes         0.193359    0.0225293
covtype.binary   0.193359    0.0252946
ionosphere       0.695312    0.0392162
adult            0.0371094   0.00912343
phishing         0.921875    0.00572506
haberman         0.0195312   0.10074
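For reference, a minimal sketch of how I compute these two statistics for a dataset (rf_auc and hs_auc are hypothetical arrays holding the mean AUC across trials, one entry per number-of-trees setting):

    import numpy as np
    from scipy.stats import wilcoxon

    def compare(rf_auc, hs_auc):
        # paired two-sided Wilcoxon signed-rank test; pairing is by number of trees
        p_value = wilcoxon(rf_auc, hs_auc).pvalue
        # RMSE between the two AUC curves, as a measure of the size of the gap
        rmse = np.sqrt(np.mean((np.asarray(rf_auc) - np.asarray(hs_auc)) ** 2))
        return p_value, rmse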

abhishek-ghose · Aug 21 '22

I have shared some code here (works for the diabetes dataset as-is): https://colab.research.google.com/drive/1UbR77KcWCvp5QyIokVbW3LvsnTz4MBHI?usp=sharing

abhishek-ghose · Aug 23 '22

Hi @csinva @aagarwal1996 - any pointers you can provide? Thanks!

abhishek-ghose · Aug 26 '22

Hi @abhishek-ghose, thanks for checking in again. I'll provide some code that can reproduce our experiments shortly. One quick thing I noticed: you seem to be tuning over max_depth for HS as well, whereas we used the default parameters for RF and only tuned over the reg parameter.
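In other words, something along these lines (a sketch, not the exact paper code; X_train/y_train stand in for your data):

    from imodels import HSTreeClassifierCV
    from sklearn.ensemble import RandomForestClassifier

    # default RF (max_depth=None); HSTreeClassifierCV internally cross-validates
    # over reg_param_list, so no outer GridSearchCV over max_depth is needed
    clf = HSTreeClassifierCV(RandomForestClassifier(),
                             reg_param_list=[0.1, 1.0, 10.0, 25.0, 50.0, 100.0])
    clf.fit(X_train, y_train)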

aagarwal1996 · Aug 26 '22

Thanks @aagarwal1996 - I tried some experiments where I don't set the max_depth parameter for the RFs, and I now see a gap (though this leads to some questions/observations, which I mention towards the end of this comment). I performed the following experiments:

  1. I repeated the experiments where I perform a CV over max_depth for both RF and HSRF. As before, results from 10 trials are presented, and the number of folds for finding max_depth and the regularization param are both set to 5.
  2. Two new sets of experiments, where I use the default max_depth setting of scikit's RF, which is None. I call these RFn and HSRFn (note the suffix n, for max_depth=None); see the sketch after this list. As per scikit, this setting means: "If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples."
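A sketch of the two new configurations, reusing the HSRF wrapper from earlier (the n-suffixed names are just my labels):

    # RFn: scikit's default max_depth=None, so no depth tuning at all
    rfn = RandomForestClassifier(class_weight='balanced', n_estimators=nt)

    # HSRFn: hierarchical shrinkage on top of the same default-depth RF;
    # only reg_param is cross-validated (inside HSTreeClassifierCV)
    hsrfn = HSRF(reg_param_space=reg_param_space, num_trees=nt, max_depth=None)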

This is what those results look like.

[per-dataset plots of mean test AUC vs. number of trees for RF, HSRF, RFn, and HSRFn]

For each dataset, I also calculated the p-value of the paired two-sided Wilcoxon signed-rank test. The pairing is over the number of trees in the RF(n)/HSRF(n) models, and the average score across trials is used. I performed the tests separately for RF vs. HSRF and RFn vs. HSRFn. Lower values are better, i.e., they indicate a significant difference between the AUC scores produced by the two techniques. Here they are:

dataset          method1   method2   p
diabetes         RF        HSRF      0.322266
diabetes         RFn       HSRFn     0.00195312
covtype.binary   RF        HSRF      0.0371094
covtype.binary   RFn       HSRFn     0.00195312
ionosphere       RF        HSRF      0.275391
ionosphere       RFn       HSRFn     0.275391
adult            RF        HSRF      0.00976562
adult            RFn       HSRFn     0.00195312
phishing         RF        HSRF      0.193359
phishing         RFn       HSRFn     0.00195312
haberman         RF        HSRF      0.375
haberman         RFn       HSRFn     0.00195312

Now we see a gap - between RFn and HSRFn. But if I learn max_depth as well (RF vs HSRF), the gap diminishes significantly, e.g., on diabetes. Given that max_depth=None grows the individual trees out as far as possible, potentially overfitting them, would it be correct to infer that hierarchical shrinkage is an alternative way to prevent overfitting - and that other ways, like controlling max_depth, are competitive?

abhishek-ghose · Aug 27 '22