EconML Discrepancy between 0.14.1 and 0.15.0


Open winston-zillow opened this issue 1 year ago • 0 comments

I have a fixed dataset with ~200 covariates and a 6-category discrete treatment. I trained CausalForestDML models in both v0.14.1 and v0.15.0 with identical configuration and code, but the results do not agree. For example, for one treatment that is considered "no harm", the v0.14.1 model estimates that 9% of the treated experience negative effects from the treatment, while the v0.15.0 model estimates 42%. Only 4.4% of the training sample opted into this treatment, but another treatment with 11% prevalence shows the same kind of discrepancy. (I weighted samples inversely to the prevalence of the treatments.) The v0.15.0 figure is also similar to what I get if I just use econml.grf.CausalForest.
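For readers who want to try this on their own data: the issue does not show how sample_weight was built, so the following is only a minimal sketch of one common way to weight samples inversely to treatment prevalence. The function name and the toy data are hypothetical; the formula is the same one scikit-learn uses for "balanced" class weights.

```python
from collections import Counter


def inverse_prevalence_weights(treatments):
    """Return one weight per sample, proportional to 1 / (size of its treatment arm).

    Hypothetical helper; not taken from the issue author's code.
    """
    counts = Counter(treatments)
    n = len(treatments)
    n_arms = len(counts)
    # Each arm's weights sum to n / n_arms, so every arm contributes equally.
    return [n / (n_arms * counts[t]) for t in treatments]


# Toy example: arm 0 has 8 samples, arm 1 has 2 samples.
toy_treatments = [0] * 8 + [1] * 2
toy_weights = inverse_prevalence_weights(toy_treatments)
```

With weights like these, a rare arm (such as the 4.4% one described above) carries the same total weight as a common one, which also means a small number of samples can dominate the estimates for that arm.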

My dataset is proprietary real-world data, and I haven't yet checked whether this can be reproduced with synthetic data. The results above are reproducible across multiple runs in each version.

Further, I noticed that v0.15.0 was released a few days ago, on Feb 14, 2024, but has no release tag.
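When comparing runs across versions, it can help to record exactly which package versions each run used. The sketch below assumes the two runs live in separate Python environments (the issue does not say). It uses only the standard library's importlib.metadata; the helper name is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Log the versions that matter for this comparison; the nuisance models come
# from scikit-learn, so its version may differ between environments as well.
print("econml:", installed_version("econml"))
print("scikit-learn:", installed_version("scikit-learn"))
```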

What are the differences between these two versions?

My estimator is defined as:

import os as _os  # assumed alias; the original snippet only shows `_os.cpu_count()`

from econml.dml import CausalForestDML
from sklearn.ensemble import ExtraTreesClassifier, RandomForestRegressor

n_trees, n_subtrees = 128, 128 // 4
self.estimator = CausalForestDML(
    # first-stage (nuisance) models for the outcome and the treatment
    model_y=RandomForestRegressor(n_estimators=n_trees, max_depth=10, min_samples_leaf=10, n_jobs=-1),
    model_t=ExtraTreesClassifier(n_estimators=n_trees, max_depth=10, min_samples_leaf=10, n_jobs=-1),
    criterion='het',
    n_estimators=n_trees,
    discrete_treatment=True,
    categories='auto',
    treatment_featurizer=None,

    # final-stage causal forest settings
    min_samples_leaf=10,
    max_samples=0.1,
    min_balancedness_tol=.3,
    max_depth=15,
    min_var_fraction_leaf=0.05,
    min_var_leaf_on_val=True,
    min_impurity_decrease=0.0,
    inference=True,
    fit_intercept=True,
    subforest_size=n_subtrees,
    honest=True,
    verbose=0,
    n_jobs=_os.cpu_count(),
)

# training
X.shape
# => (779744, 222)
self.estimator.fit(X=X, T=T, Y=y, sample_weight=sample_weight)

# eval inference
effects = self.estimator.const_marginal_effect(X.to_numpy())
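
The issue does not show how the "share of treated with negative effects" was computed from effects. The sketch below is one plausible way, under two assumptions: each row of effects holds one estimated effect per non-baseline treatment category, and the caller knows which column belongs to which treatment. The function name, its arguments, and the toy data are hypothetical.

```python
def share_negative_among_treated(effects, treatments, arm, column):
    """Fraction of samples that received `arm` whose estimated effect in `column` is below zero.

    `effects` is a sequence of rows (a 2-D NumPy array also works) and
    `treatments` holds the observed treatment of each sample.
    """
    treated = [row[column] for row, t in zip(effects, treatments) if t == arm]
    if not treated:
        raise ValueError(f"no samples with treatment {arm!r}")
    return sum(1 for e in treated if e < 0) / len(treated)


# Toy data: two non-baseline arms, so each row holds two effects
# (assumed layout: column 0 is arm 1 vs. baseline, column 1 is arm 2 vs. baseline).
toy_effects = [[0.3, -0.1], [-0.2, 0.4], [0.1, 0.2], [-0.5, -0.3]]
toy_assigned = [1, 1, 2, 1]
toy_share = share_negative_among_treated(toy_effects, toy_assigned, arm=1, column=0)
```

Running the same function on the effects from each version, for the same treatment and column, gives directly comparable figures like the 9% and 42% quoted above.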

Feb 22 '24 23:02 winston-zillow
