
CausalForestDML with binary outcome and treatment

Open kayoungcarmen opened this issue 1 year ago • 3 comments

Hi, I'm building a causal forest with a binary outcome and a binary treatment. I have plenty of observations (over 100K) for both groups, but the model doesn't seem to work well: const_marginal_ate is -0.00152163 and feature_importances_ returns an array of zeros that looks like array([0., 0., 0.,...

Could you let me know a) whether CausalForestDML is the right method to use for a binary outcome and binary treatment and b) whether the model setup below is correct?

from econml.dml import CausalForestDML
from sklearn.ensemble import RandomForestClassifier

# set variables for causal forest
Y = train[one variable]
T = train[one variable]
X = train[20 variables]
W = None
X_test = test[20 variables]

# set parameters for causal forest
est = CausalForestDML(criterion='het',
                      min_impurity_decrease=0.001,
                      n_estimators=1000,
                      min_samples_leaf=10,
                      max_depth=None,
                      max_samples=0.5,
                      discrete_treatment=True,
                      honest=True,
                      inference=True,
                      cv=5,
                      model_t=RandomForestClassifier(random_state=0),
                      model_y=RandomForestClassifier(random_state=0),
                      )

# Fit the model
est.fit(Y, T, X=X, W=W)

kayoungcarmen avatar Jun 08 '23 15:06 kayoungcarmen

A near-zero const_marginal_ate doesn't seem inherently problematic - maybe there's just very little impact of treatment (and also, you should look at the confidence intervals to get some sense of whether large effects in either direction are actually ruled out). However, here are some other thoughts:

  1. DML does not directly model discrete outcome data (as opposed to discrete treatment data). That doesn't mean you can't use DML when your outcome is discrete, though the effect is then more naturally interpreted as a change in likelihood than as a direct effect. It does mean that DML's first-stage logic simply calls predict on whatever outcome model you pass in, which is probably not ideal for a classifier, since predict_proba would give finer-grained information. In the binary case, and with something like random forests where both classifiers and regressors exist, the easiest solution is to switch to the regressor for the y model only; when discrete_treatment is True, we automatically handle this for the t model.
  2. If what you primarily care about is the ATE, consider moving your variables from X to W and using LinearDML instead (since CausalForestDML does not support X=None). In general, the more variables there are in X, the harder the statistical problem becomes, and unless you have a lot of meaningful treatment variation it will be very hard to estimate the final model. CausalForestDML can fit a much more flexible model than LinearDML, which is good if the truth is very non-linear, but if the residuals are noisy it may overfit or converge much more slowly than a linear model.

kbattocchi avatar Jun 08 '23 15:06 kbattocchi

@kbattocchi Thank you so much for the answer. My follow-up questions are:

  • For binary outcome and binary treatment problems, which modeling techniques offered by EconML would you recommend, if not CausalForestDML?
  • My primary focus is to find the subsets that are more significantly affected by the treatment than others. Would you still recommend moving my variables from X to W and using LinearDML in this case?

kayoungcarmen avatar Jun 08 '23 19:06 kayoungcarmen

Hello, did you find the right model for a binary outcome using EconML? I am in the same situation.

jaimesempertegui avatar Dec 16 '23 00:12 jaimesempertegui