
Best practices to handle NaN/missing values in W or X features?

Open ghost opened this issue 4 years ago • 6 comments

Hi! Thanks for developing this powerful package. I noticed that the Orthogonal/Double Machine Learning estimators do not accept any missing/NaN values in my features (W & X), even when I specify my first-stage Y_model & T_model to be XGBoost classifiers/regressors imported from the xgboost package, which on its own accepts NaN feature values by branching on NaN as its own category.

If I do not impute and call the Double ML estimators, I hit the error shown in the attached screenshot. It looks like the error is raised by a scikit-learn validation.py check that does not accept missing values?

Based on your hands-on experience, what would be the best practice for imputing the NaNs in my W & X features? Take the median value of a feature, or 0, if it's a continuous variable? Or simply replace NaNs with the string value 'nan' so that the feature is treated as categorical when fitting XGBoost or other boosted-tree models such as CatBoost or LightGBM?
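To make the median option concrete, here is a minimal sketch of median imputation paired with a missingness indicator, so the first-stage models can still see which rows were originally NaN. This is one common approach, not an official EconML recommendation; the helper name `impute_with_indicator` and the `age` column are made up for illustration. In practice, scikit-learn's `SimpleImputer(strategy="median", add_indicator=True)` does the same job on whole arrays.

```python
import math
from statistics import median


def impute_with_indicator(column):
    """Median-impute one numeric column; return (filled, was_missing).

    Hypothetical helper for illustration only, not part of EconML.
    """
    observed = [v for v in column if not math.isnan(v)]
    if not observed:
        raise ValueError("column has no observed values to impute from")
    fill = median(observed)
    # 1.0 marks rows that were NaN before imputation.
    was_missing = [1.0 if math.isnan(v) else 0.0 for v in column]
    filled = [fill if math.isnan(v) else v for v in column]
    return filled, was_missing


# Made-up feature column with two missing entries.
age = [34.0, float("nan"), 51.0, 29.0, float("nan")]
age_filled, age_missing = impute_with_indicator(age)
print(age_filled)   # imputed column, safe to pass as part of W or X
print(age_missing)  # extra indicator column to append alongside it
```

Appending the indicator column lets a tree-based first stage split on "was missing" much as it would on a native NaN branch, which a bare median or zero fill cannot do.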

Or can you help point me to the source code where this can be modified so that the Double ML estimators can accept NaN by default?

Thank you!

[Screenshot: Screen Shot 2021-02-19 at 12 41 05 PM]

ghost avatar Feb 19 '21 17:02 ghost

Also running into the same issue. XGBoost can compute the propensity scores and/or regress the mean response variable, even if some of the confounder values are null. This check seems to be blocking the use of XGBoost with null values unnecessarily.

morelandjs avatar Sep 20 '21 16:09 morelandjs

Also running into the same issue using LightGBM. This forces me to impute missing values unnecessarily, and bad imputations may even hurt the performance of my models.

mbessier avatar Nov 05 '21 10:11 mbessier

I agree it's important for econml to accept missing values to support algorithms that directly handle missing values (most notably xgboost, lightgbm and catboost). Forcing the imputation of missing values is suboptimal in many circumstances. causalml already supports this functionality.

esbraun avatar Nov 29 '21 20:11 esbraun

Agreed with @esbraun. Even just being able to use scikit-learn pipeline estimators with imputation stages would help greatly. You could imagine a form of multiple imputation using this strategy, with more MC steps in the various metalearners. All that would need to change is to let NaNs pass through to the input estimators; there is no need to call the scikit-learn checks as early as is done here:

https://github.com/microsoft/EconML/blob/7dd7683c987018511a07318b6f4b165018373aad/econml/utilities.py#L544

The underlying estimator could handle the inputs when appropriate.

dsteinberg avatar Feb 18 '22 01:02 dsteinberg

Any update on this issue? I'm using CatBoost as the underlying model, which also supports NaN feature values.

mshijie avatar Jun 09 '22 05:06 mshijie

Thinking about this more, I'm concerned that not allowing the propagation of NaNs can actually lead to bias/overconfidence. See issue #664.

dsteinberg avatar Aug 18 '22 00:08 dsteinberg

Hi @moprescu @kbattocchi

Would it be possible to get a solution to this problem? I would very much appreciate it. Maybe just send a warning or a guide in https://econml.azurewebsites.net/spec/estimation/dml.html?

olamagnusandersson avatar May 23 '23 15:05 olamagnusandersson

Seconding @olamagnusandersson. I’d still rather see a warning than an error thrown for the reasons below.

I agree it's important for econml to accept missing values to support algorithms that directly handle missing values (most notably xgboost, lightgbm and catboost). Forcing the imputation of missing values is suboptimal in many circumstances. causalml already supports this functionality.

esbraun avatar May 23 '23 19:05 esbraun

Hi, just wondering whether there has been any solution to this problem yet? I am facing the same issue with NaNs in X.

vs759 avatar Jun 02 '23 12:06 vs759

Thanks all for your feedback. We currently have a PR in progress to enable missing values for W. Enabling missing values in X is less clear-cut; see this message from a discussion in our Discord:

I agree that it would be reasonable for us to address this, but note that for many of our estimators this would really only work for W and not for X (e.g. for LinearDML, our second stage model is running a regression of Y_res on (T_res cross X), so any NaNs in X will be a problem even if the first stage model handles them without issue).
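To illustrate the point about the second stage, here is a simplified sketch (not EconML's actual code) of why a NaN in X breaks a regression of Y_res on (T_res cross X) even when the first stage coped fine. The residual values are made up, and the design uses a single feature plus an intercept for brevity.

```python
import math

nan = float("nan")

# Made-up residuals from a first stage that handled NaNs without trouble.
y_res = [0.5, -1.0, 0.3, 0.8]
t_res = [1.0, -0.5, 0.2, 0.7]
x = [2.0, nan, 1.0, 3.0]  # one missing value in the effect modifier X

# Second-stage design: T_res crossed with (1, X).
design = [(t, t * xi) for t, xi in zip(t_res, x)]

# Least squares works with sums over rows; any sum that touches the
# T_res * X column inherits the NaN from the single bad row.
xtx_00 = sum(row[0] * row[0] for row in design)
xtx_11 = sum(row[1] * row[1] for row in design)
xty_1 = sum(row[1] * y for row, y in zip(design, y_res))

print(math.isnan(design[1][1]))  # the crossed feature for the NaN row
print(math.isnan(xtx_00))        # the T_res-only term
print(math.isnan(xtx_11), math.isnan(xty_1))  # terms involving X
```

Because the NaN enters the design matrix itself, the fitted coefficients are undefined unless the affected rows are dropped or X is imputed first. W never appears in this design, which is why passing NaNs through for W is the easier case.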

fverac avatar Jul 19 '23 16:07 fverac