
Estimate Effect fails with EconML DML estimator

Open andresmor-ms opened this issue 2 years ago • 5 comments

Describe the bug: Executing estimate_effect with categorical data and the backdoor.econml.dml.DML estimator fails with the error: KeyError: "['x7'] not in index".

Steps to reproduce the behavior: the following code example reproduces the error.

import numpy as np
import pandas as pd
from dowhy import CausalModel
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import PolynomialFeatures
graph_str = """
digraph {x0;x1;x2;x3;x4;x5;x6;x7;x8;x9;x10;x11;
x0 -> x1;
x0 -> x9;
x0 -> x11;
x1 -> x10;
x1 -> x11;
x2 -> x0;
x2 -> x1;
x3 -> x0;
x3 -> x1;
x4 -> x0;
x4 -> x1;
x5 -> x0;
x6 -> x0;
x7 -> x1;
x8 -> x1}
"""
length = 100
d = {
    "x0": np.random.binomial(1, 0.5, size=length),
    "x1": np.random.randn(length),
    "x2": np.random.randint(3, size=length),
    "x3": np.random.randn(length),
    "x4": np.random.binomial(1, 0.5, size=length),
    "x5": np.random.randint(3, size=length),
    "x6": np.random.randn(length),
    "x7": np.random.randint(3, size=length),
    "x8": np.random.randn(length),
    "x9": np.random.randn(length),
    "x10": np.random.randn(length),
    "x11": np.random.randn(length),
}
df = pd.DataFrame(d)
df = df.astype({name: "category" for name in ["x0", "x2", "x4", "x5", "x7"]}, copy=False)
model = CausalModel(data=df, treatment="x0", outcome="x1", graph=graph_str)
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True, method_name="minimal-adjustment")
model.estimate_effect(
    identified_estimand,
    method_name="backdoor.econml.dml.DML",
    method_params={
        "init_params": {
            "model_y": GradientBoostingRegressor(),
            "model_t": GradientBoostingClassifier(),
            "model_final": LassoCV(fit_intercept=False),
            "featurizer": PolynomialFeatures(degree=2, include_bias=True),
            "discrete_treatment": True,
        },
        "fit_params": {},
    },
)

Expected behavior: In previous versions of dowhy this works as expected: a CausalEstimate is returned instead of an error (e.g., dowhy rev: f523accf96d6bb1afa6f3f3f3ba13cb1272b4150).

Version information:

  • DoWhy version 0.9 and master branch

Additional context: After investigating, I found that the error above appears after the changes in commit #768.

andresmor-ms avatar Jan 11 '23 21:01 andresmor-ms

Hi, ran into similar issues. Thanks for the reproducible error, I'll take a look.

MichaelMarien avatar Jan 15 '23 12:01 MichaelMarien

I discovered that this error goes away after adding self._effect_modifier_names = list(self._effect_modifiers.columns) to _set_effect_modifiers in the causal_estimator.py file, but I'm not sure whether this would change the results and make it return an incorrect answer.

andresmor-ms avatar Jan 16 '23 16:01 andresmor-ms

Yes, the issue appears in line https://github.com/py-why/dowhy/blob/325cf4e245de3e55b85a42c5fefc36f6ef34db46/dowhy/causal_estimator.py#L137

where the 'dummies' method creates columns x7_1, x7_2. Unfortunately, the same dummification process is not applied to the data itself, which throws an error in line https://github.com/py-why/dowhy/blob/325cf4e245de3e55b85a42c5fefc36f6ef34db46/dowhy/causal_estimators/econml.py#L307 as df only knows x7. We need to keep them aligned; there are multiple options, and I'm not sure which one is best.

MichaelMarien avatar Jan 18 '23 19:01 MichaelMarien

Yeah, the reason is a misalignment between effect_modifier_names and the effect_modifiers data. We could follow @andresmor-ms's recommendation, but I wonder whether that would affect other estimators. Also, effect_modifier_names is supposed to be a user-interpretable list of variables; expanding it to include dummy variables can make the list too long.
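The misalignment can be illustrated with plain pandas, outside of DoWhy. This is only a minimal sketch: the variable names mirror the discussion but are hypothetical stand-ins, and it assumes the effect modifiers are encoded with pd.get_dummies(..., drop_first=True), which is consistent with the x7_1, x7_2 columns mentioned above.

```python
import pandas as pd

# Toy stand-in for the effect-modifier data; "x7" is categorical, as in the report.
data = pd.DataFrame({"x7": pd.Categorical([0, 1, 2, 1, 0, 2])})

# Hypothetical names mirroring the thread, not DoWhy's actual attributes.
effect_modifier_names = ["x7"]

# Assumed encoding step: one-hot encode and drop the first category.
effect_modifiers = pd.get_dummies(data[effect_modifier_names], drop_first=True)
encoded_columns = list(effect_modifiers.columns)
print(encoded_columns)

# The name list still refers to the original column, so selecting by name
# on the encoded frame raises a KeyError.
try:
    effect_modifiers[effect_modifier_names]
    lookup_failed = False
except KeyError as err:
    lookup_failed = True
    print("KeyError:", err)
```

The exact wording of the KeyError depends on the pandas version and on whether any of the requested columns exist, so the message may differ from the one in the report.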

For EconML only, I suggest the following simple change: just remove effect_modifier_names from this line, https://github.com/py-why/dowhy/blob/325cf4e245de3e55b85a42c5fefc36f6ef34db46/dowhy/causal_estimators/econml.py#L307

and write,

filtered_df = df.values 

since the df being provided is always X and is derived from effect_modifiers in estimate_effect, this should work as well. @andresmor-ms, can you try that and check whether it solves the issue?
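As a rough pandas-only check of that suggestion (not DoWhy code): if the frame passed in already contains only the encoded effect modifiers, taking df.values returns the same array as selecting every encoded column by name, without consulting a separate name list. The toy data and the dtype=float choice are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy effect-modifier data: one categorical and one continuous column.
data = pd.DataFrame({
    "x7": pd.Categorical([0, 1, 2, 1, 0, 2]),
    "x8": [0.1, -0.4, 1.3, 0.7, -1.1, 0.2],
})

# Here df plays the role of X: the already-encoded effect modifiers.
# dtype=float keeps the resulting array numeric rather than mixed bool/float.
df = pd.get_dummies(data, drop_first=True, dtype=float)

# Selecting by the encoded column names and taking .values directly
# yield the same array when df holds nothing but the effect modifiers.
by_name = df[list(df.columns)].values
by_position = df.values
same = np.array_equal(by_name, by_position)
print(by_position.shape, same)
```

Because .values drops the column labels, this approach relies on the caller passing columns in the same order that was used when the estimator was fit.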

amit-sharma avatar Jan 20 '23 11:01 amit-sharma

@amit-sharma I just created a draft PR for this issue (https://github.com/py-why/dowhy/pull/828). I had to modify another line to make the tests pass, but I'm not sure whether that would cause the same issue when calling that particular function of the econml estimator. Could you please take a look?

That function is only used in the tests. I avoided modifying the tests and only modified the estimator class; I guess you would need to call the effect_tt function with the right dataframe for it to work.

andresmor-ms avatar Jan 23 '23 22:01 andresmor-ms