dowhy `distance_matching` estimator raises an exception when binary treatment encoded using `int`s

Describe the bug When using distance_matching estimator with binary treatment encoded as int type (1s and 0s rather than True and False), DoWhy throws the following exception:

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Input In [418], in <cell line: 2>()
      1 # Get estimate (Linear Regression)
----> 2 estimate = model.estimate_effect(
      3     identified_estimand=estimand,
      4     method_name="backdoor.distance_matching",
      5     target_units="ate",
      6     method_params={'distance_metric': "minkowski", 'p':2})

File ~\anaconda3\envs\causal_book_py38\lib\site-packages\dowhy\causal_model.py:297, in CausalModel.estimate_effect(self, identified_estimand, method_name, control_value, treatment_value, test_significance, evaluate_effect_strength, confidence_intervals, target_units, effect_modifiers, fit_estimator, method_params)
    295     if method_params is None:
    296         method_params = {}
--> 297     self.causal_estimator = causal_estimator_class(
    298         self._data,
    299         identified_estimand,
    300         self._treatment, self._outcome, #names of treatment and outcome
    301         control_value = control_value,
    302         treatment_value = treatment_value,
    303         test_significance=test_significance,
    304         evaluate_effect_strength=evaluate_effect_strength,
    305         confidence_intervals = confidence_intervals,
    306         target_units = target_units,
    307         effect_modifiers = effect_modifiers,
    308         **method_params,
    309         **extra_args)
    310 else:
    311     # Estimator had been computed in a previous call
    312     assert self.causal_estimator is not None

File ~\anaconda3\envs\causal_book_py38\lib\site-packages\dowhy\causal_estimators\distance_matching_estimator.py:45, in DistanceMatchingEstimator.__init__(self, num_matches_per_unit, distance_metric, exact_match_cols, *args, **kwargs)
     43     error_msg = "Distance Matching method is applicable only for binary treatments"
     44     self.logger.error(error_msg)
---> 45     raise Exception(error_msg)
     47 self.num_matches_per_unit = num_matches_per_unit
     48 self.distance_metric = distance_metric

Exception: Distance Matching method is applicable only for binary treatments

Note that when using the same dataset with a binary treatment encoded as bool no error occurs.

Steps to reproduce the behavior

form scipy import stats
import numpy as np
import pandas as pd
import dowhy
from dowhy import CausalModel

# Generate the data
SAMPLE_SIZE = 10_000
MAX_AGE = 50

age = stats.halfnorm.rvs(loc=19, scale=10, size=SAMPLE_SIZE).astype(int)
age = np.where(age > MAX_AGE, np.random.choice(np.arange(20, MAX_AGE)), age)

took_a_course = stats.bernoulli(p=10/age).rvs()#.astype(bool)

earnings = 75000 + took_a_course * 10000 + age * 1000 + age**2 * 50 + np.random.randn(SAMPLE_SIZE) * 2000
earnings = earnings.round()


# Construct the graph (the graph is constant for all iterations)
nodes = ['took_a_course', 'earnings', 'age']
edges = [
    ('took_a_course', 'earnings'),
    ('age', 'took_a_course'),
    ('age', 'earnings')
]

# Generate the GML graph
gml_string = 'graph [directed 1\n'

for node in nodes:
    gml_string += f'\tnode [id "{node}" label "{node}"]\n'

for edge in edges:
    gml_string += f'\tedge [source "{edge[0]}" target "{edge[1]}"]\n'
    
gml_string += ']'

data = pd.DataFrame(dict(
    age=age,
    took_a_course=took_a_course,
    earnings=earnings
))


# Instantiate the CausalModel 
model = CausalModel(
    data=data,
    treatment='took_a_course',
    outcome='earnings',
    graph=gml_string
)


# Get the estimand
estimand = model.identify_effect()

# Get estimate (Linear Regression)
estimate = model.estimate_effect(
    identified_estimand=estimand,
    method_name="backdoor.distance_matching",
    target_units="ate",
    method_params={'distance_metric': "minkowski", 'p':2})

Expected behavior It should not matter what encoding we use for binary treatment or it should be explicitly specified that only bool-typed values are accepted and error message should reflect that.

Version information:

DoWhy version 0.8

Additional context

Nov 22 '22 07:11 AlxndrMlk

@AlxndrMlk this is intended behavior. To avoid the tricky task of auto-detecting the type of a variable, we are relying on the pandas dataframe to provide us the correct data type. That is, we expect the user to provide the correct data type. The problem is that sometimes an int variable may have more than two values in the data, and that may raise unknown errors in the estimation.

Great point on having a more informative error message. Would you like to add a PR for that and extend this error message?

Nov 23 '22 05:11 amit-sharma

@amit-sharma, I see.

How about adding a simple check if the unique values are 0 and 1? It seems to be a very common way to encode binary treatments in the industry. Would that make sense from your point of view?

I am happy to add a PR, yet I could start working on it Jan 2023. Would that work?

Nov 23 '22 05:11 AlxndrMlk

We can consider adding a simple check, but the concern was computational efficiency. When I checked earlier, other libraries tended to loop through the entire data (O(N)) or check until a fixed number of rows (E.g., 10000). I felt it was more transparent expecting the user to specify the type, but I am open to adding functionality that automatically infers it.

PR in early Jan sounds good!

Nov 23 '22 05:11 amit-sharma

PR in early Jan sounds good!

Great, adding to the calendar and will keep you posted.

We can consider adding a simple check, but the concern was computational efficiency.

Oh, I see, that's a valid concern. If you don't mind, I am happy to look into it and discuss this with you further if we find a sensible solution.

Nov 23 '22 05:11 AlxndrMlk

Hi @amit-sharma

I am going back to this topic now.

I want to give one more argument for us to consider before we start working on this.

When using EconML estimators with DoWhy, it is possible to encode treatment as float / int

If we restrict DoWhy methods to only bool this leads to the inconsistency between behaviors of DoWhy vs EconML estimators.

Why this can be problematic?

Imagine we want to create a benchmark for our research paper. We started working with EconML estimators and now we want to add the matching estimator. We need to re-code the data, which might be error prone, especially if the project is already advanced.

Another example - a production system. We want to allow users to use different types of estimators, including the ones that allow for multi-level treatments. We'd naturally encode treatment levels as ints, but this would require special logic for matching (and possibly other estimators with the same limitation).

As I understand, from your perspective the main concern is that

an int variable may have more than two values in the data

My proposal is that we add explicit checks to make sure that if the treatment is int / float it contains only 0 and 1 (possibly we could also raise a warning when there are only 0s or only 1s). I believe that this gives the user more consistency and more flexibility, translating into a smoother user experience.

What are your thoughts on this?

Jan 12 '23 13:01 AlxndrMlk

@AlxndrMlk , thanks for coming back to this.

Is this a change for DoWhy's built-in estimators only? Or is it a change to the DoWhy scaffolding around all estimators (whether built-in or from EconML)?

Do you think this should be part of the estimation step, or an explicit preprocessing step that happens before the dataframe is passed into the estimation / fit steps?

I am wondering if this binary example is the only time we are going to run into a similar problem? Are there other things we would want to be checking as long as we are scanning through all the data? (NaNs, nulls, a column that has all the same value....)

Jan 17 '23 08:01 emrekiciman

@AlxndrMlk thanks for the thoughtful reply. Agree that we should be consistent with EconML as far as possible.

My proposal is that we add explicit checks to make sure that if the treatment is int / float it contains only 0 and 1 (possibly we could also raise a warning when there are only 0s or only 1s). I believe that this gives the user more consistency and more flexibility, translating into a smoother user experience.

Sounds good. So if a user provides bool, then that's great. But if they provide int/float, then we check the values as you write above. If they provide any other values apart from 0/1 (e.g., -1/1), then we raise an error. I think this is reasonable.

@emrekiciman this will be a change to dowhy's builtin estimators (e.g., propensity based) that raise an error if the treatment is not binary type. Currently, we have an explicit raise Error that precludes someone from passing in an int, and @AlxndrMlk is suggesting to remove this super-strict check.

Jan 20 '23 11:01 amit-sharma

Thanks for clarifying. My first read was that this was a broader data preprocessing / validation problem. This targeted change sounds good to me.

Jan 20 '23 16:01 emrekiciman

Thanks for the questions @emrekiciman, thanks for clarifying @amit-sharma

Sounds good! Do we expect to experience similar scenario with other DoWhy estimators (beyond the above-mentioned distance_matching) @amit-sharma (I see you mentioned propensity-based)?

Jan 20 '23 18:01 AlxndrMlk

yes, currently distance-matching and the three propensity based estimators check for binary treatment. Would make sense to modify the behavior for all such methods that do a binary treatment check.

Jan 23 '23 04:01 amit-sharma

@amit-sharma, clear, I am back now and I'll start working on this next week.

Feb 09 '23 12:02 AlxndrMlk

@amit-sharma, I updated most of the estimators. Do we have a document with testing guidelines for DoWhy somewhere?

Feb 24 '23 19:02 AlxndrMlk

This guide has some pointers to lint/code formatting checkers in PR checklist: https://github.com/py-why/dowhy/blob/main/docs/source/contributing/contributing-code.rst Other than that, I would suggest simply running all the tests. Since this change affects binary versus float treatment, you may want to add a test that includes a float treatment and pass it to one of the propensity methods

Feb 27 '23 03:02 amit-sharma

Thank you @amit-sharma

Most code changes are ready. As you suggested - I'll add new test cases. I'll be able to get back to this ~2nd half of April.

Will keep you posted.

Mar 10 '23 13:03 AlxndrMlk

Hi @amit-sharma

I need to reschedule this contribution to mid/late May.

I'll keep you posted.

Apr 28 '23 06:04 AlxndrMlk

Hi @amit-sharma

I have a question regarding the linters and formatting.

The contribution guide recommends using flake8, black and isort to keep the code standards.

When I run them (according to the instructions; poetry run poe lint, etc.), I get a lot of complains from the linter and black regarding the code from the current master.

The contribution guide recommends fixing these issues automatically.

I am concerned that it will introduce many changes to the current code that might be difficult to review as a part of a pull request.

What are your thoughts on this?

Jul 08 '23 18:07 AlxndrMlk

Hey @AlxndrMlk that's okay. for a lot of the earlier code, the auto linters were not activated.

So we've taken a policy of updating them as needed in relevant PRs. No worries on the additional files added to the PR--since they are just formatting changes they should be easy to review.

Jul 12 '23 12:07 amit-sharma

Thanks for clarifying, @amit-sharma

I understand from what you said, that I should run all the formatting before committing. Is that correct?

Jul 12 '23 17:07 AlxndrMlk

yes, all the formatting on the files that your PR touches.

Jul 13 '23 06:07 amit-sharma

Thanks for clarifying, @amit-sharma, I'll continue this way.

Jul 13 '23 13:07 AlxndrMlk

@amit-sharma, created a PR here: https://github.com/py-why/dowhy/pull/987

Jul 21 '23 14:07 AlxndrMlk

dowhy dowhy copied to clipboard

`distance_matching` estimator raises an exception when binary treatment encoded using `int`s

Additional context

dowhy
dowhy copied to clipboard