dowhy
dowhy copied to clipboard
`distance_matching` estimator raises an exception when binary treatment encoded using `int`s
Describe the bug
When using distance_matching
estimator with binary treatment encoded as int
type (1
s and 0
s rather than True
and False
), DoWhy throws the following exception:
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
Input In [418], in <cell line: 2>()
1 # Get estimate (Linear Regression)
----> 2 estimate = model.estimate_effect(
3 identified_estimand=estimand,
4 method_name="backdoor.distance_matching",
5 target_units="ate",
6 method_params={'distance_metric': "minkowski", 'p':2})
File ~\anaconda3\envs\causal_book_py38\lib\site-packages\dowhy\causal_model.py:297, in CausalModel.estimate_effect(self, identified_estimand, method_name, control_value, treatment_value, test_significance, evaluate_effect_strength, confidence_intervals, target_units, effect_modifiers, fit_estimator, method_params)
295 if method_params is None:
296 method_params = {}
--> 297 self.causal_estimator = causal_estimator_class(
298 self._data,
299 identified_estimand,
300 self._treatment, self._outcome, #names of treatment and outcome
301 control_value = control_value,
302 treatment_value = treatment_value,
303 test_significance=test_significance,
304 evaluate_effect_strength=evaluate_effect_strength,
305 confidence_intervals = confidence_intervals,
306 target_units = target_units,
307 effect_modifiers = effect_modifiers,
308 **method_params,
309 **extra_args)
310 else:
311 # Estimator had been computed in a previous call
312 assert self.causal_estimator is not None
File ~\anaconda3\envs\causal_book_py38\lib\site-packages\dowhy\causal_estimators\distance_matching_estimator.py:45, in DistanceMatchingEstimator.__init__(self, num_matches_per_unit, distance_metric, exact_match_cols, *args, **kwargs)
43 error_msg = "Distance Matching method is applicable only for binary treatments"
44 self.logger.error(error_msg)
---> 45 raise Exception(error_msg)
47 self.num_matches_per_unit = num_matches_per_unit
48 self.distance_metric = distance_metric
Exception: Distance Matching method is applicable only for binary treatments
Note that when using the same dataset with a binary treatment encoded as bool
no error occurs.
Steps to reproduce the behavior
form scipy import stats
import numpy as np
import pandas as pd
import dowhy
from dowhy import CausalModel
# Generate the data
SAMPLE_SIZE = 10_000
MAX_AGE = 50
age = stats.halfnorm.rvs(loc=19, scale=10, size=SAMPLE_SIZE).astype(int)
age = np.where(age > MAX_AGE, np.random.choice(np.arange(20, MAX_AGE)), age)
took_a_course = stats.bernoulli(p=10/age).rvs()#.astype(bool)
earnings = 75000 + took_a_course * 10000 + age * 1000 + age**2 * 50 + np.random.randn(SAMPLE_SIZE) * 2000
earnings = earnings.round()
# Construct the graph (the graph is constant for all iterations)
nodes = ['took_a_course', 'earnings', 'age']
edges = [
('took_a_course', 'earnings'),
('age', 'took_a_course'),
('age', 'earnings')
]
# Generate the GML graph
gml_string = 'graph [directed 1\n'
for node in nodes:
gml_string += f'\tnode [id "{node}" label "{node}"]\n'
for edge in edges:
gml_string += f'\tedge [source "{edge[0]}" target "{edge[1]}"]\n'
gml_string += ']'
data = pd.DataFrame(dict(
age=age,
took_a_course=took_a_course,
earnings=earnings
))
# Instantiate the CausalModel
model = CausalModel(
data=data,
treatment='took_a_course',
outcome='earnings',
graph=gml_string
)
# Get the estimand
estimand = model.identify_effect()
# Get estimate (Linear Regression)
estimate = model.estimate_effect(
identified_estimand=estimand,
method_name="backdoor.distance_matching",
target_units="ate",
method_params={'distance_metric': "minkowski", 'p':2})
Expected behavior
It should not matter what encoding we use for binary treatment or it should be explicitly specified that only bool
-typed values are accepted and error message should reflect that.
Version information:
- DoWhy version 0.8
Additional context
@AlxndrMlk this is intended behavior. To avoid the tricky task of auto-detecting the type of a variable, we are relying on the pandas dataframe to provide us the correct data type. That is, we expect the user to provide the correct data type. The problem is that sometimes an int variable may have more than two values in the data, and that may raise unknown errors in the estimation.
Great point on having a more informative error message. Would you like to add a PR for that and extend this error message?
@amit-sharma, I see.
How about adding a simple check if the unique values are 0
and 1
? It seems to be a very common way to encode binary treatments in the industry. Would that make sense from your point of view?
I am happy to add a PR, yet I could start working on it Jan 2023. Would that work?
We can consider adding a simple check, but the concern was computational efficiency. When I checked earlier, other libraries tended to loop through the entire data (O(N)) or check until a fixed number of rows (E.g., 10000). I felt it was more transparent expecting the user to specify the type, but I am open to adding functionality that automatically infers it.
PR in early Jan sounds good!
PR in early Jan sounds good!
Great, adding to the calendar and will keep you posted.
We can consider adding a simple check, but the concern was computational efficiency.
Oh, I see, that's a valid concern. If you don't mind, I am happy to look into it and discuss this with you further if we find a sensible solution.
Hi @amit-sharma
I am going back to this topic now.
I want to give one more argument for us to consider before we start working on this.
When using EconML estimators with DoWhy, it is possible to encode treatment as float
/ int
If we restrict DoWhy methods to only bool
this leads to the inconsistency between behaviors of DoWhy vs EconML estimators.
Why this can be problematic?
Imagine we want to create a benchmark for our research paper. We started working with EconML estimators and now we want to add the matching estimator. We need to re-code the data, which might be error prone, especially if the project is already advanced.
Another example - a production system. We want to allow users to use different types of estimators, including the ones that allow for multi-level treatments. We'd naturally encode treatment levels as int
s, but this would require special logic for matching (and possibly other estimators with the same limitation).
As I understand, from your perspective the main concern is that
an int variable may have more than two values in the data
My proposal is that we add explicit checks to make sure that if the treatment is int
/ float
it contains only 0
and 1
(possibly we could also raise a warning when there are only 0
s or only 1
s). I believe that this gives the user more consistency and more flexibility, translating into a smoother user experience.
What are your thoughts on this?
@AlxndrMlk , thanks for coming back to this.
Is this a change for DoWhy's built-in estimators only? Or is it a change to the DoWhy scaffolding around all estimators (whether built-in or from EconML)?
Do you think this should be part of the estimation step, or an explicit preprocessing step that happens before the dataframe is passed into the estimation / fit steps?
I am wondering if this binary example is the only time we are going to run into a similar problem? Are there other things we would want to be checking as long as we are scanning through all the data? (NaNs, nulls, a column that has all the same value....)
@AlxndrMlk thanks for the thoughtful reply. Agree that we should be consistent with EconML as far as possible.
My proposal is that we add explicit checks to make sure that if the treatment is int / float it contains only 0 and 1 (possibly we could also raise a warning when there are only 0s or only 1s). I believe that this gives the user more consistency and more flexibility, translating into a smoother user experience.
Sounds good. So if a user provides bool, then that's great. But if they provide int/float, then we check the values as you write above. If they provide any other values apart from 0/1 (e.g., -1/1), then we raise an error. I think this is reasonable.
@emrekiciman this will be a change to dowhy's builtin estimators (e.g., propensity based) that raise an error if the treatment is not binary type. Currently, we have an explicit raise Error that precludes someone from passing in an int, and @AlxndrMlk is suggesting to remove this super-strict check.
Thanks for clarifying. My first read was that this was a broader data preprocessing / validation problem. This targeted change sounds good to me.
Thanks for the questions @emrekiciman, thanks for clarifying @amit-sharma
Sounds good! Do we expect to experience similar scenario with other DoWhy estimators (beyond the above-mentioned distance_matching
) @amit-sharma (I see you mentioned propensity-based)?
yes, currently distance-matching and the three propensity based estimators check for binary treatment. Would make sense to modify the behavior for all such methods that do a binary treatment check.
@amit-sharma, clear, I am back now and I'll start working on this next week.
@amit-sharma, I updated most of the estimators. Do we have a document with testing guidelines for DoWhy somewhere?
This guide has some pointers to lint/code formatting checkers in PR checklist: https://github.com/py-why/dowhy/blob/main/docs/source/contributing/contributing-code.rst Other than that, I would suggest simply running all the tests. Since this change affects binary versus float treatment, you may want to add a test that includes a float treatment and pass it to one of the propensity methods
Thank you @amit-sharma
Most code changes are ready. As you suggested - I'll add new test cases. I'll be able to get back to this ~2nd half of April.
Will keep you posted.
Hi @amit-sharma
I need to reschedule this contribution to mid/late May.
I'll keep you posted.
Hi @amit-sharma
I have a question regarding the linters and formatting.
The contribution guide recommends using flake8
, black
and isort
to keep the code standards.
When I run them (according to the instructions; poetry run poe lint
, etc.), I get a lot of complains from the linter and black regarding the code from the current master.
The contribution guide recommends fixing these issues automatically.
I am concerned that it will introduce many changes to the current code that might be difficult to review as a part of a pull request.
What are your thoughts on this?
Hey @AlxndrMlk that's okay. for a lot of the earlier code, the auto linters were not activated.
So we've taken a policy of updating them as needed in relevant PRs. No worries on the additional files added to the PR--since they are just formatting changes they should be easy to review.
Thanks for clarifying, @amit-sharma
I understand from what you said, that I should run all the formatting before committing. Is that correct?
yes, all the formatting on the files that your PR touches.
Thanks for clarifying, @amit-sharma, I'll continue this way.
@amit-sharma, created a PR here: https://github.com/py-why/dowhy/pull/987