dowhy
Unable to estimate causal effect with intermediary variable?
I am having some trouble understanding the errors.
Is it not supposed to be possible to estimate the causal effect in a graph like this,
where the treatment is 'error_code' and the outcome is 'days_on_grace'?
Here is what I tried:
import pandas as pd
M = pd.DataFrame(
    {"error_code": [601, 501, 500, 400, 100],
     'grace_period_length': [2, 5, 1, 4, 20],
     'days_on_grace': [1, 4, 0, 3, 19]})
import networkx as nx
G = nx.DiGraph()
for n in list(pd.DataFrame(M[['error_code', 'grace_period_length', 'days_on_grace']])):
    G.add_node(n)
# Now add 'causes'
G.add_edge('error_code', 'grace_period_length')
G.add_edge('grace_period_length', 'days_on_grace')
gml = list(nx.generate_gml(G))
import dowhy
from dowhy.do_why import CausalModel
# Use graph
treatment = ['error_code']
outcomes = ['days_on_grace']
model = CausalModel(pd.DataFrame(M[['grace_period_length', 'error_code', 'days_on_grace']]),
                    treatment,
                    outcomes,
                    graph="".join(gml))
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
identify_effect seems to always throw an error if the treatment does not have a direct edge to the outcome. Why is this?
Error
KeyError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/networkx/classes/digraph.py in remove_edge(self, u, v)
732 try:
--> 733 del self._succ[u][v]
734 del self._pred[v][u]
KeyError: 'days_on_grace'
During handling of the above exception, another exception occurred:
NetworkXError Traceback (most recent call last)
<ipython-input-98-5d361b5e14a2> in <module>
----> 1 identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
2
3 print(identified_estimand)
/usr/lib/python3.6/dist-packages/dowhy/do_why.py in identify_effect(self, proceed_when_unidentifiable)
120 self._estimand_type,
121 proceed_when_unidentifiable=proceed_when_unidentifiable)
--> 122 identified_estimand = self.identifier.identify_effect()
123
124 return identified_estimand
/usr/lib/python3.6/dist-packages/dowhy/causal_identifier.py in identify_effect(self)
22 estimands_dict = {}
23 causes_t = self._graph.get_causes(self.treatment_name)
---> 24 causes_y = self._graph.get_causes(self.outcome_name, remove_edges={'sources':self.treatment_name, 'targets':self.outcome_name})
25 common_causes = list(causes_t.intersection(causes_y))
26 self.logger.info("Common causes of treatment and outcome:" + str(common_causes))
/usr/lib/python3.6/dist-packages/dowhy/causal_graph.py in get_causes(self, nodes, remove_edges)
164 for s in sources:
165 for t in targets:
--> 166 new_graph.remove_edge(s, t)
167 causes = set()
168 for v in nodes:
/usr/local/lib/python3.6/dist-packages/networkx/classes/digraph.py in remove_edge(self, u, v)
734 del self._pred[v][u]
735 except KeyError:
--> 736 raise NetworkXError("The edge %s-%s not in graph." % (u, v))
737
738 def remove_edges_from(self, ebunch):
NetworkXError: The edge error_code-days_on_grace not in graph.
I am sorry if this is the wrong forum to ask this question.
Hey @JonasRSV, thanks for bringing up this example. This kind of indirect-effect graph is more commonly used for estimating causal mediation effects. Since DoWhy currently does not support mediation effects, the code simply assumes that a direct edge from treatment to outcome exists.
I can answer better if you don't mind providing more details about the goal of your analysis. Can you clarify the effect that you are trying to estimate? From the description, I understand that you want to estimate the effect of error_code on days_on_grace, but in the current graph there are no observed common causes (confounders), and thus it translates to a problem with a cause, an outcome and no confounders. Is that the correct interpretation?
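The failure in the traceback can be reproduced with networkx alone: `remove_edge` raises `NetworkXError` when the edge is absent, which is exactly what happens for a graph with only an indirect path. The sketch below also shows one possible guard using `has_edge`; the helper name `causes_without_direct_edge` is hypothetical and this is not DoWhy's actual code or fix.

```python
import networkx as nx

# Same structure as the graph in the question: treatment -> mediator -> outcome.
G = nx.DiGraph()
G.add_edge("error_code", "grace_period_length")
G.add_edge("grace_period_length", "days_on_grace")

treatment, outcome = "error_code", "days_on_grace"

# Unguarded removal fails because there is no direct treatment -> outcome edge.
try:
    G.copy().remove_edge(treatment, outcome)
    unguarded_error = None
except nx.NetworkXError as exc:
    unguarded_error = str(exc)


def causes_without_direct_edge(graph, treatment, outcome):
    """Hypothetical helper: ancestors of the outcome after dropping the
    direct treatment -> outcome edge, if that edge exists at all."""
    pruned = graph.copy()
    if pruned.has_edge(treatment, outcome):
        pruned.remove_edge(treatment, outcome)
    return nx.ancestors(pruned, outcome)


print(unguarded_error)
print(sorted(causes_without_direct_edge(G, treatment, outcome)))
```

With the guard in place, the ancestor lookup works whether or not the direct edge is present in the graph.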
Yes, a mediation effect is what I was looking for. This was just an example.
I am looking forward to that feature!
Hey, is this implemented now? Can I do mediation analyses using dowhy?
To clarify, I have an edge between treatment and outcome as well as a mediator variable, so I am able to draw the graph.
Not yet, @sangyh. Can you share your causal graph and a motivating example of the effect that you want to calculate? I can work on adding it.
I am also interested in the mediation analysis. In Pearl's book, my understanding is that mediation can be addressed by choosing whether to control for the mediator or not. I have the current DAG. Any thoughts on how to develop the mediation myself for the estimation problem?
@samou1 Are you looking to calculate the effect of LCD on T2D? Here's a way to do it.
- The direct effect of LCD (changing from LCD1 to LCD2) on T2D is given by
(E[T2D | LCD2, BMI, G, A] - E[T2D | LCD1, BMI, G, A]) P(BMI | LCD1, G, A) P(G, A)
where G is gender and A is age, and the above formula is for a specific value of (BMI, G, A). To find the average direct effect, sum the above formula over all values of (BMI, G, A).
- The total effect of LCD on T2D is estimated by conditioning on Gender and Age (backdoor identification, TE(LCD)).
- The direct effect of BMI on T2D is estimated by conditioning on Gender, Age and LCD (backdoor identification, DE(BMI)).
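As a rough illustration of the first bullet, the direct-effect sum can be computed by plug-in estimation when all variables are discrete. This is a minimal sketch, not DoWhy code: the function name `natural_direct_effect`, the tuple layout (t, m, w, y) and the toy data are assumptions, with t standing in for LCD, m for BMI, w for the covariates (G, A) and y for T2D.

```python
from collections import defaultdict


def natural_direct_effect(rows, t0, t1):
    """Plug-in estimate of the direct-effect formula for discrete data.

    rows: iterable of (t, m, w, y) tuples, where t is the treatment,
    m the mediator, w the adjustment covariates (hashable), y the outcome.
    """
    rows = list(rows)
    n = len(rows)
    y_sum = defaultdict(float)  # (t, m, w) -> sum of outcomes
    cell_n = defaultdict(int)   # (t, m, w) -> number of rows
    tw_n = defaultdict(int)     # (t, w)    -> number of rows
    w_n = defaultdict(int)      # w         -> number of rows
    for t, m, w, y in rows:
        y_sum[(t, m, w)] += y
        cell_n[(t, m, w)] += 1
        tw_n[(t, w)] += 1
        w_n[w] += 1

    nde = 0.0
    for (t, m, w), count in list(cell_n.items()):
        if t != t0:
            continue  # P(m | t0, w) is nonzero only for cells observed under t0
        n_t1 = cell_n.get((t1, m, w), 0)
        if n_t1 == 0:
            raise ValueError(f"no rows with t={t1}, m={m}, w={w}")
        mean_t1 = y_sum[(t1, m, w)] / n_t1       # E[y | t1, m, w]
        mean_t0 = y_sum[(t0, m, w)] / count      # E[y | t0, m, w]
        p_m_given_t0_w = count / tw_n[(t0, w)]   # P(m | t0, w)
        p_w = w_n[w] / n                         # P(w)
        nde += (mean_t1 - mean_t0) * p_m_given_t0_w * p_w
    return nde


# Hypothetical toy data: y = 2*t + 3*m + w, so the direct effect of t is 2.
counts = {(0, 0, 0): 3, (0, 1, 0): 1, (1, 0, 0): 1, (1, 1, 0): 3,
          (0, 0, 1): 2, (0, 1, 1): 2, (1, 0, 1): 1, (1, 1, 1): 4}
rows = [(t, m, w, 2 * t + 3 * m + w)
        for (t, m, w), k in counts.items() for _ in range(k)]
nde = natural_direct_effect(rows, t0=0, t1=1)
print(nde)
```

The sketch raises an error when a (mediator, covariate) cell seen under t0 has no rows under t1, since the formula cannot be evaluated there without further modeling.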
@sangyh @samou1 @JonasRSV Mediation effects are now supported in DoWhy! Do try it out and share your feedback. Here's a full example notebook.
Summary
There are two new estimand types in identify_effect:
- nonparametric-nde: This is the natural direct effect of treatment on outcome (T->Y).
- nonparametric-nie: This is the natural indirect effect, mediated by another variable (T->M->Y).
For estimation, the implemented estimator is simple: it is a two-stage linear regression estimator. But the API is general: you can specify a first_stage_model and a second_stage_model. I will be adding a non-linear estimator soon. Here's a code sample.
For the direct effect of treatment on outcome
# Natural direct effect (nde)
identified_estimand_nde = model.identify_effect(estimand_type="nonparametric-nde",
                                                proceed_when_unidentifiable=True)
print(identified_estimand_nde)
import dowhy.causal_estimators.linear_regression_estimator
causal_estimate_nde = model.estimate_effect(identified_estimand_nde,
    method_name="mediation.two_stage_regression",
    confidence_intervals=False,
    test_significance=False,
    method_params = {
        'first_stage_model': dowhy.causal_estimators.linear_regression_estimator.LinearRegressionEstimator,
        'second_stage_model': dowhy.causal_estimators.linear_regression_estimator.LinearRegressionEstimator
    }
)
print(causal_estimate_nde)
For the indirect effect of treatment on outcome
# Natural indirect effect (nie)
identified_estimand_nie = model.identify_effect(estimand_type="nonparametric-nie",
                                                proceed_when_unidentifiable=True)
print(identified_estimand_nie)
causal_estimate_nie = model.estimate_effect(identified_estimand_nie,
    method_name="mediation.two_stage_regression",
    confidence_intervals=False,
    test_significance=False,
    method_params = {
        'first_stage_model': dowhy.causal_estimators.linear_regression_estimator.LinearRegressionEstimator,
        'second_stage_model': dowhy.causal_estimators.linear_regression_estimator.LinearRegressionEstimator
    }
)
print(causal_estimate_nie)
The frontdoor criterion is also supported through the same two stage estimator. To use frontdoor, write:
import dowhy.causal_estimators.linear_regression_estimator
causal_estimate = model.estimate_effect(identified_estimand,
    method_name="frontdoor.two_stage_regression",
    confidence_intervals=False,
    test_significance=False,
    method_params = {
        'first_stage_model': dowhy.causal_estimators.linear_regression_estimator.LinearRegressionEstimator,
        'second_stage_model': dowhy.causal_estimators.linear_regression_estimator.LinearRegressionEstimator
    }
)
print(causal_estimate)
For a full code example, you can check out the notebook on mediation effects with DoWhy: https://github.com/microsoft/dowhy/blob/master/docs/source/example_notebooks/dowhy_mediation_analysis.ipynb
Hi Amit, thanks for the update and for implementing this. To clarify, this is the Baron and Kenny approach to mediation and not Pearl's approach? In that case, I presume I would need some tests for linearity.
@sangyh yes, the estimator implements the Baron and Kenny approach. However the modeling and identification steps before it are done using Pearl's approach. So given a causal graph with mediation (and other confounders), DoWhy can find out the right variables to include in the regression formula.
I also plan to add the non-parametric estimator based on Pearl's identification results. That should be implemented in the coming weeks. The linear case was the simplest to implement, so I started with that.
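For readers unfamiliar with the Baron and Kenny style of estimation, here is a minimal pure-Python sketch of the idea for one treatment, one mediator and no confounders. It is not DoWhy's implementation: the function name `two_stage_mediation` and the toy data are assumptions, and with confounders present they would have to be added as regressors in both stages.

```python
def _centered(xs):
    mean = sum(xs) / len(xs)
    return [x - mean for x in xs]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def two_stage_mediation(t, m, y):
    """Return (nde, nie) from two least-squares fits with intercepts."""
    tc, mc, yc = _centered(t), _centered(m), _centered(y)
    s_tt, s_mm, s_tm = _dot(tc, tc), _dot(mc, mc), _dot(tc, mc)
    s_ty, s_my = _dot(tc, yc), _dot(mc, yc)

    # Stage 1: regress the mediator on the treatment, M ~ a*T.
    a = s_tm / s_tt

    # Stage 2: regress the outcome on treatment and mediator, Y ~ c*T + b*M.
    det = s_tt * s_mm - s_tm ** 2  # zero if T and M are perfectly collinear
    c = (s_mm * s_ty - s_tm * s_my) / det
    b = (s_tt * s_my - s_tm * s_ty) / det

    # Under linearity: direct effect is c, indirect effect is a*b.
    return c, a * b


# Hypothetical noise-free data: M = 2*T + e, Y = 1.5*T + 0.5*M,
# where e is chosen to be uncorrelated with T.
T = [0.0, 1.0, 2.0, 3.0]
M = [1.0, 1.0, 3.0, 7.0]
Y = [1.5 * t + 0.5 * m for t, m in zip(T, M)]
nde, nie = two_stage_mediation(T, M, Y)
print(nde, nie)
```

The product a*b only has a causal reading as an indirect effect when both stages are linear and correctly specified, which is why the linearity question raised above matters.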
Thanks Amit. I realized I have a confounder causing both the mediator and the outcome, so I am afraid the BK approach will not work. I will try implementing Pearl's approach if you haven't already implemented it in DoWhy. In your comment to @samou1, what is 'each value of (BMI, G, A)' when all 3 are continuous variables?
When all three are continuous variables, then the sum for each value of (BMI, G, A) becomes an integration over the same variables, weighted by the probability P(BMI, G, A). If integration is numerically difficult, you can discretize the variables to reasonable buckets and then try.
Unfortunately it may take a few weeks before the Pearlian non-parametric estimator is implemented. Do let me know how your implementation goes for this estimator, @sangyh.
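The discretization suggested above could look like the following sketch, which maps a continuous variable to equal-width buckets so that the sum over (BMI, G, A) can be taken over bucket indices. The function name `equal_width_bins`, the number of bins and the sample BMI values are assumptions for illustration; quantile-based buckets would be an equally reasonable choice.

```python
def equal_width_bins(values, n_bins):
    """Map each value to a bin index in [0, n_bins - 1] using equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: everything falls in a single bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical BMI values bucketed into 4 bins.
bmi = [18.0, 21.0, 24.0, 27.0, 30.0]
bmi_bins = equal_width_bins(bmi, 4)
print(bmi_bins)
```

Coarser buckets give more rows per cell but a cruder approximation of the integral, so the bin count is a trade-off to tune against sample size.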
Hi all!
I am struggling to identify the correct estimand when using multiple mediators:
I am using Gender_Male as a treatment and Hourly_Salary as an outcome. And I am interested in the natural direct vs. natural indirect effects.
When running: model.identify_effect(estimand_type="nonparametric-nde"), I only get the estimand for ONE mediator, which seems to be randomly selected:
Can someone explain this behavior? Can dowhy not handle multiple mediators? Thank you very much in advance!