
Non-linear causal relationships

Open jmafoster1 opened this issue 4 years ago • 3 comments

I have been using DoWhy with some data for which the relationship between cause and effect is not a straight line. My treatment variable is called L1 and I have one effect modifier, g. There are no confounders here, and my identified estimand is L1+L1*g. Using backdoor.linear_regression as the estimator gives the following coefficients:

      const    17.495757
      x1       -0.004714
      x2       -0.046194

I think this corresponds to 17.495757 + (-0.004714 * L1) + (-0.046194 * L1 * g), but this is a straight line, as shown in the following graph. The scatter plot shows the actual data (L1 on x, outcome on y), and the red line represents the estimator from DoWhy. Strictly speaking it's a surface over L1 and g, I guess, but it is a straight line in L1 for any fixed g.

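[graph: scatter plot of L1 (x-axis) against the outcome (y-axis), with the DoWhy linear fit drawn as a red line]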
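To make the reading of the coefficients concrete, here is a minimal sketch that evaluates the fitted expression quoted above. It assumes, as the post does, that x1 is the coefficient on L1 and x2 the coefficient on L1*g; the helper names and the value of g are made up for illustration. For a fixed g, each unit step in L1 changes the prediction by the same amount, which is why the fit cannot bend.

```python
# Coefficients copied from the regression output quoted in the post.
# Assumption (the poster's own reading): x1 multiplies L1, x2 multiplies L1*g.
CONST, B_L1, B_L1_G = 17.495757, -0.004714, -0.046194


def predict(l1, g):
    """Outcome predicted by the linear model with an L1*g interaction term."""
    return CONST + B_L1 * l1 + B_L1_G * l1 * g


def slope_in_l1(g):
    """Change in the prediction per unit of L1 when g is held fixed."""
    return B_L1 + B_L1_G * g


g = 2.0  # hypothetical value of the effect modifier, for illustration only

# The one-unit step is identical wherever we start on the L1 axis.
steps = [predict(l1 + 1, g) - predict(l1, g) for l1 in (1, 10, 100)]
print(steps, slope_in_l1(g))
```

The slope does vary with g, so the effect modifier is being used; it is only the dependence on L1 itself that is forced to be linear.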

I appreciate that backdoor.linear_regression is probably not the best estimator to use for this, as the actual relationship is in sqrt(L1). However, the more advanced estimators from EconML don't really do any better, whereas I can get a reasonably good fit in either L1^3 or L1^4 using scikit-learn. Is it possible to use DoWhy for such non-linear relationships, and if so, how?

jmafoster1, Oct 04 '21 10:10

That's an interesting question. A simple answer is to pre-process your variables. E.g., you can transform L1 to sqrt(L1) or L1^2 before passing it to DoWhy.
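As a rough illustration of the pre-processing idea, the sketch below uses plain NumPy least squares on synthetic data; it does not call DoWhy. The data-generating process (outcome = 20 - 3*sqrt(L1) plus noise) and the helper `fit_line` are assumptions invented for this example, not the data from this issue. It compares a straight-line fit on raw L1 with one on the transformed column sqrt(L1).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic, hypothetical data: the outcome depends on sqrt(L1), not on L1.
L1 = rng.uniform(1.0, 100.0, size=n)
y = 20.0 - 3.0 * np.sqrt(L1) + rng.normal(0.0, 0.1, size=n)


def fit_line(x, y):
    """Ordinary least squares for y ~ a + b*x; returns (coefficients, RMSE)."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return coef, float(np.sqrt(np.mean(resid ** 2)))


# Straight line in the raw treatment versus in the transformed treatment.
coef_raw, rmse_raw = fit_line(L1, y)
coef_sqrt, rmse_sqrt = fit_line(np.sqrt(L1), y)

print("raw L1:   coef", coef_raw, "rmse", rmse_raw)
print("sqrt(L1): coef", coef_sqrt, "rmse", rmse_sqrt)
```

With a real dataset, the analogous step would be to add the transformed column to the DataFrame before handing it to DoWhy, so that a linear estimator works on a scale where the relationship is closer to linear.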

However, non-linear methods should be able to do this automatically. I'm surprised that the EconML methods did not work. Can you share which ones you tried? If you can share some sample data/generation code, I can try to reproduce this.

amit-sharma, Oct 20 '21 09:10

Thanks for getting back to me on this. For the simple solution, would I need to include sqrt(L1) etc. as an additional treatment (and variable in the causal model) to get it to pick it up in the estimand? This feels a little uncomfortable since sqrt(L1) is, intuitively, not a separate treatment.

For EconML, I tried both backdoor.econml.dml.DML and backdoor.econml.dml.CausalForestDML, both of which produce effect estimates that are wrong by many orders of magnitude. I'm expecting the estimate for control value 1 and treatment value 2 to be around -8, but both give me estimates of order 10^12! I also tried backdoor.generalized_linear_model with all of the available families. Most produce estimates of the correct order of magnitude, but are still way off (relatively speaking).
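For reference, an effect estimate for a given control and treatment value is a contrast between two predicted outcomes. The sketch below shows that contrast for a model that is linear in sqrt(L1), again on synthetic data (outcome = 20 - 3*sqrt(L1) plus noise, an assumption made up for this example, so the numbers do not correspond to the -8 expected above). The function `contrast` is a hypothetical helper, not a DoWhy or EconML call.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic, hypothetical data with a square-root relationship.
L1 = rng.uniform(1.0, 100.0, size=1000)
y = 20.0 - 3.0 * np.sqrt(L1) + rng.normal(0.0, 0.1, size=1000)

# Fit y ~ a + b*sqrt(L1) by ordinary least squares.
X = np.column_stack([np.ones_like(L1), np.sqrt(L1)])
(a, b), *_ = np.linalg.lstsq(X, y, rcond=None)


def contrast(control, treatment):
    """Predicted outcome at `treatment` minus predicted outcome at `control`."""
    return b * (np.sqrt(treatment) - np.sqrt(control))


# With a non-linear relationship, a one-unit change has a different effect
# depending on where it starts.
print("1 -> 2:  ", contrast(1.0, 2.0))
print("50 -> 51:", contrast(50.0, 51.0))
```

Because the contrast depends on the baseline, the choice of control and treatment values matters much more here than it does for a purely linear model.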

Here are the CSV file, the causal DAG, and a minimal Python script to calculate the estimates. I had to change the extensions of the latter two so that they could be uploaded: rct_data.csv, abstract.txt (change extension back to .dot), estimates-mwe.txt (change extension back to .py).

jmafoster1, Oct 20 '21 13:10

Thanks. Since this seems to be connected to the EconML methods, I'm also tagging @moprescu from the EconML team, who may be able to help.

amit-sharma, Oct 21 '21 06:10