Pgmpy does not work with unobserved confounders
### Subject of the issue
There is no way to get pgmpy working with unobserved confounders to calculate the ATE correctly.
### Your environment
- pgmpy version 0.1.19
- Python 3.8
- Windows 10
### Steps to reproduce
```python
import pandas as pd
import numpy as np
from pgmpy.models import BayesianNetwork
import pgmpy.inference

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

RANDOM_SEED = 1
sample_size = 1000

np.random.seed(RANDOM_SEED)
U = np.random.randn(sample_size)  # Unobserved confounder
X = 0.3*U + np.random.randn(sample_size)
Z = 0.78*X + 0.3*np.random.randn(sample_size)
Y = 0.45*U + 0.87*Z + 0.4*np.random.randn(sample_size)

data = pd.DataFrame({'X': X, 'Z': Z, 'Y': Y})

network = BayesianNetwork([("X", "Z"), ("Z", "Y"), ("U", "X"), ("U", "Y")])
inference = pgmpy.inference.CausalInference(model=network)

print(inference.get_all_backdoor_adjustment_sets("X", "Y"))
print(inference.get_all_frontdoor_adjustment_sets("X", "Y"))

print(f"Actual effect: {0.78 * 0.87} (the effect of X on Z multiplied by the effect of Z on Y)")

inference = pgmpy.inference.CausalInference(model=network)
ate = inference.estimate_ate(X="X", Y="Y", data=data, estimator_type="linear")
print(ate)
```
### Expected behaviour
The calculated ATE should be close to 0.6786 (= 0.78 × 0.87).
### Actual behaviour
`KeyError: "['U'] not in index"`
It is possible to specify `latents=["U"]` when instantiating the `BayesianNetwork`, or to remove the edges to "U" altogether. The result is exactly the same either way: the ATE is calculated as 0.8047575187358673, which is not correct, and the frozenset for "U" goes undetected.
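As a sanity check, that ~0.80 figure matches what a naive regression of Y on X produces when U is ignored entirely, which suggests the confounder is simply being dropped. A minimal NumPy sketch, assuming the same data-generating process as the example (sample size raised to 10,000 here, my own choice, to tighten the estimate):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
U = rng.standard_normal(n)                       # unobserved confounder
X = 0.3 * U + rng.standard_normal(n)
Z = 0.78 * X + 0.3 * rng.standard_normal(n)
Y = 0.45 * U + 0.87 * Z + 0.4 * rng.standard_normal(n)

# Naive estimate: simple regression of Y on X, i.e. Cov(Y, X) / Var(X).
# This leaves the backdoor path X <- U -> Y open, biasing the slope upward.
naive_ate = np.cov(Y, X)[0, 1] / np.var(X, ddof=1)
print(naive_ate)  # ~0.80, close to the value estimate_ate reports
```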
Also worth noting: this works in DoWhy. In DoWhy you can specify a network with unobserved confounders and leave them out of the data; the algorithm will then correctly de-confound and calculate the right result.
This is annoying because I much prefer pgmpy to DoWhy, but I can find no way to calculate the ATE correctly where unobserved confounders are involved.
If you could show me how to solve this I would be very grateful indeed!
Many thanks
Graham
@grahamharrison68 Thanks for reporting this. I had a quick look and it seems that the `estimate_ate` method currently works only if there is a backdoor adjustment set, which is not the case in your example. I will see if I can do a quick fix for this.
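In the meantime, since the graph has the front-door set {Z}, in the linear-Gaussian case the effect can be recovered by hand: regress Z on X (the X→Z edge has no open backdoor), regress Y on Z adjusting for X (which blocks Z←X←U→Y), and multiply the two coefficients. A sketch of this workaround, assuming the same data-generating process (with a larger sample of 10,000, my own choice, for a tighter estimate):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
U = rng.standard_normal(n)                       # unobserved confounder
X = 0.3 * U + rng.standard_normal(n)
Z = 0.78 * X + 0.3 * rng.standard_normal(n)
Y = 0.45 * U + 0.87 * Z + 0.4 * rng.standard_normal(n)

# Stage 1: effect of X on Z -- unconfounded, so plain OLS works.
a = np.polyfit(X, Z, 1)[0]                       # slope, ~0.78

# Stage 2: effect of Z on Y, adjusting for X to block Z <- X <- U -> Y.
design = np.column_stack([Z, X, np.ones(n)])
b = np.linalg.lstsq(design, Y, rcond=None)[0][0]  # Z coefficient, ~0.87

front_door_ate = a * b
print(front_door_ate)  # ~0.6786 = 0.78 * 0.87
```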
Hello @ankurankan,
Thanks for picking this up.
I was wondering if there has been any progress? If I change the network definition as follows:

```python
network = BayesianNetwork([("X", "Z"), ("Z", "Y"), ("U", "X"), ("U", "Y")], latents=["U"])
```

then the code no longer crashes, but the results are still wrong and are exactly the same as if I had defined the network omitting "U" altogether.
I am about to publish an article about pgmpy vs. DoWhy, and there are two areas where pgmpy currently lags behind:
- The unobserved confounders issue reported here.
- DoWhy will work with categorical and continuous variables natively. In pgmpy continuous variables are represented as a separate row in the CPD for every value and categorical variables need to be split up into "bins".
If these two issues could be addressed, pgmpy would be right up there with DoWhy, and in fact it could be out in front as it is much easier to use.
Thanks and my very best regards
Graham
@grahamharrison68 Sorry, this took me a while but should work now. Here's the output from the example you mentioned:
```python
In [2]: from pgmpy.models import BayesianNetwork

In [3]: from pgmpy.inference import CausalInference

In [4]: RANDOM_SEED = 1
   ...: sample_size = 1000
   ...:
   ...: np.random.seed(RANDOM_SEED)
   ...: U = np.random.randn(sample_size) # Unobserved confounder
   ...: X = 0.3*U + np.random.randn(sample_size)
   ...: Z = 0.78*X + 0.3*np.random.randn(sample_size)
   ...: Y = 0.45*U + 0.87*Z + 0.4*np.random.randn(sample_size)
   ...:
   ...: data = pd.DataFrame({'X': X, 'Z': Z, 'Y': Y})

In [5]: network = BayesianNetwork([("X", "Z"), ("Z", "Y"), ("U", "X"), ("U", "Y")], latents=['U'])

In [6]: infer = CausalInference(network)

In [7]: infer.estimate_ate('X', 'Y', data, estimator_type='linear')
Out[7]: 0.6599812388166989
```
Yeah, pgmpy doesn't have full support for continuous variables yet (some structure learning and causal estimation is supported), but I do plan to extend the SEM class (which is essentially a Gaussian BN), which would also allow us to do parameter estimation and probabilistic inference for continuous variables. I also plan to extend some of the instrumental variable methods already implemented for SEMs to work with DAG-based models.
Hello @ankurankan. Thanks for looking at this. I have used pip to upgrade to the latest version of pgmpy (now 0.1.20) and re-run the code above. I still get 0.8047575187358673 rather than the 0.6599812388166989 produced in your example, so I am still getting the wrong answer. Could I ask whether the codebase containing the fix you demonstrated above has been released yet? Thanks and very best regards. Graham
@grahamharrison68 Sorry, I forgot to mention that this is still in the dev branch and should be released in version 0.1.21 (scheduled on 31st Dec). You can install the dev branch in the meanwhile using:
```
pip install git+https://github.com/pgmpy/pgmpy.git@dev
```
Hello @ankurankan, your change has certainly worked for a front-door example, but I still get the `KeyError` when I try an instrumental variable.
Here is the code that reproduces the error:

```python
np.random.seed(RANDOM_SEED)
U = np.random.randn(SAMPLE_SIZE)
Z = np.random.randn(SAMPLE_SIZE)
X = 0.6*Z + 0.43*U + np.random.randn(SAMPLE_SIZE)
Y = 0.45*U + 0.87*X + 0.4*np.random.randn(SAMPLE_SIZE)

data_3 = pd.DataFrame({'X': X, 'Z': Z, 'Y': Y})

dag_3_pgmpy = BayesianNetwork([("Z", "X"), ("X", "Y"), ("U", "X"), ("U", "Y")])

inference = pgmpy.inference.CausalInference(model=dag_3_pgmpy)
ate = inference.estimate_ate(X="X", Y="Y", data=data_3, estimator_type="linear")
print(ate)
```
@grahamharrison68 Yeah, the IV support is a bit limited right now and only works for the SEM class: https://github.com/pgmpy/pgmpy/blob/dev/pgmpy/estimators/SEMEstimator.py#L348. I will try to prioritize this to make it work on all models.
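Until DAG-based IV estimation lands, the effect in this graph can be recovered manually with the standard Wald/IV estimator, since Z is a valid instrument (it affects Y only through X and is independent of U). A NumPy sketch, assuming the same data-generating process (with a sample of 10,000, my own choice, for a tight estimate):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
U = rng.standard_normal(n)                       # unobserved confounder
Z = rng.standard_normal(n)                       # instrument
X = 0.6 * Z + 0.43 * U + rng.standard_normal(n)
Y = 0.45 * U + 0.87 * X + 0.4 * rng.standard_normal(n)

# Wald/IV estimator: Cov(Y, Z) / Cov(X, Z). Because Z is independent of U,
# the confounding in X cancels out of both covariances.
iv_ate = np.cov(Y, Z)[0, 1] / np.cov(X, Z)[0, 1]
print(iv_ate)  # ~0.87, the true effect of X on Y
```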