
Pgmpy does not work with unobserved confounders

Open grahamharrison68 opened this issue 3 years ago • 7 comments

Subject of the issue

There is no way to get pgmpy working with unobserved confounders to calculate the ATE correctly

Your environment

  • pgmpy version 0.1.19
  • Python 3.8
  • Windows 10

Steps to reproduce

import pandas as pd
import numpy as np
from pgmpy.models import BayesianNetwork
import pgmpy.inference

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

RANDOM_SEED = 1
sample_size = 1000

np.random.seed(RANDOM_SEED)
U = np.random.randn(sample_size)  # Unobserved confounder
X = 0.3*U + np.random.randn(sample_size)
Z = 0.78*X + 0.3*np.random.randn(sample_size)
Y = 0.45*U + 0.87*Z + 0.4*np.random.randn(sample_size)

data = pd.DataFrame({'X': X, 'Z': Z, 'Y': Y})

network = BayesianNetwork([("X", "Z"), ("Z", "Y"), ("U", "X"), ("U", "Y")])
inference = pgmpy.inference.CausalInference(model=network)

print(inference.get_all_backdoor_adjustment_sets("X", "Y"))
print(inference.get_all_frontdoor_adjustment_sets("X", "Y"))

print(f"Actual effect: {0.78 * 0.87} (the effect of X on Z multiplied by the effect of Z on Y)")

ate = inference.estimate_ate(X="X", Y="Y", data=data, estimator_type="linear")
print(ate)

Expected behaviour

The calculated ATE should be something close to 0.6786.
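For reference, that expected value can be checked independently of pgmpy. With the front-door set {Z}, the linear front-door estimate is the X→Z coefficient times the Z→Y coefficient controlling for X. A minimal NumPy sketch, using the issue's data-generating process but a larger sample so the estimate is tight:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
U = rng.standard_normal(n)  # Unobserved confounder
X = 0.3 * U + rng.standard_normal(n)
Z = 0.78 * X + 0.3 * rng.standard_normal(n)
Y = 0.45 * U + 0.87 * Z + 0.4 * rng.standard_normal(n)

# Effect of X on Z: the X -> Z edge is unconfounded, so a plain regression works
b_zx = np.polyfit(X, Z, 1)[0]

# Effect of Z on Y controlling for X, which blocks the path Z <- X <- U -> Y
A = np.column_stack([Z, X, np.ones(n)])
b_yz = np.linalg.lstsq(A, Y, rcond=None)[0][0]

ate = b_zx * b_yz
print(ate)  # close to 0.78 * 0.87 = 0.6786
```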

Actual behaviour

KeyError: "['U'] not in index"

It is possible to specify latents=["U"] when instantiating the BayesianNetwork, or to remove the edges involving "U" altogether. The result is exactly the same either way: the ATE is calculated as 0.8047575187358673, which is not correct, and no frozenset involving "U" is detected.

Also worth noting: this works in DoWhy. In DoWhy you can specify a network with unobserved confounders and leave them out of the data; the algorithm will then correctly de-confound and calculate the right result.

This is frustrating because I much prefer pgmpy to DoWhy, but I can find no way to correctly calculate the ATE where unobserved confounders are involved.

If you could show me how to solve this I would be very grateful indeed!

Many thanks

Graham

grahamharrison68 avatar Oct 09 '22 07:10 grahamharrison68

@grahamharrison68 Thanks for reporting this. I had a quick look and it seems that the estimate_ate method currently only works when a backdoor adjustment set exists, which is not the case in your example. I will see if I can do a quick fix for this.

ankurankan avatar Oct 18 '22 13:10 ankurankan

Hello @ankurankan,

Thanks for picking this up.

I was wondering if there has been any progress? If I change the network definition as follows:

network = BayesianNetwork([("X", "Z"), ("Z", "Y"), ("U", "X"), ("U", "Y")], latents=["U"])

then the code does not crash, but the results are still wrong and exactly the same as if I had defined the network by omitting "U" altogether.

I am about to publish an article about pgmpy vs. DoWhy, and there are two areas where pgmpy currently lags behind -

  1. The unobserved confounders issue reported here.
  2. DoWhy works with categorical and continuous variables natively. In pgmpy, every value of a variable becomes a separate row in the CPD, so continuous variables need to be split up into "bins".

If these two issues could be addressed then pgmpy would be right up there with DoWhy, and in fact it could be out in front as it is much easier to use.
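For readers hitting the binning issue: the kind of pre-discretization pgmpy's discrete CPDs require can be sketched with pandas.cut (the bin count and labels here are arbitrary illustrative choices, not pgmpy requirements):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(rng.standard_normal(1000), name="X")

# Discretize the continuous column into 4 equal-width bins; after binning,
# each bin label is one discrete state of the variable (one entry in a CPD)
# instead of every distinct float being its own state.
x_binned = pd.cut(x, bins=4, labels=["low", "mid_low", "mid_high", "high"])

n_states = x_binned.cat.categories.size
print(n_states)  # 4 discrete states instead of 1000 distinct floats
```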

Thanks and my very best regards

Graham

grahamharrison68 avatar Oct 31 '22 07:10 grahamharrison68

@grahamharrison68 Sorry this took me a while, but it should work now. Here's the output from the example you mentioned:

In [1]: import numpy as np
   ...: import pandas as pd

In [2]: from pgmpy.models import BayesianNetwork

In [3]: from pgmpy.inference import CausalInference

In [4]: RANDOM_SEED = 1
   ...: sample_size = 1000
   ...: 
   ...: np.random.seed(RANDOM_SEED)
   ...: U = np.random.randn(sample_size) # Unobserved confounder
   ...: X = 0.3*U + np.random.randn(sample_size)
   ...: Z = 0.78*X + 0.3*np.random.randn(sample_size)
   ...: Y = 0.45*U + 0.87*Z + 0.4*np.random.randn(sample_size)
   ...: 
   ...: data = pd.DataFrame({'X': X, 'Z': Z, 'Y': Y})

In [5]: network = BayesianNetwork([("X", "Z"), ("Z", "Y"), ("U", "X"), ("U", "Y")], latents=['U'])

In [6]: infer = CausalInference(network)

In [7]: infer.estimate_ate('X', 'Y', data, estimator_type='linear')
Out[7]: 0.6599812388166989

Yeah, pgmpy doesn't have full support for continuous variables yet (some structure learning and causal estimation is supported), but I do plan to extend the SEM class (which is essentially a Gaussian BN), which would also allow us to do parameter estimation and probabilistic inference for continuous variables. I also plan to extend some of the instrumental variable methods already implemented for SEMs to work with DAG-based models.

ankurankan avatar Nov 01 '22 16:11 ankurankan

Hello @ankurankan. Thanks for looking at this. I have used pip to upgrade to the latest version of pgmpy, which is now 0.1.20, and then run the code above. I still get 0.8047575187358673 rather than the 0.6599812388166989 produced in your example above, so I am still getting the wrong answer. Could I ask whether you have released the codebase yet that contains the fix you demonstrated? Thanks and very best regards. Graham

grahamharrison68 avatar Nov 11 '22 08:11 grahamharrison68

@grahamharrison68 Sorry, I forgot to mention that this is still in the dev branch and should be released in version 0.1.21 (scheduled on 31st Dec). You can install the dev branch in the meanwhile using:

pip install git+https://github.com/pgmpy/pgmpy.git@dev

ankurankan avatar Nov 11 '22 12:11 ankurankan

Hello @ankurankan, your change has certainly worked for a front-door example, but I still get the KeyError when I try an instrumental variable.

Here is the code that reproduces the error ...

np.random.seed(RANDOM_SEED)
U = np.random.randn(SAMPLE_SIZE)
Z = np.random.randn(SAMPLE_SIZE)
X = 0.6*Z + 0.43*U + np.random.randn(SAMPLE_SIZE)
Y = 0.45*U + 0.87*X + 0.4*np.random.randn(SAMPLE_SIZE)

data_3 = pd.DataFrame({'X': X, 'Z': Z, 'Y': Y})

dag_3_pgmpy = BayesianNetwork([("Z", "X"), ("X", "Y"), ("U", "X"), ("U", "Y")])

inference = pgmpy.inference.CausalInference(model=dag_3_pgmpy)
ate = inference.estimate_ate(X="X", Y="Y", data=data_3, estimator_type="linear")
print(ate)

grahamharrison68 avatar Apr 11 '23 11:04 grahamharrison68

@grahamharrison68 Yeah, the IV support is a bit limited right now and only works for the SEM class: https://github.com/pgmpy/pgmpy/blob/dev/pgmpy/estimators/SEMEstimator.py#L348. I will try to prioritize making it work on all models.
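Until that lands, the IV estimate for this particular linear model can be checked by hand with the standard Wald ratio cov(Z, Y) / cov(Z, X), valid here because Z affects Y only through X and is independent of U. A minimal NumPy sketch of the issue's example, using a larger sample so the estimate is tight:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
U = rng.standard_normal(n)   # Unobserved confounder of X and Y
Z = rng.standard_normal(n)   # Instrument: affects Y only via X
X = 0.6 * Z + 0.43 * U + rng.standard_normal(n)
Y = 0.45 * U + 0.87 * X + 0.4 * rng.standard_normal(n)

# Wald / IV estimator: ratio of the instrument's covariance with the
# outcome to its covariance with the treatment.
beta_iv = np.cov(Z, Y)[0, 1] / np.cov(Z, X)[0, 1]
print(beta_iv)  # close to the true effect, 0.87
```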

ankurankan avatar Apr 18 '23 04:04 ankurankan