EMSingleLatentVariable produces random errors at random times

Open ianchlee opened this issue 2 years ago • 1 comments

Description

I was trying to estimate a single latent variable in my model. When I run the EM algorithm using fit_latent_cpds, it sometimes throws seemingly random errors and at other times produces a result.

Steps to Reproduce

I have created the following test data to try the model:

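import numpy as np
import pandas as pd

data = pd.DataFrame({'node1': np.repeat(1, 50), 'node2': np.repeat(1, 50)})
# Set selected rows to 0; .loc avoids pandas chained assignment.
for i in [0, 3, 5, 13, 17, 29, 30, 31, 32]:
    data.loc[i, 'node1'] = 0

for i in [4, 5, 11, 15, 17, 25, 27, 34, 41, 47]:
    data.loc[i, 'node2'] = 0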

The network structure is very simple: a latent variable, latent1, that affects node1 and node2.

from causalnex.network import BayesianNetwork
from causalnex.structure import StructureModel

sm = StructureModel()
sm.add_edges_from([('latent1', 'node1'), ('latent1', 'node2')])
bn = BayesianNetwork(sm)
bn.node_states = {'latent1': {0, 1}, 'node1': {0, 1}, 'node2': {0, 1}}
bn.fit_latent_cpds(lv_name="latent1", lv_states=[0, 1], data=data[["node1", "node2"]], n_runs=30)

Sometimes I get a good result, such as the following:

{'latent1':                  
latent1          
0        0.283705
1        0.716295,

'node1': latent1        0         1
node1                     
0        0.21017  0.168051
1        0.78983  0.831949,

'node2': latent1         0         1
node2                      
0        0.253754  0.178709
1        0.746246  0.821291}

However, at other times I get one of two different error messages. Sometimes I receive this one:

Traceback (most recent call last):
  File "test_2.py", line 28, in <module>
    bn.fit_latent_cpds(lv_name="latent1", lv_states=[0, 1], data=data[["node1", "node2"]], n_runs=30)
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/causalnex/network/network.py", line 553, in fit_latent_cpds
    estimator = EMSingleLatentVariable(
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/causalnex/estimator/em.py", line 144, in __init__
    self._mb_data, self._mb_partitions = self._get_markov_blanket_data(data)
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/causalnex/estimator/em.py", line 585, in _get_markov_blanket_data
    mb_product = cpd_multiplication([self.cpds[node] for node in self.valid_nodes])
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/causalnex/utils/pgmpy_utils.py", line 122, in cpd_multiplication
    product_pgmpy = factor_product(*cpds_pgmpy)  # type: TabularCPD
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/pgmpy/factors/base.py", line 76, in factor_product
    return reduce(lambda phi1, phi2: phi1 * phi2, args)
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/pgmpy/factors/base.py", line 76, in <lambda>
    return reduce(lambda phi1, phi2: phi1 * phi2, args)
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/pgmpy/factors/discrete/DiscreteFactor.py", line 930, in __mul__
    return self.product(other, inplace=False)
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/pgmpy/factors/discrete/DiscreteFactor.py", line 697, in product
    phi = self if inplace else self.copy()
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/pgmpy/factors/discrete/CPD.py", line 299, in copy
    return TabularCPD(
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/pgmpy/factors/discrete/CPD.py", line 142, in __init__
    super(TabularCPD, self).__init__(
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/pgmpy/factors/discrete/DiscreteFactor.py", line 99, in __init__
    raise ValueError("Variable names cannot be same")
ValueError: Variable names cannot be same

And sometimes I receive this error:

Traceback (most recent call last):
  File "test_2.py", line 28, in <module>
    bn.fit_latent_cpds(lv_name="latent1", lv_states=[0, 1], data=data[["node1", "node2"]], n_runs=30)
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/causalnex/network/network.py", line 563, in fit_latent_cpds
    estimator.run(n_runs=n_runs, stopping_delta=stopping_delta)
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/causalnex/estimator/em.py", line 181, in run
    self.e_step()  # Expectation step
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/causalnex/estimator/em.py", line 233, in e_step
    results = self._update_sufficient_stats(node_mb_data["_lookup_"])
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/causalnex/estimator/em.py", line 448, in _update_sufficient_stats
    prob_lv_given_mb = self._mb_product[mb_cols]
KeyError: (nan, 0.0)

My original code also includes the boundaries and priors, but I realised that these two errors appear at random times even without them.

Please let me know if I have done something wrong in setting up the network.

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • CausalNex version used (pip show causalnex): 0.11.0
  • Python version used (python -V): 3.8.15 (via conda)
  • Operating system and version: Mac OS M1

ianchlee, Dec 06 '22 08:12

For reference: https://github.com/pgmpy/pgmpy/issues/1582

In line 702 of DiscreteFactor.py in the pgmpy library, change

new_variables = list(set(phi.variables).union(phi1.variables))

to

new_variables = phi.variables + [var for var in phi1.variables if var not in phi.variables]
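To illustrate why the change suggested above could remove the run-to-run variation, here is a minimal standalone sketch. It does not use pgmpy: the function names and the example variable lists are made up for illustration. By default, Python randomizes string hashes per process, so a list built from a set of strings can come out in a different order on each run, which would be consistent with errors that appear only intermittently. The list-based union always yields the same order.

```python
def unordered_union(a, b):
    # Same pattern as the original line: a set union, whose iteration
    # order is not guaranteed to follow the order of the inputs.
    return list(set(a).union(b))


def ordered_union(a, b):
    # Same pattern as the suggested replacement: keep a's order, then
    # append the items of b that a does not already contain.
    return a + [var for var in b if var not in a]


# Hypothetical variable lists, named after the nodes in this issue.
phi_vars = ["node1", "latent1"]
phi1_vars = ["node2", "latent1"]

merged = ordered_union(phi_vars, phi1_vars)
print(merged)  # same order on every run
print(unordered_union(phi_vars, phi1_vars))  # order may differ between runs
```

Both functions return the same set of names; only the ordering guarantee differs.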

ngkaching, Aug 08 '23 08:08