causalnex
EMSingleLatentVariable produces random errors at random times
Description
I was trying to estimate a single latent variable in my model. When I run the EM algorithm using fit_latent_cpds, it sometimes throws seemingly random errors, while at other times it produces a result.
Steps to Reproduce
I have created the following test data to try the model:
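import numpy as np
import pandas as pd

data = pd.DataFrame({'node1': np.repeat(1, 50), 'node2': np.repeat(1, 50)})
# Assign with .loc rather than chained indexing (data['node1'][i] = 0), which may write to a copy.
data.loc[[0, 3, 5, 13, 17, 29, 30, 31, 32], 'node1'] = 0
data.loc[[4, 5, 11, 15, 17, 25, 27, 34, 41, 47], 'node2'] = 0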
The data structure is very simple: a latent variable latent1 that affects node1 and node2.
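from causalnex.network import BayesianNetwork
from causalnex.structure import StructureModel

# latent1 is the unobserved parent of both observed nodes.
sm = StructureModel()
sm.add_edges_from([('latent1', 'node1'), ('latent1', 'node2')])
bn = BayesianNetwork(sm)
bn.node_states = {'latent1': {0, 1}, 'node1': {0, 1}, 'node2': {0, 1}}
bn.fit_latent_cpds(lv_name="latent1", lv_states=[0, 1], data=data[["node1", "node2"]], n_runs=30)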
Sometimes I get a sensible result, such as the following:
{'latent1':
latent1
0        0.283705
1        0.716295,
 'node1': latent1        0         1
node1
0        0.21017  0.168051
1        0.78983  0.831949,
 'node2': latent1         0         1
node2
0        0.253754  0.178709
1        0.746246  0.821291}
However, at other times I receive one of two different error messages:
Traceback (most recent call last):
  File "test_2.py", line 28, in <module>
    bn.fit_latent_cpds(lv_name="latent1", lv_states=[0, 1], data=data[["node1", "node2"]], n_runs=30)
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/causalnex/network/network.py", line 553, in fit_latent_cpds
    estimator = EMSingleLatentVariable(
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/causalnex/estimator/em.py", line 144, in __init__
    self._mb_data, self._mb_partitions = self._get_markov_blanket_data(data)
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/causalnex/estimator/em.py", line 585, in _get_markov_blanket_data
    mb_product = cpd_multiplication([self.cpds[node] for node in self.valid_nodes])
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/causalnex/utils/pgmpy_utils.py", line 122, in cpd_multiplication
    product_pgmpy = factor_product(*cpds_pgmpy) # type: TabularCPD
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/pgmpy/factors/base.py", line 76, in factor_product
    return reduce(lambda phi1, phi2: phi1 * phi2, args)
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/pgmpy/factors/base.py", line 76, in <lambda>
    return reduce(lambda phi1, phi2: phi1 * phi2, args)
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/pgmpy/factors/discrete/DiscreteFactor.py", line 930, in __mul__
    return self.product(other, inplace=False)
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/pgmpy/factors/discrete/DiscreteFactor.py", line 697, in product
    phi = self if inplace else self.copy()
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/pgmpy/factors/discrete/CPD.py", line 299, in copy
    return TabularCPD(
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/pgmpy/factors/discrete/CPD.py", line 142, in __init__
    super(TabularCPD, self).__init__(
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/pgmpy/factors/discrete/DiscreteFactor.py", line 99, in __init__
    raise ValueError("Variable names cannot be same")
ValueError: Variable names cannot be same
And sometimes I receive this error:
Traceback (most recent call last):
  File "test_2.py", line 28, in <module>
    bn.fit_latent_cpds(lv_name="latent1", lv_states=[0, 1], data=data[["node1", "node2"]], n_runs=30)
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/causalnex/network/network.py", line 563, in fit_latent_cpds
    estimator.run(n_runs=n_runs, stopping_delta=stopping_delta)
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/causalnex/estimator/em.py", line 181, in run
    self.e_step() # Expectation step
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/causalnex/estimator/em.py", line 233, in e_step
    results = self._update_sufficient_stats(node_mb_data["_lookup_"])
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/causalnex/estimator/em.py", line 448, in _update_sufficient_stats
    prob_lv_given_mb = self._mb_product[mb_cols]
KeyError: (nan, 0.0)
My original code also included boundaries and priors; however, I realised that these two errors just pop up randomly at different times.
Please let me know if I have done something wrong in setting up the network.
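One way to narrow down failures that differ from run to run with identical data is to check whether they track Python's per-process string hash randomization, which changes the iteration order of sets of strings. The sketch below is only a diagnostic aid under that assumption; it does not call causalnex, and the helper name set_order_for_seed is made up for illustration. It starts a fresh interpreter for several PYTHONHASHSEED values and prints the order in which a set of the three node names is iterated.

```python
import ast
import os
import subprocess
import sys

# Code executed in each child interpreter: print the iteration order of a set
# holding the three node names used in this issue.
SNIPPET = "print(list({'latent1', 'node1', 'node2'}))"


def set_order_for_seed(seed):
    """Hypothetical helper: return the set iteration order under a fixed hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return ast.literal_eval(result.stdout.strip())


# The order is fixed for a given seed but may differ between seeds.
orders = {seed: set_order_for_seed(seed) for seed in range(1, 6)}
for seed, order in orders.items():
    print(seed, order)
```

If the errors do depend on set ordering, running the script as `PYTHONHASHSEED=<seed> python test_2.py` should make each run repeatable for a given seed, which makes it easier to report which seeds fail.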
Your Environment
Include as many relevant details about the environment in which you experienced the bug:
- CausalNex version used (pip show causalnex): 0.11.0
- Python version used (python -V): 3.8.15 (via conda)
- Operating system and version: Mac OS M1
For reference: https://github.com/pgmpy/pgmpy/issues/1582
In line 702 of DiscreteFactor.py in the pgmpy library, change
new_variables = list(set(phi.variables).union(phi1.variables))
to
new_variables = phi.variables + [var for var in phi1.variables if var not in phi.variables]
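The difference between the two lines can be seen in isolation. The sketch below uses plain Python lists, not pgmpy objects, and the function names are invented for illustration. The set-based version returns the right variables but in whatever order the set yields them, whereas the list-based version always keeps the first factor's variables in place and appends only the new ones. How pgmpy consumes new_variables afterwards is not shown here, so treat the link to the ValueError above as an assumption based on the referenced pgmpy issue.

```python
def merge_with_set(variables, other_variables):
    # Mirrors the original line: the result order depends on set iteration
    # order, which for strings can change between interpreter runs.
    return list(set(variables).union(other_variables))


def merge_preserving_order(variables, other_variables):
    # Mirrors the proposed line: keep the first list's order and append
    # only the variables that are not already present.
    return variables + [var for var in other_variables if var not in variables]


phi_variables = ["node1", "latent1"]
phi1_variables = ["node2", "latent1"]

ordered = merge_preserving_order(phi_variables, phi1_variables)
unordered = merge_with_set(phi_variables, phi1_variables)

print(ordered)  # ['node1', 'latent1', 'node2']
# Same variables either way; only the order is not guaranteed for the set version.
print(sorted(unordered) == sorted(ordered))  # True
```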