Variable elimination fails when working with a large number of nodes
Subject of the issue
Carrying on from issue https://github.com/pgmpy/pgmpy/issues/1385, I shrank everything by:
- reducing the number of states for each node from 1000 to 2, and
- reducing the number of variables from 360 to ~60.

With that, I can perform variable elimination. However, when running

```python
print(model_infer.query(['Some Variable'], elimination_order=elimination_list, evidence=evidence_dict))
```

the function fails with the error below whenever no evidence is assigned or the number of evidence variables is below some threshold (in my case, fewer than 3). When I assign more than 3 evidence variables, the code works perfectly.
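For concreteness, here is a minimal sketch of the general shape of my code, with toy variable names and CPDs since I cannot share the real model; the traceback that follows is from my real ~60-node network:

```python
from pgmpy.models import BayesianModel  # named BayesianNetwork in newer pgmpy
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination
from pgmpy.inference.EliminationOrder import WeightedMinFill

# A toy chain of binary nodes standing in for the real network.
model = BayesianModel([('A', 'B'), ('B', 'C')])
model.add_cpds(
    TabularCPD('A', 2, [[0.5], [0.5]]),
    TabularCPD('B', 2, [[0.6, 0.3], [0.4, 0.7]], evidence=['A'], evidence_card=[2]),
    TabularCPD('C', 2, [[0.8, 0.1], [0.2, 0.9]], evidence=['B'], evidence_card=[2]),
)

model_infer = VariableElimination(model)
evidence_dict = {'A': 0}
# Order only the variables that are neither queried nor observed.
elimination_list = WeightedMinFill(model).get_elimination_order(['B'])
print(model_infer.query(['C'], elimination_order=elimination_list,
                        evidence=evidence_dict))
```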
```
ValueError                                Traceback (most recent call last)
C:\ProgramData\Anaconda3\lib\site-packages\pgmpy\inference\ExactInference.py in query(self, variables, evidence, elimination_order, joint, show_progress)
    261             elimination_order=elimination_order,
    262             joint=joint,
--> 263             show_progress=show_progress,
    264         )
    265

C:\ProgramData\Anaconda3\lib\site-packages\pgmpy\inference\ExactInference.py in _variable_elimination(self, variables, operation, evidence, elimination_order, joint, show_progress)
    179                     if not set(factor.variables).intersection(eliminated_variables)
    180                 ]
--> 181                 phi = factor_product(*factors)
    182                 phi = getattr(phi, operation)([var], inplace=False)
    183                 del working_factors[var]

C:\ProgramData\Anaconda3\lib\site-packages\pgmpy\factors\base.py in factor_product(*args)
     68         )
     69
---> 70     return reduce(lambda phi1, phi2: phi1 * phi2, args)
     71
     72

C:\ProgramData\Anaconda3\lib\site-packages\pgmpy\factors\base.py in <lambda>(phi1, phi2)

C:\ProgramData\Anaconda3\lib\site-packages\pgmpy\factors\discrete\DiscreteFactor.py in __mul__(self, other)
    890
    891     def __mul__(self, other):
--> 892         return self.product(other, inplace=False)
    893
    894     def __rmul__(self, other):

C:\ProgramData\Anaconda3\lib\site-packages\pgmpy\factors\discrete\DiscreteFactor.py in product(self, phi1, inplace)
    680                 [55, 77]]]]
    681         """
--> 682         phi = self if inplace else self.copy()
    683         if isinstance(phi1, (int, float)):
    684             phi.values *= phi1

C:\ProgramData\Anaconda3\lib\site-packages\pgmpy\factors\discrete\DiscreteFactor.py in copy(self)
    828             self.cardinality,
    829             self.values,
--> 830             state_names=self.state_names.copy(),
    831         )
    832

C:\ProgramData\Anaconda3\lib\site-packages\pgmpy\factors\discrete\DiscreteFactor.py in __init__(self, variables, cardinality, values, state_names)
     90
     91         if values.size != np.product(cardinality):
---> 92             raise ValueError(f"Values array must be of size: {np.product(cardinality)}")
     93
     94         if len(set(variables)) != len(variables):

ValueError: Values array must be of size: 663519232
```
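From the constructor check in the traceback, the required size is the product of the cardinalities of all variables in the intermediate factor, so it grows multiplicatively. A rough illustration (not my data):

```python
import numpy as np

# Even with binary nodes, a single intermediate factor over ~30 variables
# already needs more than 10**9 entries.
print(np.prod([2] * 30))   # 1073741824
```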
The same issue was spotted in https://github.com/pgmpy/pgmpy/issues/869 and https://github.com/pgmpy/pgmpy/issues/573 a few years ago, but I still have this problem with my dataset. I wonder if it is because of the large number of nodes I have. Please let me know what you think. Thank you for your attention!
@Cby19961020 Could you also tell me which elimination order you used? Or, if possible, could you post minimal reproducible code so that I can check in detail what's happening?
Hi @ankurankan, thank you for your attention. Please take a look at the elimination order selection process below for more information:
```python
# List of all the nodes
master_list = list_1.copy()

# Remove the node that we want to predict
master_list.remove('ID60_t+31_Good')

# Remove all the nodes that are given as evidence
for item in evidence_dict.keys():
    master_list.remove(item)

# Compute the optimal elimination order using the built-in heuristic
elimination_list = WeightedMinFill(model).get_elimination_order(master_list)

# Compute the conditional probability using variable elimination
print(model_infer.query(['ID60_t+31_Good'], elimination_order=elimination_list,
                        evidence=evidence_dict))
```
Like I said, the code does work, but only when I include many nodes as evidence. If I have little to no evidence, it fails with the error pasted above.
I do not own the data (I do not have the right to publish it), so I cannot provide it here. If the information above is still not enough to debug, please leave your email address here and I will send you part of the data I am working on. Thank you.
Bo
@Cby19961020 I pushed some optimizations for Variable Elimination (https://github.com/pgmpy/pgmpy/pull/1398) today which should automatically reduce the size of your network based on the query and evidence variables. Maybe you can install the dev branch and give it another try?
Hi @ankurankan,
Thank you for your attention. I installed the dev version, tried one more time, and received the following error:
```
KeyError                                  Traceback (most recent call last)
C:\ProgramData\Anaconda3\lib\site-packages\pgmpy\inference\ExactInference.py in query(self, variables, evidence, elimination_order, joint, show_progress)
    267             elimination_order=elimination_order,
    268             joint=joint,
--> 269             show_progress=show_progress,
    270         )
    271         self.model = orig_model

C:\ProgramData\Anaconda3\lib\site-packages\pgmpy\inference\ExactInference.py in _variable_elimination(self, variables, operation, evidence, elimination_order, joint, show_progress)
    176                 factors = [
    177                     factor
--> 178                     for factor, _ in working_factors[var]
    179                     if not set(factor.variables).intersection(eliminated_variables)
    180                 ]

KeyError: 'ID60_t+28_GoodParts'
```
I think a variable that still needed to be eliminated was deleted when the model was reduced in size. Please take a look and let me know if I can provide any additional information. Thank you again for your great work!
Regards, Bo
@Cby19961020 Thanks for checking; I had missed the use case of manually specifying the elimination order and will have to fix that. But since you are using one of the implemented heuristics to compute the elimination order anyway, you can simply do something like:
```python
model_infer.query(['ID60_t+31_GoodParts'], elimination_order="WeightedMinFill")
```
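The other implemented heuristics can be passed by name in the same way, e.g.:

```python
# Sketch: the names below correspond to the ordering heuristics implemented
# in pgmpy.inference.EliminationOrder ("WeightedMinFill", "MinFill",
# "MinNeighbors", "MinWeight").
model_infer.query(['ID60_t+31_GoodParts'], elimination_order="MinFill")
```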
Hi @ankurankan, I've attempted the method mentioned above and the error still occurs:
```
ValueError                                Traceback (most recent call last)
C:\ProgramData\Anaconda3\lib\site-packages\pgmpy\inference\ExactInference.py in query(self, variables, evidence, elimination_order, joint, show_progress)
    267             elimination_order=elimination_order,
    268             joint=joint,
--> 269             show_progress=show_progress,
    270         )
    271         self.model = orig_model

C:\ProgramData\Anaconda3\lib\site-packages\pgmpy\inference\ExactInference.py in _variable_elimination(self, variables, operation, evidence, elimination_order, joint, show_progress)
    180                 ]
    181                 phi = factor_product(*factors)
--> 182                 phi = getattr(phi, operation)([var], inplace=False)
    183                 del working_factors[var]
    184                 for variable in phi.variables:

C:\ProgramData\Anaconda3\lib\site-packages\pgmpy\factors\discrete\DiscreteFactor.py in marginalize(self, variables, inplace)
    345             raise TypeError("variables: Expected type list or array-like, got type str")
    346
--> 347         phi = self if inplace else self.copy()
    348
    349         for var in variables:

C:\ProgramData\Anaconda3\lib\site-packages\pgmpy\factors\discrete\DiscreteFactor.py in copy(self)
    856             self.cardinality,
    857             self.values,
--> 858             state_names=self.state_names.copy(),
    859         )
    860

C:\ProgramData\Anaconda3\lib\site-packages\pgmpy\factors\discrete\DiscreteFactor.py in __init__(self, variables, cardinality, values, state_names)
     93
     94         if values.size != np.product(cardinality):
---> 95             raise ValueError(f"Values array must be of size: {np.product(cardinality)}")
     96
     97         if len(set(variables)) != len(variables):

ValueError: Values array must be of size: -48234496
```
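As a side note, a negative required size is impossible for a real array; it looks like a silent 32-bit integer overflow inside np.product(cardinality) (NumPy's default integer is 32-bit on Windows), which would mean the true requested size is astronomically large. A small demonstration of the wrap-around:

```python
import numpy as np

# Forcing a 32-bit accumulator reproduces the overflow: 2**31 does not fit
# in int32 and silently comes out negative.
print(np.prod([2] * 31, dtype=np.int32))   # -2147483648
```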
@Cby19961020 The only way I can think of to make this work would be to reduce the number of states of the variables. With 1000 states per variable, a product of just 4 factors would need around 7 TB of memory.
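As a back-of-the-envelope check of that estimate:

```python
# A product of 4 factors over 4 variables with 1000 states each holds
# 1000**4 float64 entries.
entries = 1000 ** 4                 # 1e12 entries
print(entries * 8 / 2 ** 40)        # ~7.28 TiB
```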
Hi @ankurankan, thank you for your kind clarification. I did attempt using all 1000 states to begin with, and that obviously failed. I then discretized the 1000 states down to just 2 (i.e., greater than 2, and smaller than or equal to 2) and it still failed. All the errors shown above were produced using only 2 states per variable, with ~60 variables in total.
+1 on this. @ankurankan This seems to fail after certain nodes are added, and it sometimes does pass, so I think it's beyond a memory error. Steps to reproduce: take the insurance example model and run the query P(Accident | evidence) where evidence is:

```python
evidence = {'Age': 'Adult', 'SocioEcon': 'Prole', 'RiskAversion': 'Psychopath',
            'VehicleYear': 'Older', 'ThisCarDam': 'Moderate', 'RuggedAuto': 'EggShell',
            'MakeModel': 'Economy', 'DrivQuality': 'Excellent', 'Mileage': 'TwentyThou',
            'Antilock': True, 'DrivingSkill': 'SubStandard', 'SeniorTrain': True,
            'ThisCarCost': 'TenThou', 'Theft': True, 'CarValue': 'FiftyThou',
            'HomeBase': 'City', 'AntiTheft': True, 'PropCost': 'Thousand',
            'OtherCarCost': 'Thousand', 'OtherCar': True, 'MedCost': 'Thousand',
            'Cushioning': 'Poor', 'Airbag': True, 'ILiCost': 'Thousand',
            'DrivHist': 'Many'}
```
Also, this works when you remove all Boolean variables. I think the way cardinality is treated for booleans could be the underlying issue.
@harishkashyap The problem in this case is that the state names for the binary variables aren't booleans, as was assumed in the evidence dict above. For example, if you look at the state names for the Antilock variable:
```python
{'MakeModel': ['SportsCar', 'Economy', 'FamilySedan', 'Luxury', 'SuperLuxury'],
 'VehicleYear': ['Current', 'Older'],
 'Antilock': ['True', 'False']}
```
they are the strings 'True' and 'False', and hence query throws an error when the evidence is specified as the boolean True.
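So on the caller's side, the fix is to pass the state names as the strings the model actually uses. A partial sketch of the corrected evidence dict:

```python
# Assuming infer = VariableElimination(model) for the insurance example.
evidence = {
    'Age': 'Adult',
    'Antilock': 'True',      # the string state name, not the boolean True
    'SeniorTrain': 'True',
    # ... remaining entries as before, with every boolean replaced by its
    # 'True'/'False' string counterpart
}
result = infer.query(['Accident'], evidence=evidence)
```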
Oh! I see, got it. Perhaps I can put up a PR that adds a message suggesting to change the variable type? That could help resolve a few other issues as well.
@harishkashyap Yeah, I think it would be helpful to show a better error message in these cases.
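A rough sketch of the kind of check such a PR might add (hypothetical, not pgmpy's actual code):

```python
def validate_evidence(state_names, evidence):
    """Raise a descriptive error when an evidence value is not a known state.

    `state_names` maps each variable to its list of states, as stored on the
    model's CPDs; `evidence` is the dict passed to `query`.
    """
    for var, value in evidence.items():
        states = state_names.get(var)
        if states is not None and value not in states:
            raise ValueError(
                f"Evidence value {value!r} for variable {var!r} is not one of "
                f"its states {states}. Note that states may be the strings "
                f"'True'/'False' rather than Python booleans."
            )
```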