Runtime error in graph search policy network during training
Hello, I am trying to replicate the steps to train and test the model. After processing the data and pretraining the embeddings, I keep encountering the following runtime error when training the model on any dataset:
Epoch 0
Traceback (most recent call last):
  File "/home/user/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/user/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/user/DacKGR/DacKGR-master/src/experiments.py", line 822, in <module>
    run_experiment(args)
  File "/home/user/DacKGR/DacKGR-master/src/experiments.py", line 803, in run_experiment
    train(lf)
  File "/home/user/DacKGR/DacKGR-master/src/experiments.py", line 267, in train
    lf.run_train(train_data, dev_data)
  File "/home/user/DacKGR/DacKGR-master/src/learn_framework.py", line 96, in run_train
    loss = self.loss(mini_batch)
  File "/home/user/DacKGR/DacKGR-master/src/rl/graph_search/rs_pg.py", line 115, in loss
    output = self.rollout(e1, r, e2, num_steps=self.num_rollout_steps, kg_pred=kg_pred)
  File "/home/user/DacKGR/DacKGR-master/src/rl/graph_search/rs_pg.py", line 282, in rollout
    e, obs, kg, kg_pred=kg_pred, fn_kg=self.fn_kg, use_action_space_bucketing=self.use_action_space_bucketing, use_kg_pred=self.use_state_prediction)
  File "/home/user/DacKGR/DacKGR-master/src/rl/graph_search/pn.py", line 138, in transit
    db_action_spaces, db_references = self.get_action_space_in_buckets(e, obs, kg, relation_att=relation_att, inference=inference)
  File "/home/user/DacKGR/DacKGR-master/src/rl/graph_search/pn.py", line 289, in get_action_space_in_buckets
    e_space_b, r_space_b, action_mask_b = self.get_dynamic_action_space(e_space_b, r_space_b, action_mask_b, e_b, relation_att[l_batch_refs])
  File "/home/user/DacKGR/DacKGR-master/src/rl/graph_search/pn.py", line 208, in get_dynamic_action_space
    relation_idx = torch.multinomial(relation_att, additional_relation_size)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
free(): invalid pointer
./experiment-rs.sh: line 87: 560302 Aborted
Any pointers on how to solve this issue would be most helpful.
To better identify the problem, could you tell me what dataset you are running on?
I have encountered the exact same issue with both the WD-singer and FB-15k-237 subsets, which makes me think it is not a dataset-specific issue.
Could you tell me your PyTorch version? I re-downloaded the code and ran it without encountering any errors. Using FB15K-237-20% as an example, make sure you run the following commands in order:
unzip data.zip
./experiment.sh configs/fb15k-237-20.sh --process_data <gpu-id>
./experiment-emb.sh configs/fb15k-237-20-conve.sh --train <gpu-id>
./experiment-rs.sh configs/fb15k-237-20-rs.sh --train <gpu-id>
The PyTorch version is 1.7.0. I have tried creating a new environment and running the commands again in the correct order, but I still get the same error after training for 3 epochs.
I am sorry, but although I have run this code many times, I cannot reproduce this error. What is your CUDA version?
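In case it is useful for comparing environments, the versions can be read from inside the Python environment used for training. This is only a minimal sketch; `report_versions` is an illustrative helper name and is not part of the repository.

```python
import torch


def report_versions():
    """Collect the PyTorch and CUDA details relevant to this thread."""
    return {
        # Version of the installed PyTorch package, e.g. "1.7.0"
        "pytorch": str(torch.__version__),
        # CUDA version PyTorch was built against (None for CPU-only builds)
        "cuda_build": torch.version.cuda,
        # Whether a CUDA device is actually usable at runtime
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in report_versions().items():
        print(f"{key}: {value}")
```

Note that `torch.version.cuda` reports the CUDA version PyTorch was compiled with, which may differ from the system-wide toolkit or driver version.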
The CUDA version is 11.0. Thank you for your efforts. Could you share your versions as well? I can try to reproduce the issue in the same environment.
PyTorch: 1.8.1, CUDA: 11.1
It seems that our environments are very similar.
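Since the environments look alike, it may help to inspect the tensor itself. The error message says that the weights passed to `torch.multinomial` contain `inf`, `nan`, or a negative entry, so checking `relation_att` just before the call in `get_dynamic_action_space` (pn.py, line 208 in the traceback) could show which case applies. The following is a minimal sketch of such a check; `check_probs` is a hypothetical helper, not something provided by the repository or by PyTorch.

```python
import torch


def check_probs(probs, name="probs"):
    """Return a list of problems that would make torch.multinomial reject `probs`."""
    problems = []
    if torch.isnan(probs).any():
        problems.append(f"{name} contains nan")
    if torch.isinf(probs).any():
        problems.append(f"{name} contains inf")
    if (probs < 0).any():
        problems.append(f"{name} contains negative entries")
    # Rows whose weights sum to zero cannot be sampled from either.
    if (probs.sum(dim=-1) <= 0).any():
        problems.append(f"{name} has a row with non-positive sum")
    return problems


if __name__ == "__main__":
    good = torch.tensor([[0.2, 0.8], [0.5, 0.5]])
    bad = torch.tensor([[float("nan"), 1.0], [0.5, -0.1]])
    print(check_probs(good, "good"))
    print(check_probs(bad, "bad"))
```

Calling `check_probs(relation_att, "relation_att")` right before the sampling line and printing the result would at least tell whether the values are already `nan` when they reach that point, which in turn suggests looking further upstream (for example at the loss or the pretrained embeddings) rather than at the sampling step itself.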