DacKGR

Runtime error in graph search policy network during training

Open nitishajain opened this issue 4 years ago • 7 comments

Hello, I am trying to replicate the steps to train and test the model. After processing the data and pretraining the embeddings, I keep hitting the following runtime error when training the model on any dataset:

Epoch 0
Traceback (most recent call last):
  File "/home/user/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/user/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/user/DacKGR/DacKGR-master/src/experiments.py", line 822, in <module>
    run_experiment(args)
  File "/home/user/DacKGR/DacKGR-master/src/experiments.py", line 803, in run_experiment
    train(lf)
  File "/home/user/DacKGR/DacKGR-master/src/experiments.py", line 267, in train
    lf.run_train(train_data, dev_data)
  File "/home/user/DacKGR/DacKGR-master/src/learn_framework.py", line 96, in run_train
    loss = self.loss(mini_batch)
  File "/home/user/DacKGR/DacKGR-master/src/rl/graph_search/rs_pg.py", line 115, in loss
    output = self.rollout(e1, r, e2, num_steps=self.num_rollout_steps, kg_pred=kg_pred)
  File "/home/user/DacKGR/DacKGR-master/src/rl/graph_search/rs_pg.py", line 282, in rollout
    e, obs, kg, kg_pred=kg_pred, fn_kg=self.fn_kg, use_action_space_bucketing=self.use_action_space_bucketing, use_kg_pred=self.use_state_prediction)
  File "/home/user/DacKGR/DacKGR-master/src/rl/graph_search/pn.py", line 138, in transit
    db_action_spaces, db_references = self.get_action_space_in_buckets(e, obs, kg, relation_att=relation_att, inference=inference)
  File "/home/user/DacKGR/DacKGR-master/src/rl/graph_search/pn.py", line 289, in get_action_space_in_buckets
    e_space_b, r_space_b, action_mask_b = self.get_dynamic_action_space(e_space_b, r_space_b, action_mask_b, e_b, relation_att[l_batch_refs])
  File "/home/user/DacKGR/DacKGR-master/src/rl/graph_search/pn.py", line 208, in get_dynamic_action_space
    relation_idx = torch.multinomial(relation_att, additional_relation_size)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
free(): invalid pointer
./experiment-rs.sh: line 87: 560302 Aborted    
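The final error says that the tensor passed to `torch.multinomial` (here `relation_att`) contains `inf`, `nan`, or a negative value. One way to narrow this down is to count each kind of invalid entry just before the sampling call in `get_dynamic_action_space`. The following is a minimal sketch and not part of DacKGR: `summarize_invalid_probs` and `is_valid` are hypothetical helper names, and the code assumes the weights are handed over as a flat list of Python floats (for example via `relation_att.detach().flatten().tolist()`), so it runs without PyTorch or a GPU.

```python
import math


def summarize_invalid_probs(values):
    """Count the kinds of entries named in the RuntimeError message.

    `values` is assumed to be a flat iterable of Python floats.
    """
    counts = {"nan": 0, "inf": 0, "negative": 0, "total": 0}
    for v in values:
        counts["total"] += 1
        if math.isnan(v):
            counts["nan"] += 1
        elif math.isinf(v):
            counts["inf"] += 1  # covers both +inf and -inf
        elif v < 0:
            counts["negative"] += 1
    return counts


def is_valid(counts):
    """True when no nan, inf, or negative entry was found."""
    return counts["nan"] == 0 and counts["inf"] == 0 and counts["negative"] == 0


if __name__ == "__main__":
    # Made-up sample values for illustration, not taken from a real run.
    sample = [0.2, float("nan"), 0.5, -0.1, float("inf")]
    print(summarize_invalid_probs(sample))
```

If the counts show `nan` entries appearing only after a few epochs, that would point at the values feeding `relation_att` (for instance a diverging loss) rather than at the dataset, though this is only a hypothesis to check.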

Any pointers for solving this issue would be most helpful.

nitishajain avatar May 26 '21 20:05 nitishajain

To better identify the problem, could you tell me what dataset you are running on?

davidlvxin avatar May 28 '21 06:05 davidlvxin

I have encountered the exact same issue with both the WD-singer and FB-15k-237 subsets, which makes me think it is not a dataset-specific issue.

nitishajain avatar May 28 '21 12:05 nitishajain

Could you tell me your PyTorch version? I re-downloaded and ran the code without encountering any errors. Using FB15K-237-20% as an example, make sure you run the following commands in order:

unzip data.zip
./experiment.sh configs/fb15k-237-20.sh --process_data <gpu-id>
./experiment-emb.sh configs/fb15k-237-20-conve.sh --train <gpu-id>
./experiment-rs.sh configs/fb15k-237-20-rs.sh --train <gpu-id>

davidlvxin avatar May 31 '21 04:05 davidlvxin

The PyTorch version is 1.7.0. I have tried creating a new environment and running the commands again in the correct order, but I still get the same error after training for 3 epochs.

nitishajain avatar May 31 '21 21:05 nitishajain

I am sorry, but I have run this code many times and cannot reproduce this error. What is your CUDA version?

davidlvxin avatar Jun 02 '21 05:06 davidlvxin

The CUDA version is 11.0. Thank you for your efforts. Could you share your versions as well? I can try to reproduce the error in the same environment.

nitishajain avatar Jun 02 '21 17:06 nitishajain

PyTorch: 1.8.1, CUDA: 11.1

It seems that our environments are very similar.
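To compare the two setups exactly, the versions can be printed from inside the environment that runs training. This is a small sketch: `environment_report` is a hypothetical helper name, and the code assumes PyTorch may or may not be importable. `torch.__version__` and `torch.version.cuda` are existing PyTorch attributes. The latter reports the CUDA version PyTorch was built against, which can differ from the version supported by the installed driver.

```python
import importlib
import platform


def environment_report():
    """Collect Python, PyTorch, and CUDA build versions as a dict."""
    report = {"python": platform.python_version(), "torch": None, "cuda": None}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # PyTorch is not installed in this environment.
        return report
    report["torch"] = str(torch.__version__)
    report["cuda"] = torch.version.cuda  # None on CPU-only builds
    return report


if __name__ == "__main__":
    for name, value in environment_report().items():
        print(f"{name}: {value}")
```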

davidlvxin avatar Jun 04 '21 00:06 davidlvxin