KnowledgeGraphEmbedding
Memory consumption issue
I use the command:
bash run.sh train RotatE FB15k-237 0 0 1024 256 1000 9.0 1.0 0.00005 100000 16 -de
to train RotatE on an 11 GB GPU. I made sure the GPU was completely free, but I still get the following error:
2022-03-31 19:32:37,370 INFO negative_adversarial_sampling = False
2022-03-31 19:32:37,370 INFO learning_rate = 0
2022-03-31 19:32:39,079 INFO Training average positive_sample_loss at step 0: 5.635527
2022-03-31 19:32:39,079 INFO Training average negative_sample_loss at step 0: 0.003591
2022-03-31 19:32:39,079 INFO Training average loss at step 0: 2.819559
2022-03-31 19:32:39,079 INFO Evaluating on Valid Dataset...
2022-03-31 19:32:39,552 INFO Evaluating the model... (0/2192)
2022-03-31 19:33:38,650 INFO Evaluating the model... (1000/2192)
2022-03-31 19:34:38,503 INFO Evaluating the model... (2000/2192)
2022-03-31 19:34:49,981 INFO Valid MRR at step 0: 0.005509
2022-03-31 19:34:49,982 INFO Valid MR at step 0: 6894.798660
2022-03-31 19:34:49,982 INFO Valid HITS@1 at step 0: 0.004733
2022-03-31 19:34:49,982 INFO Valid HITS@3 at step 0: 0.005076
2022-03-31 19:34:49,982 INFO Valid HITS@10 at step 0: 0.005646
Traceback (most recent call last):
File "codes/run.py", line 371, in <module>
main(parse_args())
File "codes/run.py", line 315, in main
log = kge_model.train_step(kge_model, optimizer, train_iterator, args)
File "/home/prachi/related_work/KnowledgeGraphEmbedding/codes/model.py", line 315, in train_step
loss.backward()
File "/home/prachi/anaconda3/envs/py36/lib/python3.6/site-packages/torch/tensor.py", line 102, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/prachi/anaconda3/envs/py36/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 1.95 GiB (GPU 0; 10.92 GiB total capacity; 7.41 GiB already allocated; 1.51 GiB free; 1.52 GiB cached)
run.sh: line 79:
CUDA_VISIBLE_DEVICES=$GPU_DEVICE python -u $CODE_PATH/run.py --do_train \
--cuda \
--do_valid \
--do_test \
--data_path $FULL_DATA_PATH \
--model $MODEL \
-n $NEGATIVE_SAMPLE_SIZE -b $BATCH_SIZE -d $HIDDEN_DIM \
-g $GAMMA -a $ALPHA -adv \
-lr $LEARNING_RATE --max_steps $MAX_STEPS \
-save $SAVE --test_batch_size $TEST_BATCH_SIZE \
${14} ${15} ${16} ${17} ${18} ${19} ${20}
: No such file or directory
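For reference, the 1.95 GiB allocation in the traceback matches the size of a single float32 tensor of negative-sample entity embeddings at these settings. This is only a back-of-the-envelope estimate on my part; the interpretation of the positional arguments and the assumption that -de doubles the entity embedding dimension are mine, not stated in the logs:

# Rough memory estimate for one batch of negative-sample embeddings
# (my own arithmetic, not taken from the codebase).
batch_size = 1024            # assumed 6th positional argument to run.sh
negative_sample_size = 256   # assumed 7th positional argument
hidden_dim = 1000            # assumed 8th positional argument
entity_dim = 2 * hidden_dim  # assumption: -de doubles the entity dimension
bytes_per_float32 = 4

tensor_bytes = batch_size * negative_sample_size * entity_dim * bytes_per_float32
print(tensor_bytes / 2**30)  # ~1.95 GiB, matching the failed allocation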
I get similar errors when trying to train on FB15k with the command from the best_config.sh file. I reduced the batch size to 500 and training ran, but the performance is much lower than the numbers reported in the paper.
I am not sure what the issue is.
Running the code on a larger server, I found that it uses 13979 MiB.
Reducing the recommended batch size adversely affects the model's performance.
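One possible workaround, not something the repository itself provides, would be gradient accumulation, so that the effective batch size stays at 1024 while each backward pass only holds a smaller micro-batch in memory. A minimal, self-contained sketch with a stand-in model, not the actual RotatE training loop:

import torch
import torch.nn as nn

# Sketch of gradient accumulation: split one large batch into micro-batches
# and accumulate gradients before a single optimizer step. The model and data
# here are placeholders, not the KGE scoring function.
model = nn.Linear(2000, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

full_batch = torch.randn(1024, 2000)   # stand-in for one training batch
targets = torch.randn(1024, 1)
accumulation_steps = 4                 # 4 micro-batches of 256

optimizer.zero_grad()
for micro_x, micro_y in zip(full_batch.chunk(accumulation_steps),
                            targets.chunk(accumulation_steps)):
    loss = nn.functional.mse_loss(model(micro_x), micro_y)
    (loss / accumulation_steps).backward()   # average gradients across micro-batches
optimizer.step()

Note that with the repository's self-adversarial negative sampling this would not be exactly equivalent to a true batch of 1024, since the adversarial weights would be computed per micro-batch.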