dgl-ke
Error during Model Prediction
I trained a TransE model and ran the following code snippet for the model prediction -
DGLBACKEND=pytorch dglke_predict --model_path ckpts/TransE_l1_JapEnc_2/ \
    --format '*_r_t' --data_files data/rel.list data/tail.list \
    --score_func logsigmoid --exec_mode 'batch_head' \
    --raw_data --entity_mfile data/entities.tsv --rel_mfile data/relations.tsv
On running this code, I encountered the following error -
ckpts/TransE_l1_JapEnc_2/config.json
{'dataset': 'JapEnc', 'model': 'TransE_l1', 'emb_size': 400, 'max_train_step': 500, 'batch_size': 1000, 'neg_sample_size': 200, 'lr': 0.01, 'gamma': 19.9, 'double_ent': False, 'double_rel': False, 'neg_adversarial_sampling': True, 'adversarial_temperature': 1.0, 'regularization_coef': 2e-08, 'regularization_norm': 3, 'emap_file': 'entities.tsv', 'rmap_file': 'relations.tsv'}
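Traceback (most recent call last):
  File "/usr/local/bin/dglke_predict", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/dglke/infer_score.py", line 216, in main
    result = model.topK(head, rel, tail, args.exec_mode, args.topK)
  File "/usr/local/lib/python3.7/dist-packages/dglke/models/infer.py", line 173, in topK
    F.asnumpy(rel[rel_idx]),
IndexError: tensors used as indices must be long, byte or bool tensors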
Please help me with how to resolve this issue. Thanks in advance.
It's most likely caused by PyTorch. Could you tell us which PyTorch version you are using?
sourav1312 wrote on Saturday, March 6, 2021 at 6:22 AM:
I trained a TransE model and ran the following code snippet for the model prediction -
DGLBACKEND=pytorch dglke_predict --model_path ckpts/TransE_l1_JapEnc_2/ --format '*_r_t' --data_files data/rel.list data/tail.list --score_func logsigmoid --exec_mode 'batch_head' --raw_data --entity_mfile data/entities.tsv --rel_mfile data/relations.tsv
On running this code, I encountered the following error -
ckpts/TransE_l1_JapEnc_2/config.json
{'dataset': 'JapEnc', 'model': 'TransE_l1', 'emb_size': 400, 'max_train_step': 500, 'batch_size': 1000, 'neg_sample_size': 200, 'lr': 0.01, 'gamma': 19.9, 'double_ent': False, 'double_rel': False, 'neg_adversarial_sampling': True, 'adversarial_temperature': 1.0, 'regularization_coef': 2e-08, 'regularization_norm': 3, 'emap_file': 'entities.tsv', 'rmap_file': 'relations.tsv'}
Traceback (most recent call last):
  File "/usr/local/bin/dglke_predict", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/dglke/infer_score.py", line 216, in main
    result = model.topK(head, rel, tail, args.exec_mode, args.topK)
  File "/usr/local/lib/python3.7/dist-packages/dglke/models/infer.py", line 173, in topK
    F.asnumpy(rel[rel_idx]),
IndexError: tensors used as indices must be long, byte or bool tensors
Please help me with how to resolve this issue. Thanks in advance.
PyTorch version: 1.7.1+cu101
DGL version: 0.4.3
DGL-KE version: 0.1.2
Code executed in Google Colab
Can you try PyTorch 1.6? That is the version we are using.
Hi! Thanks for the reply.
I have re-run the code on the WN18 dataset using PyTorch 1.6.0 on Google Colab. For the dglke_predict step, I executed the command as per the examples folder. On execution, it shows the following output and raises an error:
ckpts/TransE_l1_wn18_custom_0/config.json
{'dataset': 'wn18_custom', 'model': 'TransE_l1', 'emb_size': 512, 'max_train_step': 2000, 'batch_size': 2048, 'neg_sample_size': 128, 'lr': 0.007, 'gamma': 12.0, 'double_ent': False, 'double_rel': False, 'neg_adversarial_sampling': True, 'adversarial_temperature': 1.0, 'regularization_coef': 2e-07, 'regularization_norm': 3, 'emap_file': 'entities.tsv', 'rmap_file': 'relations.tsv'}
tcmalloc: large alloc 8589934592 bytes == 0x5600e4cfc000 @ 0x7fd3fee0bb6b 0x7fd3fee2b379 0x7fd39fe2192e 0x7fd39fe23946 0x7fd3dbd289e5 0x7fd3dbfadaf3 0x7fd3dbf9ef97 0x7fd3dbf9ec7d 0x7fd3dbf9ef97 0x7fd3dc0a9a1a 0x7fd3dbd394f8 0x7fd3dbd3b166 0x7fd3dbd3b65d 0x7fd3dbd3b80a 0x7fd3dba79fb8 0x7fd3dbfae3f9 0x7fd3db950254 0x7fd3dc080823 0x7fd3ddcbe071 0x7fd3db950254 0x7fd3dc1f4213 0x7fd3eb629282 0x7fd3eb629c06 0x5600d1bab874 0x5600d1a96292 0x5600d1adb410 0x5600d1adce20 0x5600d1b7fbc2 0x5600d1b06760 0x5600d1a9769a 0x5600d1b09e50
CalledProcessError Traceback (most recent call last)
2 frames
/usr/local/lib/python3.7/dist-packages/google/colab/_system_commands.py in check_returncode(self)
    137     if self.returncode:
    138       raise subprocess.CalledProcessError(
--> 139           returncode=self.returncode, cmd=self.args, output=self.output)
    140
    141   def _repr_pretty_(self, p, cycle):  # pylint:disable=unused-argument
CalledProcessError: Command ' cd my_task
DGLBACKEND=pytorch dglke_predict --model_path ckpts/TransE_l1_wn18_custom_0/
--format r --data_files rel.list --topK 5' died with <Signals.SIGKILL: 9>.
It seems like OOM. How large is your instance?
wn18_custom_TransE_l1_relation.npy = 36 KB
wn18_custom_TransE_l1_entity.npy = 80 MB
config.json = 450 bytes
I don't understand the reason for out-of-memory as Google Colab offers around 12-13GB of RAM. Moreover, I used the benchmark database "WN18" to train my model and to understand the working of dglke-predict in a practical manner. I also changed the PyTorch version to 1.6.0 as suggested.
How many lines are in your rel.list? For each relation, it will try to calculate (number of nodes * number of nodes) possible combinations, which is time consuming.
I used the sample rel.list file given in the examples subfolder. It contains just one relationship with an index of 0. Here is the link: https://github.com/awslabs/dgl-ke/blob/master/examples/wn18/rel.list
From your log, it seems tcmalloc is trying to allocate 8 GB of memory (tcmalloc: large alloc 8589934592 bytes). What command are you running? Can you check how much memory is free in your Colab instance?
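To put rough numbers on the (number of nodes * number of nodes) point, here is a back-of-the-envelope sketch. It assumes one float32 score per (head, relation, tail) candidate and the 40,943 entities of the standard WN18 benchmark; the helper `score_matrix_bytes` is made up for illustration and is not part of dglke, and whether dglke_predict really materializes one dense tensor this way is an assumption, not something confirmed in this thread.

```python
def score_matrix_bytes(num_entities: int, num_relations: int = 1,
                       bytes_per_score: int = 4) -> int:
    """Bytes needed to store one score per (head, relation, tail) candidate.

    Hypothetical helper for illustration only; assumes float32 scores.
    """
    return num_entities * num_relations * num_entities * bytes_per_score


NUM_ENTITIES_WN18 = 40_943          # entity count of the standard WN18 benchmark
needed = score_matrix_bytes(NUM_ENTITIES_WN18, num_relations=1)
requested = 8_589_934_592           # size reported on the tcmalloc line in the log

print(f"dense score matrix: {needed / 2**30:.2f} GiB")
print(f"tcmalloc request:   {requested / 2**30:.2f} GiB")
```

With a single relation this already comes to roughly 6 GiB, which is in the same range as the 8 GiB request in the log and leaves little headroom on a 12-13 GB Colab instance. Under the same assumptions, the 80 MB entity file is consistent with about 41k entities at emb_size 512, so the embeddings themselves are small and the candidate scores would be the expensive part.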
My PyTorch is 1.6, but I get a different error:
Traceback (most recent call last):
File "/data_local/venv/dgl_env/bin/dglke_predict", line 8, in
PyTorch 1.6 no longer supports integer division of tensors with div or /.
These are my package versions:
certifi 2020.12.5
chardet 4.0.0
decorator 4.4.2
dgl 0.6.1
dglke 0.1.2
future 0.18.2
idna 2.10
networkx 2.5.1
numpy 1.19.5
Pillow 8.2.0
pip 21.0.1
requests 2.25.1
scipy 1.5.4
setuptools 52.0.0
torch 1.6.0
torchvision 0.7.0
urllib3 1.26.4
wheel 0.36.2
This issue can easily be resolved by casting the index tensors to long, e.g.
tail_idx = (idx % num_tail).long()
I tried this with PyTorch 1.10.0+cu111 (the default version on Colab) and had no issues. I think it would be worth making this change, since as far as I've seen it is the only thing keeping the package from running properly with current PyTorch versions.
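For anyone who wants to see the failure in isolation, here is a minimal sketch. It assumes PyTorch 1.8 or later (for `rounding_mode`), and the tensors `rel` and `idx` and the value of `num_tail` are made-up stand-ins, not the actual variables from dglke/models/infer.py. It shows that `/` on integer tensors now yields a float tensor, that indexing with it raises the IndexError from the original report, and that integer index tensors work.

```python
import torch

rel = torch.tensor([10, 11, 12])   # stand-in relation ids (made up)
num_tail = 4                       # stand-in number of tail candidates
idx = torch.tensor([0, 5, 9])      # stand-in flat indices, dtype int64

# True division: the result is a float tensor on recent PyTorch versions.
float_idx = idx / num_tail

raised = False
try:
    rel[float_idx]                 # float tensors are rejected as indices
except IndexError as err:
    raised = True
    print("indexing failed:", err)

# Integer-valued alternatives that can be used as indices.
rel_idx = torch.div(idx, num_tail, rounding_mode="floor")  # stays int64
tail_idx = (idx % num_tail).long()                         # cast as suggested above

print(rel[rel_idx], tail_idx)
```

Casting with `.long()` and using floor division both give integer indices here. Floor division has the small advantage of never going through a float, which may matter for very large flat indices.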