dgl-ke

Error during Model Prediction

Open sourav1312 opened this issue 3 years ago • 11 comments

I trained a TransE model and ran the following command for model prediction:

DGLBACKEND=pytorch dglke_predict --model_path ckpts/TransE_l1_JapEnc_2/ \
--format '*_r_t' --data_files data/rel.list data/tail.list \
--score_func logsigmoid --exec_mode 'batch_head' \
--raw_data --entity_mfile data/entities.tsv --rel_mfile data/relations.tsv

Running this command produced the following error:

ckpts/TransE_l1_JapEnc_2/config.json
{'dataset': 'JapEnc', 'model': 'TransE_l1', 'emb_size': 400, 'max_train_step': 500, 'batch_size': 1000, 'neg_sample_size': 200, 'lr': 0.01, 'gamma': 19.9, 'double_ent': False, 'double_rel': False, 'neg_adversarial_sampling': True, 'adversarial_temperature': 1.0, 'regularization_coef': 2e-08, 'regularization_norm': 3, 'emap_file': 'entities.tsv', 'rmap_file': 'relations.tsv'}
Traceback (most recent call last):
  File "/usr/local/bin/dglke_predict", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/dglke/infer_score.py", line 216, in main
    result = model.topK(head, rel, tail, args.exec_mode, args.topK)
  File "/usr/local/lib/python3.7/dist-packages/dglke/models/infer.py", line 173, in topK
    F.asnumpy(rel[rel_idx]),
IndexError: tensors used as indices must be long, byte or bool tensors

Please help me with how to resolve this issue. Thanks in advance.

sourav1312 avatar Mar 06 '21 14:03 sourav1312

It's most likely caused by PyTorch. Could you tell us which PyTorch version you are using?


zheng-da avatar Mar 06 '21 19:03 zheng-da

PyTorch version: 1.7.1+cu101
DGL version: 0.4.3
DGL-KE version: 0.1.2
Code executed in Google Colab.

sourav1312 avatar Mar 07 '21 05:03 sourav1312

Can you try PyTorch 1.6? That is the PyTorch version we are using.

zheng-da avatar Mar 13 '21 06:03 zheng-da

Hi! Thanks for the reply.

I re-ran the code on the WN18 dataset with PyTorch 1.6.0 on Google Colab. For prediction, I ran dglke_predict as shown in the examples folder. It prints the following output and then fails with an error:

ckpts/TransE_l1_wn18_custom_0/config.json
{'dataset': 'wn18_custom', 'model': 'TransE_l1', 'emb_size': 512, 'max_train_step': 2000, 'batch_size': 2048, 'neg_sample_size': 128, 'lr': 0.007, 'gamma': 12.0, 'double_ent': False, 'double_rel': False, 'neg_adversarial_sampling': True, 'adversarial_temperature': 1.0, 'regularization_coef': 2e-07, 'regularization_norm': 3, 'emap_file': 'entities.tsv', 'rmap_file': 'relations.tsv'}
tcmalloc: large alloc 8589934592 bytes == 0x5600e4cfc000 @ 0x7fd3fee0bb6b 0x7fd3fee2b379 0x7fd39fe2192e 0x7fd39fe23946 0x7fd3dbd289e5 0x7fd3dbfadaf3 0x7fd3dbf9ef97 0x7fd3dbf9ec7d 0x7fd3dbf9ef97 0x7fd3dc0a9a1a 0x7fd3dbd394f8 0x7fd3dbd3b166 0x7fd3dbd3b65d 0x7fd3dbd3b80a 0x7fd3dba79fb8 0x7fd3dbfae3f9 0x7fd3db950254 0x7fd3dc080823 0x7fd3ddcbe071 0x7fd3db950254 0x7fd3dc1f4213 0x7fd3eb629282 0x7fd3eb629c06 0x5600d1bab874 0x5600d1a96292 0x5600d1adb410 0x5600d1adce20 0x5600d1b7fbc2 0x5600d1b06760 0x5600d1a9769a 0x5600d1b09e50
CalledProcessError                        Traceback (most recent call last)
in ()
----> 1 get_ipython().run_cell_magic('shell', '', '\ncd my_task\n\nDGLBACKEND=pytorch dglke_predict --model_path ckpts/TransE_l1_wn18_custom_0/ \\n--format r --data_files rel.list --topK 5')

2 frames
/usr/local/lib/python3.7/dist-packages/google/colab/_system_commands.py in check_returncode(self)
    137     if self.returncode:
    138       raise subprocess.CalledProcessError(
--> 139           returncode=self.returncode, cmd=self.args, output=self.output)
    140
    141   def _repr_pretty_(self, p, cycle):  # pylint:disable=unused-argument

CalledProcessError: Command '
cd my_task

DGLBACKEND=pytorch dglke_predict --model_path ckpts/TransE_l1_wn18_custom_0/ \
--format r --data_files rel.list --topK 5' died with <Signals.SIGKILL: 9>.

sourav1312 avatar Mar 14 '21 06:03 sourav1312

This looks like an out-of-memory (OOM) kill. How much memory does your instance have?

classicsong avatar Mar 14 '21 15:03 classicsong

wn18_custom_TransE_l1_relation.npy = 36 KB
wn18_custom_TransE_l1_entity.npy = 80 MB
config.json = 450 bytes

I don't understand why it would run out of memory, since Google Colab offers around 12-13 GB of RAM. I used the benchmark dataset WN18 to train my model so that I could see how dglke_predict works in practice. I also changed the PyTorch version to 1.6.0, as suggested.

sourav1312 avatar Mar 14 '21 15:03 sourav1312

How many lines are in your rel.list? For each relation, it will try to score (number of nodes * number of nodes) possible head/tail combinations, which is expensive in both time and memory.

classicsong avatar Mar 15 '21 07:03 classicsong

I used the sample rel.list file given in the examples subfolder. It contains just one relationship with an index of 0. Here is the link: https://github.com/awslabs/dgl-ke/blob/master/examples/wn18/rel.list

sourav1312 avatar Mar 15 '21 15:03 sourav1312

From your log, it seems that tcmalloc is trying to allocate 8 GB of memory (tcmalloc: large alloc 8589934592 bytes). What command are you running? Can you check how much memory is free on your Colab instance?
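As a rough sanity check on the numbers in this exchange, the arithmetic can be sketched as follows. This is only an estimate under stated assumptions: one float32 score (4 bytes) per head/tail pair, and the standard WN18 entity count of 40,943 (the wn18_custom dataset here may differ). The helper name pairwise_score_bytes is made up for illustration and is not part of DGL-KE.

```python
# Rough memory estimate for scoring every (head, tail) pair of one relation.
# Assumption: one float32 score (4 bytes) per pair; real usage may differ.

GIB = 2 ** 30


def pairwise_score_bytes(num_entities: int, bytes_per_score: int = 4) -> int:
    """Bytes needed for a dense num_entities x num_entities score matrix."""
    return num_entities * num_entities * bytes_per_score


# The allocation reported by tcmalloc in the log is exactly 8 GiB.
reported_alloc = 8589934592
print(reported_alloc / GIB)  # 8.0

# Standard WN18 has 40,943 entities (assumption: wn18_custom is similar).
wn18_entities = 40_943
estimate = pairwise_score_bytes(wn18_entities)
print(f"{estimate / GIB:.2f} GiB for one relation")
```

Under these assumptions, a single full score matrix already takes a large share of a 12-13 GB instance, before counting any temporary buffers.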

classicsong avatar Mar 16 '21 00:03 classicsong

> can you try pytorch 1.6? this is the pytorch version we are using.

I am on PyTorch 1.6, but I get a different error:

Traceback (most recent call last):
  File "/data_local/venv/dgl_env/bin/dglke_predict", line 8, in <module>
    sys.exit(main())
  File "/data_local/venv/dgl_env/lib/python3.6/site-packages/dglke/infer_score.py", line 216, in main
    result = model.topK(head, rel, tail, args.exec_mode, args.topK)
  File "/data_local/venv/dgl_env/lib/python3.6/site-packages/dglke/models/infer.py", line 148, in topK
    idx = idx / num_tail
RuntimeError: Integer division of tensors using div or / is no longer supported, and in a future release div will perform true division as in Python 3. Use true_divide or floor_divide (// in Python) instead.

PyTorch 1.6 no longer supports integer division of tensors with div or /.
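The failing line, idx = idx / num_tail, looks like it is turning a flat index into a row index. A minimal plain-Python sketch of that decomposition with floor division follows; split_flat_index and the sample scores are hypothetical and not taken from DGL-KE, and the same idea applies to tensors via // or floor_divide, as the error message suggests.

```python
from typing import Tuple


def split_flat_index(flat_idx: int, num_tail: int) -> Tuple[int, int]:
    """Split a flat index into (head_idx, tail_idx) for a row-major
    num_head x num_tail score matrix, using integer floor division."""
    head_idx = flat_idx // num_tail  # integer result, safe to use as an index
    tail_idx = flat_idx % num_tail
    return head_idx, tail_idx


# Toy 2 x 4 score matrix (2 heads, 4 tails); values are illustrative only.
num_tail = 4
scores = [[0.1, 0.9, 0.3, 0.2],
          [0.5, 0.4, 0.8, 0.7]]

flat_idx = 6
head_idx, tail_idx = split_flat_index(flat_idx, num_tail)
print(head_idx, tail_idx, scores[head_idx][tail_idx])  # 1 2 0.8

# True division yields a float, which cannot be used as an index.
print(type(flat_idx / num_tail).__name__)  # float
```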

These are my package versions:

certifi     2020.12.5
chardet     4.0.0
decorator   4.4.2
dgl         0.6.1
dglke       0.1.2
future      0.18.2
idna        2.10
networkx    2.5.1
numpy       1.19.5
Pillow      8.2.0
pip         21.0.1
requests    2.25.1
scipy       1.5.4
setuptools  52.0.0
torch       1.6.0
torchvision 0.7.0
urllib3     1.26.4
wheel       0.36.2

chenjw505 avatar Apr 26 '21 01:04 chenjw505

This issue can be resolved by casting the index tensors to long, e.g. tail_idx = (idx % num_tail).long(). I tried this with PyTorch 1.10.0+cu111 (the default version on Colab) and had no issues. It might be worth making this change in the package, since as far as I've seen it is the only thing keeping it from running properly with current PyTorch versions.
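A minimal sketch of the cast described above, assuming a recent PyTorch install. The tensors rel and float_idx are stand-ins invented for illustration and are not the variables from infer.py.

```python
import torch

rel = torch.arange(10, 15)            # stand-in for a tensor of relation IDs
float_idx = torch.tensor([0.0, 2.0])  # float indices, e.g. left over from true division

# Indexing with a float tensor raises the IndexError seen in this thread.
raised = False
try:
    rel[float_idx]
except IndexError as err:
    raised = True
    print("IndexError:", err)

# Casting the indices to long (int64) makes the lookup valid.
long_idx = float_idx.long()
selected = rel[long_idx]
print(selected)  # tensor([10, 12])
```

Note that .long() truncates toward zero, so it matches floor division only for non-negative indices, which is the case for positions in a score matrix.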

alonj avatar Feb 13 '22 19:02 alonj