Error when running TransE
Hello!
Again, thank you for a great repository. However, when I try to run TransE with kge start config/transE-train.yaml, using the config file below, I get the error and output pasted below. Any idea what might be causing this?
My config/transE-train.yaml:
job.type: train
dataset.name: fb15k-237
train:
  optimizer: Adagrad
  optimizer_args:
    lr: 0.2
valid:
  every: 5
  metric: mean_reciprocal_rank_filtered
model: transe
lookup_embedder:
  dim: 100
  regularize_weight: 0.8e-7
The output:
2021-08-16 11:11:57.190465 Using folder: /home/filco306/lib-kge-fork/local/experiments/20210816-111157-transE-train
2021-08-16 11:11:57.190544 Configuration:
2021-08-16 11:11:57.202096 1vsAll:
2021-08-16 11:11:57.202148 class_name: TrainingJob1vsAll
2021-08-16 11:11:57.202157 KvsAll:
2021-08-16 11:11:57.202165 class_name: TrainingJobKvsAll
2021-08-16 11:11:57.202173 label_smoothing: 0.0
2021-08-16 11:11:57.202180 query_types:
2021-08-16 11:11:57.202188 _po: true
2021-08-16 11:11:57.202196 s_o: false
2021-08-16 11:11:57.202203 sp_: true
2021-08-16 11:11:57.202211 ax_search:
2021-08-16 11:11:57.202219 class_name: AxSearchJob
2021-08-16 11:11:57.202227 num_sobol_trials: -1
2021-08-16 11:11:57.202235 num_trials: 10
2021-08-16 11:11:57.202243 parameter_constraints: []
2021-08-16 11:11:57.202250 parameters: []
2021-08-16 11:11:57.202259 sobol_seed: 0
2021-08-16 11:11:57.202266 console:
2021-08-16 11:11:57.202274 format: {}
2021-08-16 11:11:57.202282 quiet: false
2021-08-16 11:11:57.202290 dataset:
2021-08-16 11:11:57.202297 +++: +++
2021-08-16 11:11:57.202321 files:
2021-08-16 11:11:57.202329 +++: +++
2021-08-16 11:11:57.202337 entity_ids:
2021-08-16 11:11:57.202345 filename: entity_ids.del
2021-08-16 11:11:57.202353 type: map
2021-08-16 11:11:57.202360 entity_strings:
2021-08-16 11:11:57.202368 filename: entity_ids.del
2021-08-16 11:11:57.202376 type: map
2021-08-16 11:11:57.202383 relation_ids:
2021-08-16 11:11:57.202391 filename: relation_ids.del
2021-08-16 11:11:57.202399 type: map
2021-08-16 11:11:57.202407 relation_strings:
2021-08-16 11:11:57.202413 filename: relation_ids.del
2021-08-16 11:11:57.202421 type: map
2021-08-16 11:11:57.202429 test:
2021-08-16 11:11:57.202436 filename: test.del
2021-08-16 11:11:57.202442 type: triples
2021-08-16 11:11:57.202452 train:
2021-08-16 11:11:57.202460 filename: train.del
2021-08-16 11:11:57.202467 type: triples
2021-08-16 11:11:57.202474 valid:
2021-08-16 11:11:57.202482 filename: valid.del
2021-08-16 11:11:57.202490 type: triples
2021-08-16 11:11:57.202498 name: fb15k-237
2021-08-16 11:11:57.202506 num_entities: -1
2021-08-16 11:11:57.202513 num_relations: -1
2021-08-16 11:11:57.202521 pickle: true
2021-08-16 11:11:57.202528 entity_ranking:
2021-08-16 11:11:57.202537 chunk_size: -1
2021-08-16 11:11:57.202544 class_name: EntityRankingJob
2021-08-16 11:11:57.202553 filter_splits:
2021-08-16 11:11:57.202561 - train
2021-08-16 11:11:57.202569 - valid
2021-08-16 11:11:57.202577 filter_with_test: true
2021-08-16 11:11:57.202584 hits_at_k_s:
2021-08-16 11:11:57.202591 - 1
2021-08-16 11:11:57.202600 - 3
2021-08-16 11:11:57.202608 - 10
2021-08-16 11:11:57.202615 - 50
2021-08-16 11:11:57.202623 - 100
2021-08-16 11:11:57.202631 - 200
2021-08-16 11:11:57.202640 - 300
2021-08-16 11:11:57.202648 - 400
2021-08-16 11:11:57.202657 - 500
2021-08-16 11:11:57.202664 - 1000
2021-08-16 11:11:57.202672 metrics_per:
2021-08-16 11:11:57.202679 argument_frequency: false
2021-08-16 11:11:57.202687 head_and_tail: false
2021-08-16 11:11:57.202694 relation_type: false
2021-08-16 11:11:57.202702 tie_handling: rounded_mean_rank
2021-08-16 11:11:57.202709 eval:
2021-08-16 11:11:57.202716 batch_size: 100
2021-08-16 11:11:57.202723 num_workers: 0
2021-08-16 11:11:57.202730 pin_memory: false
2021-08-16 11:11:57.202739 split: valid
2021-08-16 11:11:57.202747 trace_level: epoch
2021-08-16 11:11:57.202755 type: entity_ranking
2021-08-16 11:11:57.202763 grid_search:
2021-08-16 11:11:57.202771 class_name: GridSearchJob
2021-08-16 11:11:57.202780 parameters:
2021-08-16 11:11:57.202788 +++: +++
2021-08-16 11:11:57.202795 run: true
2021-08-16 11:11:57.202803 import:
2021-08-16 11:11:57.202811 - transe
2021-08-16 11:11:57.202818 job:
2021-08-16 11:11:57.202826 device: cuda
2021-08-16 11:11:57.202833 type: train
2021-08-16 11:11:57.202842 lookup_embedder:
2021-08-16 11:11:57.202849 class_name: LookupEmbedder
2021-08-16 11:11:57.202856 dim: 100
2021-08-16 11:11:57.202864 dropout: 0.0
2021-08-16 11:11:57.202872 initialize: normal_
2021-08-16 11:11:57.202880 initialize_args:
2021-08-16 11:11:57.202888 +++: +++
2021-08-16 11:11:57.202896 normalize:
2021-08-16 11:11:57.202903 p: -1.0
2021-08-16 11:11:57.202910 pretrain:
2021-08-16 11:11:57.202918 ensure_all: false
2021-08-16 11:11:57.202926 model_filename: ''
2021-08-16 11:11:57.202934 regularize: lp
2021-08-16 11:11:57.202941 regularize_args:
2021-08-16 11:11:57.202949 +++: +++
2021-08-16 11:11:57.202959 p: 2
2021-08-16 11:11:57.202967 weighted: false
2021-08-16 11:11:57.202975 regularize_weight: 8.0e-08
2021-08-16 11:11:57.202983 round_dim_to: []
2021-08-16 11:11:57.202991 sparse: false
2021-08-16 11:11:57.202999 manual_search:
2021-08-16 11:11:57.203007 class_name: ManualSearchJob
2021-08-16 11:11:57.203015 configurations: []
2021-08-16 11:11:57.203022 run: true
2021-08-16 11:11:57.203030 model: transe
2021-08-16 11:11:57.203038 modules:
2021-08-16 11:11:57.203046 - kge.job
2021-08-16 11:11:57.203053 - kge.model
2021-08-16 11:11:57.203061 - kge.model.embedder
2021-08-16 11:11:57.203068 negative_sampling:
2021-08-16 11:11:57.203077 class_name: TrainingJobNegativeSampling
2021-08-16 11:11:57.203084 filtering:
2021-08-16 11:11:57.203092 implementation: fast_if_available
2021-08-16 11:11:57.203099 o: false
2021-08-16 11:11:57.203107 p: false
2021-08-16 11:11:57.203114 s: false
2021-08-16 11:11:57.203122 split: ''
2021-08-16 11:11:57.203129 frequency:
2021-08-16 11:11:57.203137 smoothing: 1
2021-08-16 11:11:57.203144 implementation: auto
2021-08-16 11:11:57.203152 num_samples:
2021-08-16 11:11:57.203159 o: -1
2021-08-16 11:11:57.203167 p: 0
2021-08-16 11:11:57.203174 s: 3
2021-08-16 11:11:57.203182 sampling_type: uniform
2021-08-16 11:11:57.203190 shared: false
2021-08-16 11:11:57.203197 shared_type: default
2021-08-16 11:11:57.203205 with_replacement: true
2021-08-16 11:11:57.203213 random_seed:
2021-08-16 11:11:57.203226 default: -1
2021-08-16 11:11:57.203234 numba: -1
2021-08-16 11:11:57.203242 numpy: -1
2021-08-16 11:11:57.203249 python: -1
2021-08-16 11:11:57.203256 torch: -1
2021-08-16 11:11:57.203264 search:
2021-08-16 11:11:57.203272 device_pool: []
2021-08-16 11:11:57.203280 num_workers: 1
2021-08-16 11:11:57.203287 on_error: abort
2021-08-16 11:11:57.203312 type: ax_search
2021-08-16 11:11:57.203320 train:
2021-08-16 11:11:57.203328 abort_on_nan: true
2021-08-16 11:11:57.203336 auto_correct: false
2021-08-16 11:11:57.203344 batch_size: 100
2021-08-16 11:11:57.203352 checkpoint:
2021-08-16 11:11:57.203360 every: 5
2021-08-16 11:11:57.203370 keep: 3
2021-08-16 11:11:57.203378 keep_init: true
2021-08-16 11:11:57.203387 loss: kl
2021-08-16 11:11:57.203394 loss_arg: .nan
2021-08-16 11:11:57.203402 lr_scheduler: ''
2021-08-16 11:11:57.203410 lr_scheduler_args:
2021-08-16 11:11:57.203420 +++: +++
2021-08-16 11:11:57.203427 lr_warmup: 0
2021-08-16 11:11:57.203435 max_epochs: 20
2021-08-16 11:11:57.203443 num_workers: 0
2021-08-16 11:11:57.203452 optimizer:
2021-08-16 11:11:57.203460 +++: +++
2021-08-16 11:11:57.203469 default:
2021-08-16 11:11:57.203477 args:
2021-08-16 11:11:57.203485 +++: +++
2021-08-16 11:11:57.203493 lr: 0.2
2021-08-16 11:11:57.203502 type: Adagrad
2021-08-16 11:11:57.203509 pin_memory: false
2021-08-16 11:11:57.203518 split: train
2021-08-16 11:11:57.203527 subbatch_auto_tune: false
2021-08-16 11:11:57.203534 subbatch_size: -1
2021-08-16 11:11:57.203542 trace_level: epoch
2021-08-16 11:11:57.203551 type: KvsAll
2021-08-16 11:11:57.203558 visualize_graph: false
2021-08-16 11:11:57.203567 training_loss:
2021-08-16 11:11:57.203575 class_name: TrainingLossEvaluationJob
2021-08-16 11:11:57.203609 transe:
2021-08-16 11:11:57.203619 class_name: TransE
2021-08-16 11:11:57.203627 entity_embedder:
2021-08-16 11:11:57.203635 +++: +++
2021-08-16 11:11:57.203644 type: lookup_embedder
2021-08-16 11:11:57.203652 l_norm: 1.0
2021-08-16 11:11:57.203661 relation_embedder:
2021-08-16 11:11:57.203669 +++: +++
2021-08-16 11:11:57.203677 type: lookup_embedder
2021-08-16 11:11:57.203685 user:
2021-08-16 11:11:57.203694 +++: +++
2021-08-16 11:11:57.203701 valid:
2021-08-16 11:11:57.203710 early_stopping:
2021-08-16 11:11:57.203718 patience: 5
2021-08-16 11:11:57.203726 threshold:
2021-08-16 11:11:57.203735 epochs: 0
2021-08-16 11:11:57.203743 metric_value: 0.0
2021-08-16 11:11:57.203751 every: 5
2021-08-16 11:11:57.203759 metric: mean_reciprocal_rank_filtered
2021-08-16 11:11:57.203767 metric_expr: float("nan")
2021-08-16 11:11:57.203776 metric_max: true
2021-08-16 11:11:57.203784 split: valid
2021-08-16 11:11:57.203793 trace_level: epoch
2021-08-16 11:11:57.216803 git commit: 2ecac7f
2021-08-16 11:11:57.217376 Loading configuration of dataset fb15k-237 from /home/filco306/lib-kge-fork/data/fb15k-237 ...
2021-08-16 11:11:57.223687 Loaded 14541 keys from map entity_ids
2021-08-16 11:11:57.223890 Loaded 237 keys from map relation_ids
2021-08-16 11:11:57.228590 Loaded 272115 train triples
2021-08-16 11:11:57.229049 Loaded 17535 valid triples
2021-08-16 11:11:57.229460 Loaded 20466 test triples
2021-08-16 11:12:00.380729 [dc8a8332] Initializing 1-to-N training job...
2021-08-16 11:12:01.577768 [dc8a8332] 93372 distinct sp pairs in train
2021-08-16 11:12:01.585548 [dc8a8332] 56317 distinct po pairs in train
2021-08-16 11:12:01.585811 [dc8a8332] Saving checkpoint to /home/filco306/lib-kge-fork/local/experiments/20210816-111157-transE-train/checkpoint_00000.pt...
2021-08-16 11:12:01.620817 [dc8a8332] Starting training...
2021-08-16 11:12:01.620906 [dc8a8332] Starting epoch 1...
2021-08-16 11:12:01.685297 [dc8a8332] CUDA memory after first batch: allocated= 17,740,288 reserved= 507,510,784 max_allocated= 483,646,976
2021-08-16 11:12:01.946906 [dc8a8332] Traceback (most recent call last):
2021-08-16 11:12:01.946918 [dc8a8332] File "/home/filco306/lib-kge-fork/kge/cli.py", line 285, in main
2021-08-16 11:12:01.946921 [dc8a8332] job.run()
2021-08-16 11:12:01.946923 [dc8a8332] File "/home/filco306/lib-kge-fork/kge/job/job.py", line 159, in run
2021-08-16 11:12:01.946925 [dc8a8332] result = self._run()
2021-08-16 11:12:01.946928 [dc8a8332] File "/home/filco306/lib-kge-fork/kge/job/train.py", line 206, in _run
2021-08-16 11:12:01.946930 [dc8a8332] trace_entry = self.run_epoch()
2021-08-16 11:12:01.946932 [dc8a8332] File "/home/filco306/lib-kge-fork/kge/job/train.py", line 389, in run_epoch
2021-08-16 11:12:01.946934 [dc8a8332] raise e
2021-08-16 11:12:01.946936 [dc8a8332] File "/home/filco306/lib-kge-fork/kge/job/train.py", line 378, in run_epoch
2021-08-16 11:12:01.946938 [dc8a8332] batch_result: TrainingJob._ProcessBatchResult = self._process_batch(
2021-08-16 11:12:01.946940 [dc8a8332] File "/home/filco306/lib-kge-fork/kge/job/train.py", line 606, in _process_batch
2021-08-16 11:12:01.946942 [dc8a8332] self._process_subbatch(batch_index, batch, subbatch_slice, result)
2021-08-16 11:12:01.946944 [dc8a8332] File "/home/filco306/lib-kge-fork/kge/job/train_KvsAll.py", line 294, in _process_subbatch
2021-08-16 11:12:01.946946 [dc8a8332] loss_value.backward()
2021-08-16 11:12:01.946949 [dc8a8332] File "/home/filco306/.local/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
2021-08-16 11:12:01.946951 [dc8a8332] torch.autograd.backward(self, gradient, retain_graph, create_graph)
2021-08-16 11:12:01.946953 [dc8a8332] File "/home/filco306/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
2021-08-16 11:12:01.946955 [dc8a8332] Variable._execution_engine.run_backward(
2021-08-16 11:12:01.946957 [dc8a8332] RuntimeError: CUDA error: invalid configuration argument
And again, thank you for a great repository!
I can add that the problem seems to be caused by the use of cdist, as described here. However, I am using torch==1.7.1, and this thread indicates that the problem should have been fixed by 1.5.0.
I seem to have fixed this problem now, by upgrading torch to 1.9.0.
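Since the discussion hinges on which torch version is actually in use (the traceback shows torch being loaded from ~/.local/lib/python3.8/site-packages), it may help to check what the environment really resolves. The following is a minimal sketch using only the standard library; the helper names version_tuple and installed_version are made up for illustration, and the 1.5.0 threshold is simply the version mentioned above.

```python
import re
from importlib import metadata  # available in Python 3.8+


def version_tuple(version: str) -> tuple:
    """Turn a string such as '1.9.0+cu111' into (1, 9, 0).

    The local build tag after '+' is dropped and only the first
    three numeric components are kept.
    """
    numbers = re.findall(r"\d+", version.split("+")[0])
    return tuple(int(n) for n in numbers[:3])


def installed_version(package: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("torch")
    if found is None:
        print("torch is not installed in this environment")
    else:
        new_enough = version_tuple(found) >= (1, 5, 0)
        print(f"torch {found}; at least 1.5.0: {new_enough}")
```

Running this with the same interpreter that runs kge shows whether an upgrade actually took effect for that interpreter.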
I hit the exact same problem when I train TransE with the KvsAll training type, but training it with negative sampling avoids the problem. Any insight on this?
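One possible reason the training type matters, sketched in plain Python under stated assumptions: TransE scores a triple as the negative distance between s + p and o (the config dump shows l_norm: 1.0, i.e. the L1 norm). A KvsAll-style job scores each (s, p) query against every entity, which is the all-pairs distance computation that cdist provides, while scoring individual triples needs only one distance per triple. The function names below are illustrative and are not LibKGE's API.

```python
from typing import List, Tuple

Vector = List[float]


def l1_distance(a: Vector, b: Vector) -> float:
    """L1 (Manhattan) distance between two vectors of equal length."""
    return float(sum(abs(x - y) for x, y in zip(a, b)))


def translate(s: Vector, p: Vector) -> Vector:
    """TransE translation: subject embedding plus relation embedding."""
    return [x + y for x, y in zip(s, p)]


def score_spo(s: Vector, p: Vector, o: Vector) -> float:
    """Score a single triple: the negative L1 distance between s + p and o."""
    return -l1_distance(translate(s, p), o)


def score_sp_all(
    queries: List[Tuple[Vector, Vector]], entities: List[Vector]
) -> List[List[float]]:
    """Score every (s, p) query against every entity.

    The result is a len(queries) x len(entities) matrix, i.e. the
    all-pairs distance pattern that a cdist-style kernel computes.
    """
    return [[score_spo(s, p, o) for o in entities] for s, p in queries]


# Toy data: three 2-dimensional entity embeddings and one relation embedding.
entities = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
relation = [1.0, 0.0]

one_triple = score_spo(entities[0], relation, entities[1])
all_objects = score_sp_all([(entities[0], relation)], entities)
print(one_triple, all_objects)
```

With the numbers from the log above (batch_size 100 and 14541 entities), the all-entities variant would produce a score matrix of up to 100 x 14541 per batch, whereas negative sampling scores only a handful of candidates per triple. Whether that size difference is what triggers the CUDA error here is an assumption, not something the log confirms.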
Yes, did you upgrade to torch 1.9.0?
Nope. I updated to 1.9.0 and the same problem still exists. That's weird.
Confirmed, I see this problem with your config as well.
Thanks for reopening the issue :). It's a great help for our research!
Should be fixed by now.