Error when running TransE

Open Filco306 opened this issue 4 years ago • 7 comments

Hello!

Again, thank you for a great repository. However, when I attempt to run TransE with kge start config/transE-train.yaml, using the config file below, I get the error and output pasted below. Any idea what might be causing this?

My config/transE-train.yaml:

job.type: train
dataset.name: fb15k-237

train:
  optimizer: Adagrad
  optimizer_args:
    lr: 0.2

valid:
  every: 5
  metric: mean_reciprocal_rank_filtered

model: transe
lookup_embedder:
  dim: 100
  regularize_weight: 0.8e-7

The output:

2021-08-16 11:11:57.190465 Using folder: /home/filco306/lib-kge-fork/local/experiments/20210816-111157-transE-train
2021-08-16 11:11:57.190544 Configuration:
2021-08-16 11:11:57.202096   1vsAll:
2021-08-16 11:11:57.202148     class_name: TrainingJob1vsAll
2021-08-16 11:11:57.202157   KvsAll:
2021-08-16 11:11:57.202165     class_name: TrainingJobKvsAll
2021-08-16 11:11:57.202173     label_smoothing: 0.0
2021-08-16 11:11:57.202180     query_types:
2021-08-16 11:11:57.202188       _po: true
2021-08-16 11:11:57.202196       s_o: false
2021-08-16 11:11:57.202203       sp_: true
2021-08-16 11:11:57.202211   ax_search:
2021-08-16 11:11:57.202219     class_name: AxSearchJob
2021-08-16 11:11:57.202227     num_sobol_trials: -1
2021-08-16 11:11:57.202235     num_trials: 10
2021-08-16 11:11:57.202243     parameter_constraints: []
2021-08-16 11:11:57.202250     parameters: []
2021-08-16 11:11:57.202259     sobol_seed: 0
2021-08-16 11:11:57.202266   console:
2021-08-16 11:11:57.202274     format: {}
2021-08-16 11:11:57.202282     quiet: false
2021-08-16 11:11:57.202290   dataset:
2021-08-16 11:11:57.202297     +++: +++
2021-08-16 11:11:57.202321     files:
2021-08-16 11:11:57.202329       +++: +++
2021-08-16 11:11:57.202337       entity_ids:
2021-08-16 11:11:57.202345         filename: entity_ids.del
2021-08-16 11:11:57.202353         type: map
2021-08-16 11:11:57.202360       entity_strings:
2021-08-16 11:11:57.202368         filename: entity_ids.del
2021-08-16 11:11:57.202376         type: map
2021-08-16 11:11:57.202383       relation_ids:
2021-08-16 11:11:57.202391         filename: relation_ids.del
2021-08-16 11:11:57.202399         type: map
2021-08-16 11:11:57.202407       relation_strings:
2021-08-16 11:11:57.202413         filename: relation_ids.del
2021-08-16 11:11:57.202421         type: map
2021-08-16 11:11:57.202429       test:
2021-08-16 11:11:57.202436         filename: test.del
2021-08-16 11:11:57.202442         type: triples
2021-08-16 11:11:57.202452       train:
2021-08-16 11:11:57.202460         filename: train.del
2021-08-16 11:11:57.202467         type: triples
2021-08-16 11:11:57.202474       valid:
2021-08-16 11:11:57.202482         filename: valid.del
2021-08-16 11:11:57.202490         type: triples
2021-08-16 11:11:57.202498     name: fb15k-237
2021-08-16 11:11:57.202506     num_entities: -1
2021-08-16 11:11:57.202513     num_relations: -1
2021-08-16 11:11:57.202521     pickle: true
2021-08-16 11:11:57.202528   entity_ranking:
2021-08-16 11:11:57.202537     chunk_size: -1
2021-08-16 11:11:57.202544     class_name: EntityRankingJob
2021-08-16 11:11:57.202553     filter_splits:
2021-08-16 11:11:57.202561     - train
2021-08-16 11:11:57.202569     - valid
2021-08-16 11:11:57.202577     filter_with_test: true
2021-08-16 11:11:57.202584     hits_at_k_s:
2021-08-16 11:11:57.202591     - 1
2021-08-16 11:11:57.202600     - 3
2021-08-16 11:11:57.202608     - 10
2021-08-16 11:11:57.202615     - 50
2021-08-16 11:11:57.202623     - 100
2021-08-16 11:11:57.202631     - 200
2021-08-16 11:11:57.202640     - 300
2021-08-16 11:11:57.202648     - 400
2021-08-16 11:11:57.202657     - 500
2021-08-16 11:11:57.202664     - 1000
2021-08-16 11:11:57.202672     metrics_per:
2021-08-16 11:11:57.202679       argument_frequency: false
2021-08-16 11:11:57.202687       head_and_tail: false
2021-08-16 11:11:57.202694       relation_type: false
2021-08-16 11:11:57.202702     tie_handling: rounded_mean_rank
2021-08-16 11:11:57.202709   eval:
2021-08-16 11:11:57.202716     batch_size: 100
2021-08-16 11:11:57.202723     num_workers: 0
2021-08-16 11:11:57.202730     pin_memory: false
2021-08-16 11:11:57.202739     split: valid
2021-08-16 11:11:57.202747     trace_level: epoch
2021-08-16 11:11:57.202755     type: entity_ranking
2021-08-16 11:11:57.202763   grid_search:
2021-08-16 11:11:57.202771     class_name: GridSearchJob
2021-08-16 11:11:57.202780     parameters:
2021-08-16 11:11:57.202788       +++: +++
2021-08-16 11:11:57.202795     run: true
2021-08-16 11:11:57.202803   import:
2021-08-16 11:11:57.202811   - transe
2021-08-16 11:11:57.202818   job:
2021-08-16 11:11:57.202826     device: cuda
2021-08-16 11:11:57.202833     type: train
2021-08-16 11:11:57.202842   lookup_embedder:
2021-08-16 11:11:57.202849     class_name: LookupEmbedder
2021-08-16 11:11:57.202856     dim: 100
2021-08-16 11:11:57.202864     dropout: 0.0
2021-08-16 11:11:57.202872     initialize: normal_
2021-08-16 11:11:57.202880     initialize_args:
2021-08-16 11:11:57.202888       +++: +++
2021-08-16 11:11:57.202896     normalize:
2021-08-16 11:11:57.202903       p: -1.0
2021-08-16 11:11:57.202910     pretrain:
2021-08-16 11:11:57.202918       ensure_all: false
2021-08-16 11:11:57.202926       model_filename: ''
2021-08-16 11:11:57.202934     regularize: lp
2021-08-16 11:11:57.202941     regularize_args:
2021-08-16 11:11:57.202949       +++: +++
2021-08-16 11:11:57.202959       p: 2
2021-08-16 11:11:57.202967       weighted: false
2021-08-16 11:11:57.202975     regularize_weight: 8.0e-08
2021-08-16 11:11:57.202983     round_dim_to: []
2021-08-16 11:11:57.202991     sparse: false
2021-08-16 11:11:57.202999   manual_search:
2021-08-16 11:11:57.203007     class_name: ManualSearchJob
2021-08-16 11:11:57.203015     configurations: []
2021-08-16 11:11:57.203022     run: true
2021-08-16 11:11:57.203030   model: transe
2021-08-16 11:11:57.203038   modules:
2021-08-16 11:11:57.203046   - kge.job
2021-08-16 11:11:57.203053   - kge.model
2021-08-16 11:11:57.203061   - kge.model.embedder
2021-08-16 11:11:57.203068   negative_sampling:
2021-08-16 11:11:57.203077     class_name: TrainingJobNegativeSampling
2021-08-16 11:11:57.203084     filtering:
2021-08-16 11:11:57.203092       implementation: fast_if_available
2021-08-16 11:11:57.203099       o: false
2021-08-16 11:11:57.203107       p: false
2021-08-16 11:11:57.203114       s: false
2021-08-16 11:11:57.203122       split: ''
2021-08-16 11:11:57.203129     frequency:
2021-08-16 11:11:57.203137       smoothing: 1
2021-08-16 11:11:57.203144     implementation: auto
2021-08-16 11:11:57.203152     num_samples:
2021-08-16 11:11:57.203159       o: -1
2021-08-16 11:11:57.203167       p: 0
2021-08-16 11:11:57.203174       s: 3
2021-08-16 11:11:57.203182     sampling_type: uniform
2021-08-16 11:11:57.203190     shared: false
2021-08-16 11:11:57.203197     shared_type: default
2021-08-16 11:11:57.203205     with_replacement: true
2021-08-16 11:11:57.203213   random_seed:
2021-08-16 11:11:57.203226     default: -1
2021-08-16 11:11:57.203234     numba: -1
2021-08-16 11:11:57.203242     numpy: -1
2021-08-16 11:11:57.203249     python: -1
2021-08-16 11:11:57.203256     torch: -1
2021-08-16 11:11:57.203264   search:
2021-08-16 11:11:57.203272     device_pool: []
2021-08-16 11:11:57.203280     num_workers: 1
2021-08-16 11:11:57.203287     on_error: abort
2021-08-16 11:11:57.203312     type: ax_search
2021-08-16 11:11:57.203320   train:
2021-08-16 11:11:57.203328     abort_on_nan: true
2021-08-16 11:11:57.203336     auto_correct: false
2021-08-16 11:11:57.203344     batch_size: 100
2021-08-16 11:11:57.203352     checkpoint:
2021-08-16 11:11:57.203360       every: 5
2021-08-16 11:11:57.203370       keep: 3
2021-08-16 11:11:57.203378       keep_init: true
2021-08-16 11:11:57.203387     loss: kl
2021-08-16 11:11:57.203394     loss_arg: .nan
2021-08-16 11:11:57.203402     lr_scheduler: ''
2021-08-16 11:11:57.203410     lr_scheduler_args:
2021-08-16 11:11:57.203420       +++: +++
2021-08-16 11:11:57.203427     lr_warmup: 0
2021-08-16 11:11:57.203435     max_epochs: 20
2021-08-16 11:11:57.203443     num_workers: 0
2021-08-16 11:11:57.203452     optimizer:
2021-08-16 11:11:57.203460       +++: +++
2021-08-16 11:11:57.203469       default:
2021-08-16 11:11:57.203477         args:
2021-08-16 11:11:57.203485           +++: +++
2021-08-16 11:11:57.203493           lr: 0.2
2021-08-16 11:11:57.203502         type: Adagrad
2021-08-16 11:11:57.203509     pin_memory: false
2021-08-16 11:11:57.203518     split: train
2021-08-16 11:11:57.203527     subbatch_auto_tune: false
2021-08-16 11:11:57.203534     subbatch_size: -1
2021-08-16 11:11:57.203542     trace_level: epoch
2021-08-16 11:11:57.203551     type: KvsAll
2021-08-16 11:11:57.203558     visualize_graph: false
2021-08-16 11:11:57.203567   training_loss:
2021-08-16 11:11:57.203575     class_name: TrainingLossEvaluationJob
2021-08-16 11:11:57.203609   transe:
2021-08-16 11:11:57.203619     class_name: TransE
2021-08-16 11:11:57.203627     entity_embedder:
2021-08-16 11:11:57.203635       +++: +++
2021-08-16 11:11:57.203644       type: lookup_embedder
2021-08-16 11:11:57.203652     l_norm: 1.0
2021-08-16 11:11:57.203661     relation_embedder:
2021-08-16 11:11:57.203669       +++: +++
2021-08-16 11:11:57.203677       type: lookup_embedder
2021-08-16 11:11:57.203685   user:
2021-08-16 11:11:57.203694     +++: +++
2021-08-16 11:11:57.203701   valid:
2021-08-16 11:11:57.203710     early_stopping:
2021-08-16 11:11:57.203718       patience: 5
2021-08-16 11:11:57.203726       threshold:
2021-08-16 11:11:57.203735         epochs: 0
2021-08-16 11:11:57.203743         metric_value: 0.0
2021-08-16 11:11:57.203751     every: 5
2021-08-16 11:11:57.203759     metric: mean_reciprocal_rank_filtered
2021-08-16 11:11:57.203767     metric_expr: float("nan")
2021-08-16 11:11:57.203776     metric_max: true
2021-08-16 11:11:57.203784     split: valid
2021-08-16 11:11:57.203793     trace_level: epoch
2021-08-16 11:11:57.216803   git commit: 2ecac7f
2021-08-16 11:11:57.217376 Loading configuration of dataset fb15k-237 from /home/filco306/lib-kge-fork/data/fb15k-237 ...
2021-08-16 11:11:57.223687 Loaded 14541 keys from map entity_ids
2021-08-16 11:11:57.223890 Loaded 237 keys from map relation_ids
2021-08-16 11:11:57.228590 Loaded 272115 train triples
2021-08-16 11:11:57.229049 Loaded 17535 valid triples
2021-08-16 11:11:57.229460 Loaded 20466 test triples
2021-08-16 11:12:00.380729 [dc8a8332] Initializing 1-to-N training job...
2021-08-16 11:12:01.577768 [dc8a8332]   93372 distinct sp pairs in train
2021-08-16 11:12:01.585548 [dc8a8332]   56317 distinct po pairs in train
2021-08-16 11:12:01.585811 [dc8a8332] Saving checkpoint to /home/filco306/lib-kge-fork/local/experiments/20210816-111157-transE-train/checkpoint_00000.pt...
2021-08-16 11:12:01.620817 [dc8a8332] Starting training...
2021-08-16 11:12:01.620906 [dc8a8332] Starting epoch 1...
2021-08-16 11:12:01.685297 [dc8a8332] CUDA memory after first batch: allocated=    17,740,288 reserved=   507,510,784 max_allocated=   483,646,976
2021-08-16 11:12:01.946906 [dc8a8332] Traceback (most recent call last):
2021-08-16 11:12:01.946918 [dc8a8332]   File "/home/filco306/lib-kge-fork/kge/cli.py", line 285, in main
2021-08-16 11:12:01.946921 [dc8a8332]     job.run()
2021-08-16 11:12:01.946923 [dc8a8332]   File "/home/filco306/lib-kge-fork/kge/job/job.py", line 159, in run
2021-08-16 11:12:01.946925 [dc8a8332]     result = self._run()
2021-08-16 11:12:01.946928 [dc8a8332]   File "/home/filco306/lib-kge-fork/kge/job/train.py", line 206, in _run
2021-08-16 11:12:01.946930 [dc8a8332]     trace_entry = self.run_epoch()
2021-08-16 11:12:01.946932 [dc8a8332]   File "/home/filco306/lib-kge-fork/kge/job/train.py", line 389, in run_epoch
2021-08-16 11:12:01.946934 [dc8a8332]     raise e
2021-08-16 11:12:01.946936 [dc8a8332]   File "/home/filco306/lib-kge-fork/kge/job/train.py", line 378, in run_epoch
2021-08-16 11:12:01.946938 [dc8a8332]     batch_result: TrainingJob._ProcessBatchResult = self._process_batch(
2021-08-16 11:12:01.946940 [dc8a8332]   File "/home/filco306/lib-kge-fork/kge/job/train.py", line 606, in _process_batch
2021-08-16 11:12:01.946942 [dc8a8332]     self._process_subbatch(batch_index, batch, subbatch_slice, result)
2021-08-16 11:12:01.946944 [dc8a8332]   File "/home/filco306/lib-kge-fork/kge/job/train_KvsAll.py", line 294, in _process_subbatch
2021-08-16 11:12:01.946946 [dc8a8332]     loss_value.backward()
2021-08-16 11:12:01.946949 [dc8a8332]   File "/home/filco306/.local/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
2021-08-16 11:12:01.946951 [dc8a8332]     torch.autograd.backward(self, gradient, retain_graph, create_graph)
2021-08-16 11:12:01.946953 [dc8a8332]   File "/home/filco306/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
2021-08-16 11:12:01.946955 [dc8a8332]     Variable._execution_engine.run_backward(
2021-08-16 11:12:01.946957 [dc8a8332] RuntimeError: CUDA error: invalid configuration argument

And again, thank you for a great repository!

Filco306, Aug 16 '21 11:08

I can add that the problem seems to stem from the use of cdist, as described here. However, I am using torch==1.7.1, and this thread indicates that the problem should have been fixed by 1.5.0.

Filco306, Aug 16 '21 11:08

I seem to have fixed this problem now, by upgrading torch to 1.9.0.

Filco306, Aug 16 '21 11:08

I hit the exact same problem when training TransE with the KvsAll training type. Training it with NegativeSampling instead avoids the problem. Any insight on this?
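For anyone who wants to try the same workaround, here is a minimal sketch of the original config with the training type switched. It assumes that train.type accepts the value negative_sampling, which the negative_sampling section and the train.type: KvsAll entry in the configuration dump above suggest; the remaining keys are copied unchanged from the original config. This only sidesteps the KvsAll code path and does not fix the underlying error.

```yaml
job.type: train
dataset.name: fb15k-237

train:
  # Assumption: switches from the default KvsAll job to negative sampling.
  type: negative_sampling
  optimizer: Adagrad
  optimizer_args:
    lr: 0.2

valid:
  every: 5
  metric: mean_reciprocal_rank_filtered

model: transe
lookup_embedder:
  dim: 100
  regularize_weight: 0.8e-7
```

The negative sampling settings themselves (for example num_samples) are left at the defaults shown in the dump and may need tuning.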

jwzhi, Oct 27 '21 16:10

Yes, did you upgrade to torch 1.9.0?

Filco306, Oct 27 '21 19:10

Nope. I updated to 1.9.0 and the same problem exists. That's weird.

jwzhi, Nov 10 '21 13:11

Confirmed, I see this problem with your config as well.

rgemulla, Nov 10 '21 14:11

Thanks for reopening the issue :). It's a great help for our research!

jwzhi, Nov 10 '21 14:11

Should be fixed by now.

rgemulla, Jun 28 '23 13:06