NeMo
Entity Linking gives error while training
Hi,
I followed this tutorial completely: https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/nlp/Entity_Linking_Medical.ipynb#scrollTo=gO3t67PnmtL7
Now, I'm following these steps to train the model on the entire UMLS data:

!conda install python=3.7.13 --y
!pip install Cython
!python -m pip install 'git+https://github.com/NVIDIA/[email protected]#egg=nemo_toolkit[all]'
!pip install -q faiss-gpu
!pip install -q folium==0.2.1
!pip install -q imgaug==0.2.5
!git clone https://github.com/NVIDIA/apex
!cd apex
!pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./apex
!python "/home/ml-002/Fatima/NeMo/examples/nlp/entity_linking/data/umls_dataset_processing.py" --project_dir "./NeMo/examples/nlp/entity_linking"
!python self_alignment_pretraining.py project_dir=.
However, as soon as epoch 0 reaches 1% I get this error:
Epoch 0: 1%| | 16870/2147821 [5:23:29<681:01:39, 1.15s/it, loss=0.093, v_num=
Epoch 0, global step 7000: 'val_loss' reached 0.31622 (best 0.31622), saving model to '/home/ml-002/Fatima/NeMo/examples/nlp/entity_linking/medical_entity_linking_experiments/sap_bert_umls/2022-07-07_20-20-04/checkpoints/sap_bert_umls--val_loss=0.3162-epoch=0.ckpt' as top 3
Epoch 0: 1%| | 17283/2147821 [5:25:50<669:28:28, 1.13s/it, loss=0.0938, v_num=
[NeMo I 2022-07-08 01:46:18 multi_similarity_loss:91] Encountered zero loss in multisimloss, loss = 0.0. No hard examples found in the batch
Error executing job with overrides: ['project_dir=.']
Traceback (most recent call last):
File "self_alignment_pretraining.py", line 38, in main
trainer.fit(model)
File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
return function(*args, **kwargs)
File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
results = self._run_stage()
File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
return self._run_train()
File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1353, in _run_train
self.fit_loop.run()
File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 269, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 208, in advance
batch_output = self.batch_loop.run(batch, batch_idx)
File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
outputs = self.optimizer_loop.run(split_batch, optimizers, batch_idx)
File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 207, in advance
self.optimizer_idx,
File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 256, in _run_optimization
self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 378, in _optimizer_step
using_lbfgs=is_lbfgs,
File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1595, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1646, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 168, in step
step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/strategies/ddp.py", line 286, in optimizer_step
optimizer_output = super().optimizer_step(optimizer, opt_idx, closure, model, **kwargs)
File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/strategies/strategy.py", line 193, in optimizer_step
return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/plugins/precision/native_amp.py", line 85, in optimizer_step
closure_result = closure()
File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 148, in call
self._result = self.closure(*args, **kwargs)
File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 134, in closure
step_output = self._step_fn()
File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 437, in _training_step
training_step_output, self.trainer.accumulate_grad_batches
File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 75, in from_training_step_output
"In automatic_optimization, when training_step
returns a dict, the 'loss' key needs to be present"
pytorch_lightning.utilities.exceptions.MisconfigurationException: In automatic_optimization, when training_step
returns a dict, the 'loss' key needs to be present
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Epoch 0: 1%| | 17283/2147821 [5:25:53<669:33:18, 1.13s/it, loss=0.0938, v_num
I checked the file and training_step does return a 'loss' key. What is the reason behind this error? How do I catch this exception and continue training? Please help! I have been stuck on this for two weeks.
I'm using Ubuntu.
We also encounter the same bug with the full UMLS dataset.
The small tutorial performs as expected, however.
I encountered the same bug, and the only thing I changed was the batch size for the training and validation pairs (to 16) so they would fit in my available RAM.
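In case it helps, those batch sizes can probably be overridden straight from the training command via Hydra rather than by editing the config file. The key names model.train_ds.batch_size and model.validation_ds.batch_size are an assumption based on the usual NeMo config layout, so check the entity linking config shipped with your NeMo version:

!python self_alignment_pretraining.py project_dir=. model.train_ds.batch_size=16 model.validation_ds.batch_size=16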
Change the training_step() method in entity_linking_model.py. You can either remove the if condition that assigns None to train_loss, or simply comment it out and assign a very small value to lr.
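For reference, here is a minimal sketch of that workaround, assuming training_step in entity_linking_model.py follows the pattern implied by the log message above (the multi-similarity loss returns 0.0 when no hard examples are found, and the code then replaces the loss with None, which is what PyTorch Lightning rejects). The variable and attribute names are illustrative, not a verbatim copy of the NeMo source:

# Hypothetical patch sketch for training_step() in entity_linking_model.py.
# Assumption: the original method sets train_loss (and lr) to None when the
# multi-similarity loss is exactly zero; returning {"loss": None, ...} is what
# triggers the MisconfigurationException above.
def training_step(self, batch, batch_idx):
    input_ids, token_type_ids, attention_mask, concept_ids = batch
    logits = self.forward(
        input_ids=input_ids,
        token_type_ids=token_type_ids,
        attention_mask=attention_mask,
    )
    train_loss = self.loss(logits=logits, labels=concept_ids)

    # Original (problematic) branch, commented out as suggested above:
    # if train_loss == 0:
    #     train_loss = None
    #     lr = None

    lr = self._optimizer.param_groups[0]["lr"]
    self.log("train_loss", train_loss)
    self.log("lr", lr, prog_bar=True)
    return {"loss": train_loss, "lr": lr}

An alternative that avoids touching the loss handling at all: with a single optimizer and automatic optimization, returning None from training_step makes PyTorch Lightning skip that batch, so returning None instead of a dict whenever the loss is zero should also get past this error.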
This issue is stale because it has been open for 60 days with no activity.
This issue was closed because it has been inactive for 7 days since being marked as stale.