
Entity Linking gives error while training


Hi,

I followed this tutorial completely: https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/nlp/Entity_Linking_Medical.ipynb#scrollTo=gO3t67PnmtL7

Now, I'm following these steps to train the model on the entire UMLS data:

!conda install python=3.7.13 --y
!pip install Cython
!python -m pip install 'git+https://github.com/NVIDIA/[email protected]#egg=nemo_toolkit[all]'
!pip install -q faiss-gpu
!pip install -q folium==0.2.1

!pip install -q imgaug==0.2.5
!git clone https://github.com/NVIDIA/apex
!cd apex
!pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./apex
!python "/home/ml-002/Fatima/NeMo/examples/nlp/entity_linking/data/umls_dataset_processing.py" --project_dir "./NeMo/examples/nlp/entity_linking"
!python self_alignment_pretraining.py project_dir=.

However, as soon as epoch 0 reaches 1% I get this error:

Epoch 0: 1%| | 16870/2147821 [5:23:29<681:01:39, 1.15s/it, loss=0.093, v_num=
Epoch 0, global step 7000: 'val_loss' reached 0.31622 (best 0.31622), saving model to '/home/ml-002/Fatima/NeMo/examples/nlp/entity_linking/medical_entity_linking_experiments/sap_bert_umls/2022-07-07_20-20-04/checkpoints/sap_bert_umls--val_loss=0.3162-epoch=0.ckpt' as top 3
Epoch 0: 1%| | 17283/2147821 [5:25:50<669:28:28, 1.13s/it, loss=0.0938, v_num
[NeMo I 2022-07-08 01:46:18 multi_similarity_loss:91] Encountered zero loss in multisimloss, loss = 0.0. No hard examples found in the batch
Error executing job with overrides: ['project_dir=.']
Traceback (most recent call last):
  File "self_alignment_pretraining.py", line 38, in main
    trainer.fit(model)
  File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
    results = self._run_stage()
  File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
    return self._run_train()
  File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1353, in _run_train
    self.fit_loop.run()
  File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 269, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 208, in advance
    batch_output = self.batch_loop.run(batch, batch_idx)
  File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(split_batch, optimizers, batch_idx)
  File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 207, in advance
    self.optimizer_idx,
  File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 256, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
  File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 378, in _optimizer_step
    using_lbfgs=is_lbfgs,
  File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1595, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1646, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 168, in step
    step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/strategies/ddp.py", line 286, in optimizer_step
    optimizer_output = super().optimizer_step(optimizer, opt_idx, closure, model, **kwargs)
  File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/strategies/strategy.py", line 193, in optimizer_step
    return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
  File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/plugins/precision/native_amp.py", line 85, in optimizer_step
    closure_result = closure()
  File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 148, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 134, in closure
    step_output = self._step_fn()
  File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 437, in _training_step
    training_step_output, self.trainer.accumulate_grad_batches
  File "/home/ml-002/anaconda3/envs/py39_ft/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 75, in from_training_step_output
    "In automatic_optimization, when training_step returns a dict, the 'loss' key needs to be present"
pytorch_lightning.utilities.exceptions.MisconfigurationException: In automatic_optimization, when training_step returns a dict, the 'loss' key needs to be present

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Epoch 0: 1%| | 17283/2147821 [5:25:53<669:33:18, 1.13s/it, loss=0.0938, v_num

I checked the file, and training_step does return a 'loss' key. What is the reason behind this error? How do I catch this exception and continue training? Please help, I have been stuck on this for two weeks.

I'm using Ubuntu.

FatimaArshad-DS · Jul 08 '22

We also encounter the same bug with the full UMLS dataset.

The small tutorial performs as expected, however.

emerson-h · Jul 26 '22

I encountered the same bug. The only thing I changed was the training and validation batch sizes (to 16) so that they would fit in my available RAM.

fahadd01 · Jul 29 '22

Change the training_step() method in entity_linking_model.py. You can either remove the if condition that assigns None to train_loss, or simply comment it out and assign a very small learning rate to lr. A sketch of the change is shown below.
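
For reference, here is a rough sketch of that change, assuming training_step() in entity_linking_model.py follows the structure of the NeMo entity-linking example (variable names, the forward/loss signatures, and the logging calls are paraphrased from memory and may not match your checkout exactly). The "Encountered zero loss in multisimloss" message in the log above is the trigger: batches with no hard examples produce a loss of 0, which the original if condition converts to None before returning.

```python
# Sketch only -- paraphrased, not the verbatim NeMo source.
def training_step(self, batch, batch_idx):
    input_ids, token_type_ids, attention_mask, concept_ids = batch
    logits = self.forward(
        input_ids=input_ids,
        token_type_ids=token_type_ids,
        attention_mask=attention_mask,
    )
    train_loss = self.loss(logits=logits, labels=concept_ids)

    # Original branch (commented out): when the multi-similarity loss finds no
    # hard examples in the batch it returns 0, and this branch replaced the
    # loss with None. Returning {"loss": None} under automatic optimization is
    # what raises the MisconfigurationException seen above.
    # if train_loss == 0:
    #     train_loss = None
    #     lr = None

    # With the branch removed, always report the current learning rate and
    # return the computed loss.
    lr = self._optimizer.param_groups[0]["lr"]
    self.log("train_loss", train_loss)
    self.log("lr", lr, prog_bar=True)

    return {"loss": train_loss, "lr": lr}
```

With this change, batches that contain no hard examples contribute a loss of 0 rather than being skipped, so the dict returned to PyTorch Lightning never carries a None loss.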

FatimaArshad-DS · Aug 01 '22

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] · Oct 01 '22

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] · Oct 09 '22