lightning-transformers
Question answering example throws an exception even if sanity check is skipped
🐛 Bug
Running the SQuAD example
python train.py task=nlp/question_answering dataset=nlp/question_answering/squad trainer.gpus=[1] training.batch_size=8 trainer.num_sanity_val_steps=0
throws an exception while finalizing training. This is not a duplicate of #218.
To Reproduce
Steps to reproduce the behavior:
- Run python train.py task=nlp/question_answering dataset=nlp/question_answering/squad trainer.gpus=[1] training.batch_size=8 trainer.num_sanity_val_steps=0
- See error
Epoch 0: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉| 12442/12445 [44:35<00:00, 4.65it/s, loss=0.957]
Error executing job with overrides: ['task=nlp/question_answering', 'dataset=nlp/question_answering/squad', 'trainer.gpus=[1]', 'training.batch_size=8', 'trainer.num_sanity_val_steps=0']
Traceback (most recent call last):
File "/home/vrt/lightning-transformers/train.py", line 10, in hydra_entry
main(cfg)
File "/home/vrt/lightning-transformers/lightning_transformers/cli/train.py", line 69, in main
run(
File "/home/vrt/lightning-transformers/lightning_transformers/cli/train.py", line 60, in run
trainer.fit(model, datamodule=data_module)
File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in fit
self._call_and_handle_interrupt(
File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
self._dispatch()
File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
self.training_type_plugin.start_training(self)
File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
self._results = trainer.run_stage()
File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
return self._run_train()
File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1319, in _run_train
self.fit_loop.run()
File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
self.epoch_loop.run(data_fetcher)
File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 146, in run
self.on_advance_end()
File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 242, in on_advance_end
self._run_validation()
File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 337, in _run_validation
self.val_loop.run()
File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 151, in run
output = self.on_run_end()
File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 134, in on_run_end
self._on_evaluation_epoch_end()
File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 241, in _on_evaluation_epoch_end
self.trainer.call_hook(hook_name)
File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1501, in call_hook
output = model_fx(*args, **kwargs)
File "/home/vrt/lightning-transformers/lightning_transformers/task/nlp/question_answering/model.py", line 59, in on_validation_epoch_end
metric_dict = self.metric.compute()
File "/home/vrt/miniconda3/lib/python3.9/site-packages/torchmetrics/metric.py", line 380, in wrapped_func
value = compute(*args, **kwargs)
File "/home/vrt/lightning-transformers/lightning_transformers/task/nlp/question_answering/datasets/squad/metric.py", line 23, in compute
example_ids = [reverse_lookup[i.item()] for i in self.example_ids]
File "/home/vrt/lightning-transformers/lightning_transformers/task/nlp/question_answering/datasets/squad/metric.py", line 23, in <listcomp>
example_ids = [reverse_lookup[i.item()] for i in self.example_ids]
KeyError: 0
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Environment
- PyTorch Version: 1.6.0
- OS: Ubuntu 18.04.6 LTS
- How you installed PyTorch: conda
- Python version: 3.9.7
- CUDA/cuDNN version: 11.4
- GPU models and configuration: 2x NVIDIA GeForce RTX 2080 Ti (First device not used)
- Any other relevant information: The same error occurs during the sanity check if trainer.num_sanity_val_steps=-1 is used, as in #184.
Strangely, I got the KeyError: 0 at some point earlier today without using trainer.num_sanity_val_steps=0, but I haven't been able to reproduce it, nor do I get it when adding trainer.num_sanity_val_steps=0 as you say. Could caching be involved?
Ah, never mind, this happens at the evaluation step, so we have to let it finish training the epoch first. I confirm I see this error too.
self.example_id_strings seems to be empty at the time we use it to create reverse_lookup, which will also be empty.
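To illustrate the failure mode, here is a minimal sketch with hypothetical values (the empty example_id_strings dict and the example_ids tensor are stand-ins, not the actual metric state): an empty example_id_strings yields an empty reverse_lookup, so the very first lookup fails with KeyError: 0, exactly as in the traceback above.

```python
import torch

# Hypothetical stand-in for the metric state at validation epoch end; the real
# values live on the SQuAD metric in lightning-transformers.
example_id_strings = {}  # assumed empty, as described above

# reverse_lookup is derived from example_id_strings, so it ends up empty too.
reverse_lookup = {v: k for k, v in example_id_strings.items()}

# Hypothetical indices accumulated by the metric during validation.
example_ids = torch.tensor([0, 1, 2])

try:
    # Mirrors metric.py line 23 from the traceback above.
    _ = [reverse_lookup[i.item()] for i in example_ids]
except KeyError as err:
    print(f"KeyError: {err}")  # -> KeyError: 0, matching the report
```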
I'm attempting to fix this issue in PR #235.
@SeanNaren ^^ :rabbit:
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Bad bot.
Strangely, I can't close this issue myself?
The QA task is really broken... I don't have time to debug it, but if anyone can help, I'd appreciate it!
@mariomeissner, would you be interested in diving in and debugging this issue? :rabbit:
I've been away for a while and don't know the current situation. Was PR #235 not enough? I'd be happy to dig into this again if you point me in some direction 😄
I'd say the best thing would be to just check it out :)