
Question answering example throws an exception even if sanity check is skipped

Open · Pointy-Hat opened this issue 3 years ago · 10 comments

🐛 Bug

Running the SQuAD example python train.py task=nlp/question_answering dataset=nlp/question_answering/squad trainer.gpus=[1] training.batch_size=8 trainer.num_sanity_val_steps=0 throws an exception during the end-of-epoch validation. This is not a duplicate of #218.

To Reproduce

Steps to reproduce the behavior:

  1. Run python train.py task=nlp/question_answering dataset=nlp/question_answering/squad trainer.gpus=[1] training.batch_size=8 trainer.num_sanity_val_steps=0
  2. See error
Epoch 0: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉| 12442/12445 [44:35<00:00,  4.65it/s, loss=0.957]
Error executing job with overrides: ['task=nlp/question_answering', 'dataset=nlp/question_answering/squad', 'trainer.gpus=[1]', 'training.batch_size=8', 'trainer.num_sanity_val_steps=0']
Traceback (most recent call last):
  File "/home/vrt/lightning-transformers/train.py", line 10, in hydra_entry
    main(cfg)
  File "/home/vrt/lightning-transformers/lightning_transformers/cli/train.py", line 69, in main
    run(
  File "/home/vrt/lightning-transformers/lightning_transformers/cli/train.py", line 60, in run
    trainer.fit(model, datamodule=data_module)
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in fit
    self._call_and_handle_interrupt(
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
    self._dispatch()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
    return self._run_train()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1319, in _run_train
    self.fit_loop.run()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
    self.epoch_loop.run(data_fetcher)
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 146, in run
    self.on_advance_end()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 242, in on_advance_end
    self._run_validation()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 337, in _run_validation
    self.val_loop.run()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 151, in run
    output = self.on_run_end()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 134, in on_run_end
    self._on_evaluation_epoch_end()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 241, in _on_evaluation_epoch_end
    self.trainer.call_hook(hook_name)
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1501, in call_hook
    output = model_fx(*args, **kwargs)
  File "/home/vrt/lightning-transformers/lightning_transformers/task/nlp/question_answering/model.py", line 59, in on_validation_epoch_end
    metric_dict = self.metric.compute()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/torchmetrics/metric.py", line 380, in wrapped_func
    value = compute(*args, **kwargs)
  File "/home/vrt/lightning-transformers/lightning_transformers/task/nlp/question_answering/datasets/squad/metric.py", line 23, in compute
    example_ids = [reverse_lookup[i.item()] for i in self.example_ids]
  File "/home/vrt/lightning-transformers/lightning_transformers/task/nlp/question_answering/datasets/squad/metric.py", line 23, in <listcomp>
    example_ids = [reverse_lookup[i.item()] for i in self.example_ids]
KeyError: 0

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
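
For reference, the failing line in lightning_transformers/task/nlp/question_answering/datasets/squad/metric.py builds a reverse lookup from the metric's example_id_strings state and then indexes it with the gathered example_ids tensor. A minimal sketch of the mechanism (the dict contents here are illustrative assumptions, not the actual metric state):

    import torch

    # example_id_strings should map example-id strings to int indices, but
    # in the failing run it appears to be empty, so the reverse lookup is
    # empty too.
    example_id_strings = {}
    reverse_lookup = {v: k for k, v in example_id_strings.items()}

    example_ids = torch.tensor([0, 1, 2])  # indices gathered during validation
    [reverse_lookup[i.item()] for i in example_ids]  # raises KeyError: 0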

Environment

  • PyTorch Version: 1.6.0
  • OS: Ubuntu 18.04.6 LTS
  • How you installed PyTorch: conda
  • Python version: 3.9.7
  • CUDA/cuDNN version: 11.4
  • GPU models and configuration: 2x NVIDIA GeForce RTX 2080 Ti (First device not used)
  • Any other relevant information: The same error occurs during the sanity check if trainer.num_sanity_val_steps=-1 is used, as in #184

Pointy-Hat avatar Feb 22 '22 16:02 Pointy-Hat

Strangely, I got the KeyError: 0 at some point earlier today without using trainer.num_sanity_val_steps=0, but I haven't been able to reproduce it, nor do I get it when adding trainer.num_sanity_val_steps=0 as you say. Could caching be involved?

mariomeissner avatar Mar 19 '22 03:03 mariomeissner

Ah, never mind: this happens at the evaluation step, so we have to let it finish training the epoch first. I can confirm I see this error too.

mariomeissner avatar Mar 19 '22 03:03 mariomeissner

self.example_id_strings seems to be empty at the time we use it to create reverse_lookup, which will also be empty.

mariomeissner avatar Mar 19 '22 07:03 mariomeissner
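
If that is the case, a guard like the following would at least surface the problem earlier with a clearer message. This is only a sketch of a possible mitigation (the helper name is made up for illustration), not the actual change proposed below:

    def lookup_example_ids(example_ids, example_id_strings):
        # Hypothetical guard: fail loudly if the lookup table was never
        # filled, instead of raising an opaque KeyError at the first index.
        if not example_id_strings:
            raise RuntimeError(
                "example_id_strings is empty; the metric state was reset or "
                "never updated before compute() was called"
            )
        reverse_lookup = {v: k for k, v in example_id_strings.items()}
        return [reverse_lookup[i.item()] for i in example_ids]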

I've attempted to fix this issue with PR #235.

mariomeissner avatar Mar 21 '22 07:03 mariomeissner

@SeanNaren ^^ :rabbit:

Borda avatar Apr 10 '22 05:04 Borda

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jun 12 '22 12:06 stale[bot]

Bad bot.

Strangely, I can't close this issue myself?

mariomeissner avatar Jun 12 '22 23:06 mariomeissner

The QA task is really broken... I don't have time to debug it, but if anyone can help, I would appreciate it!

SeanNaren avatar Jun 23 '22 10:06 SeanNaren

@mariomeissner, would you be interested in diving in and debugging this issue? :rabbit:

Borda avatar Sep 14 '22 22:09 Borda

I've been away for a while and don't know the current situation. Was PR #235 not enough? I'd be happy to dig into this again if you point me in some direction 😄

mariomeissner avatar Sep 15 '22 00:09 mariomeissner

> I've been away for a while and don't know the current situation. Was PR #235 not enough? I'd be happy to dig into this again if you point me in some direction 😄

I'd say the best would be just to check it out :)

Borda avatar Nov 07 '22 21:11 Borda