
PyTorch KeyError for batch index with multiple validation sets

Open · Numeri opened this issue 2 years ago · 0 comments

Describe the bug

I am trying to train a transformer model for machine translation, following the examples in the docs. After many attempts I got it to start training, but during the first validation run it processes the first validation dataset successfully and then throws a KeyError for the second dataset (no matter which dataset comes second):

Error executing job with overrides: []
Traceback (most recent call last):
  File "/home/user/exp/nemo/ende/enc_dec_nmt.py", line 144, in main
    raise err
  File "/home/user/exp/nemo/ende/enc_dec_nmt.py", line 136, in main
    trainer.fit(mt_model)
  File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 737, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1168, in _run
    results = self._run_stage()
  File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1254, in _run_stage
    return self._run_train()
  File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1285, in _run_train
    self.fit_loop.run()
  File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 270, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 201, in run
    self.on_advance_end()
  File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 241, in on_advance_end
    self._run_validation()
  File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 299, in _run_validation
    self.val_loop.run()
  File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 155, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 127, in advance
    batch = next(data_fetcher)
  File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/utilities/fetching.py", line 185, in __next__
    return self.fetching_function()
  File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/utilities/fetching.py", line 264, in fetching_function
    self._fetch_next_batch(self.dataloader_iter)
  File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/utilities/fetching.py", line 278, in _fetch_next_batch
    batch = next(iterator)
  File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1376, in _next_data
    return self._process_data(data)
  File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1402, in _process_data
    data.reraise()
  File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/torch/_utils.py", line 461, in reraise
    raise exception
KeyError: Caught KeyError in DataLoader worker process 13.
Original Traceback (most recent call last):
  File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/nemo/collections/nlp/data/machine_translation/machine_translation_dataset.py", line 140,
in __getitem__
    src_ids = self.batches[idx]["src"]
KeyError: 118

This seems to happen when I specify multiple validation datasets as a list; when I specify only one, training doesn't crash (or at least hasn't yet!). All train/validation/test files are in plain text format (i.e. raw text).
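
My guess, which I have not verified against the NeMo code, is that for the second validation set the number of indices the sampler produces does not match the number of entries actually built in that dataset's self.batches, so an index like 118 has no corresponding key. The snippet below is only a minimal illustration of that failure mode in plain PyTorch (the class name and sizes are made up for the example, it is not NeMo code), but it produces exactly the same kind of error inside a DataLoader worker:

import torch
from torch.utils.data import DataLoader, Dataset

class ToyBatchDataset(Dataset):
    """Stand-in for a dataset that stores pre-built batches keyed by index."""

    def __init__(self, batches_built: int, length_reported: int):
        # Only `batches_built` entries actually exist ...
        self.batches = {i: {"src": torch.zeros(4, dtype=torch.long)} for i in range(batches_built)}
        # ... but __len__ claims more, as if it were computed from a different dataset.
        self.length_reported = length_reported

    def __len__(self):
        return self.length_reported

    def __getitem__(self, idx):
        # Same access pattern as machine_translation_dataset.py line 140 in the traceback.
        return self.batches[idx]["src"]

if __name__ == "__main__":
    ds = ToyBatchDataset(batches_built=100, length_reported=200)
    loader = DataLoader(ds, batch_size=None, num_workers=2)
    # Indices 100..199 have no entry in ds.batches, so a worker process raises
    # "KeyError: Caught KeyError in DataLoader worker process ...", just like above.
    for _ in loader:
        pass

With a single validation file the reported length and the built batches presumably agree, which would explain why one dataset trains fine while a list of them crashes.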

Steps/Code to reproduce bug

This is my full YAML config; the approximate command used to launch it follows after the config.

name: nemo_ende_6x6_0.1_2k
do_training: true # set to false if only preprocessing data
do_testing: true # set to true to run evaluation on test data after training

model:
  beam_size: 1
  len_pen: 0.6
  multilingual: false
  max_generation_delta: -1
  label_smoothing: 0.1
  shared_tokenizer: true # train tokenizer model across src and tgt train data
  preproc_out_dir: data_output # path to store data preprocessing outputs
  src_language: 'en'
  tgt_language: 'de'
  shared_embeddings: false

  train_ds:
    src_file_name: data/train/train_5m.en
    tgt_file_name: data/train/train_5m.de
    use_tarred_dataset: true # if true tar_file_name and meta_file_name will be used (or created automatically)
    # config for preprocessing training data and creating a tarred dataset automatically
    tar_file_prefix: train_5m # prefix for tar file names
    tar_files: null # if data has already been preprocessed (rest of config ignored)
    metadata_file: data_output/metadata.tokens.2000.json # metadata for tarred dataset
    lines_per_dataset_fragment: 1000000 # Number of lines to consider for bucketing and padding
    num_batches_per_tarfile: 100 # Number of batches (pickle files) within each tarfile
    tar_shuffle_n: 100 # How many samples to look ahead and load to be shuffled
    shard_strategy: scatter # tarred dataset shard distribution strategy
    n_preproc_jobs: -2 # number of processes to use for data preprocessing (-2 means all but 2)
    tokens_in_batch: 2000
    clean: true
    max_seq_length: 512
    shuffle: true
    num_samples: -1
    drop_last: false
    pin_memory: false
    num_workers: 16
    concat_sampling_technique: temperature # only used with ConcatTranslationDataset
    concat_sampling_temperature: 5 # only used with ConcatTranslationDataset
    concat_sampling_probabilities: null # only used with ConcatTranslationDataset

  validation_ds:
    src_file_name: [data/dev/dev.ac_dev.en, data/dev/newstest2012.en, data/dev/newstest2016.en, data/dev/test.IWSLT2017.en, data/dev/dev.opus-100-v1.en, data/dev/newstest2014.en, data/dev/test.IWSLT2016.en]
    tgt_file_name: [data/dev/dev.ac_dev.de, data/dev/newstest2012.de, data/dev/newstest2016.de, data/dev/test.IWSLT2017.de, data/dev/dev.opus-100-v1.de, data/dev/newstest2014.de, data/dev/test.IWSLT2016.de]
    tokens_in_batch: 2000
    clean: false
    max_seq_length: 512
    shuffle: false
    num_samples: -1
    drop_last: false
    pin_memory: false
    num_workers: 16

  test_ds:
    src_file_name: [data/test/iwslt.tst2013.en-de.en, data/test/iwslt.tst2014.en-de.en, data/test/newstest2013.en, data/test/newstest2015.en, data/test/test.ac_test.en, data/test/test.opus-100-v1.en]
    tgt_file_name: [data/test/iwslt.tst2013.en-de.de, data/test/iwslt.tst2014.en-de.de, data/test/newstest2013.de, data/test/newstest2015.de, data/test/test.ac_test.de, data/test/test.opus-100-v1.de]
    tokens_in_batch: 2000
    clean: false
    max_seq_length: 512
    shuffle: false
    num_samples: -1
    drop_last: false
    pin_memory: false
    num_workers: 16

  optim:
    name: adam
    lr: 0.001
    betas:
      - 0.9
      - 0.98
    weight_decay: 0.0
    sched:
      name: InverseSquareRootAnnealing
      min_lr: 0.0
      last_epoch: -1
      warmup_ratio: 0.1

  encoder_tokenizer:
    library: yttm
    tokenizer_model: data/train/bpe.32k.model
    vocab_size: 32000 # vocab size for training bpe
    bpe_dropout: 0.0
    vocab_file: null
    special_tokens: null
    training_sample_size: null # valid for sentencepiece tokenizer
    r2l: false

  decoder_tokenizer:
    library: yttm
    tokenizer_model: data/train/bpe.32k.model
    vocab_size: 32000 # vocab size for training bpe
    bpe_dropout: 0.0
    vocab_file: null
    special_tokens: null
    training_sample_size: null # valid for sentencepiece tokenizer
    r2l: false

  encoder:
    library: nemo
    model_name: null
    pretrained: false
    max_sequence_length: 512
    num_token_types: 0
    embedding_dropout: 0.1
    learn_positional_encodings: false
    hidden_size: 1024
    num_layers: 6
    inner_size: 4096
    num_attention_heads: 16
    ffn_dropout: 0.1
    attn_score_dropout: 0.1
    attn_layer_dropout: 0.1
    hidden_act: relu
    mask_future: false
    pre_ln: false
    pre_ln_final_layer_norm: true

  decoder:
    library: nemo
    model_name: null
    pretrained: false
    max_sequence_length: 512
    num_token_types: 0
    embedding_dropout: 0.1
    learn_positional_encodings: false
    hidden_size: 1024
    inner_size: 4096
    num_layers: 6
    num_attention_heads: 16
    ffn_dropout: 0.1
    attn_score_dropout: 0.1
    attn_layer_dropout: 0.1
    hidden_act: relu
    pre_ln: false
    pre_ln_final_layer_norm: true

  head:
    num_layers: 1
    activation: relu
    log_softmax: true
    dropout: 0.0
    use_transformer_init: true

trainer:
  devices: 4
  num_nodes: 1
  max_epochs: 200
  precision: 32 # Should be set to 16 for O1 and O2, default is 16 as PT ignores it when amp_level is O0
  accelerator: gpu
  enable_checkpointing: False
  logger: false
  log_every_n_steps: 50  # Interval of logging.
  val_check_interval: 1000
  benchmark: false
  num_sanity_val_steps: 0

exp_manager:
  name: nemo_ende_6x6_0.1_2k
  files_to_copy: []
  exp_dir: exp
  create_wandb_logger: True
  wandb_logger_kwargs:
    project: nemo
    name: ende-6x6-0.1-2k
  create_checkpoint_callback: True
  checkpoint_callback_params:
    monitor: val_sacreBLEU
    mode: max
    save_top_k: 5
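
For completeness, I launch training with the Hydra-style entry point shown in the traceback (enc_dec_nmt.py); the invocation below is approximate and the path and config names are placeholders for my local setup:

python enc_dec_nmt.py --config-path=<directory containing the YAML above> --config-name=<name of the YAML file>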

Environment details

  • OS version: CentOS
  • PyTorch version: 1.12.1
  • Python version: 3.9.0
  • NeMo version: installed from source (main branch) with pip very recently, following the instructions in the GitHub README; not sure of the exact commit.

Additional context

Training on 4 NVIDIA T4 GPUs.

Numeri · Aug 25 '22, 12:08