PyTorch KeyError for batch index with multiple validation sets
Describe the bug
I'm trying to train a transformer model for machine translation, following the examples in the docs. After many attempts I managed to get training to start, but during the first validation pass it processes the first validation dataset successfully and then throws a KeyError for the second dataset (no matter which dataset comes second):
Error executing job with overrides: []
Traceback (most recent call last):
File "/home/user/exp/nemo/ende/enc_dec_nmt.py", line 144, in main
raise err
File "/home/user/exp/nemo/ende/enc_dec_nmt.py", line 136, in main
trainer.fit(mt_model)
File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
self._call_and_handle_interrupt(
File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 737, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1168, in _run
results = self._run_stage()
File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1254, in _run_stage
return self._run_train()
File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1285, in _run_train
self.fit_loop.run()
File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 270, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 201, in run
self.on_advance_end()
File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 241, in on_advance_end
self._run_validation()
File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 299, in _run_validation
self.val_loop.run()
File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 155, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 127, in advance
batch = next(data_fetcher)
File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/utilities/fetching.py", line 185, in __next__
return self.fetching_function()
File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/utilities/fetching.py", line 264, in fetching_function
self._fetch_next_batch(self.dataloader_iter)
File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/pytorch_lightning/utilities/fetching.py", line 278, in _fetch_next_batch
batch = next(iterator)
File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
data = self._next_data()
File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1376, in _next_data
return self._process_data(data)
File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1402, in _process_data
data.reraise()
File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/torch/_utils.py", line 461, in reraise
raise exception
KeyError: Caught KeyError in DataLoader worker process 13.
Original Traceback (most recent call last):
File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
data = fetcher.fetch(index)
File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/user/miniconda3/envs/nemo/lib/python3.9/site-packages/nemo/collections/nlp/data/machine_translation/machine_translation_dataset.py", line 140,
in __getitem__
src_ids = self.batches[idx]["src"]
KeyError: 118
This seems to happen when I specify multiple validation datasets as a list; when specifying only one, training doesn't crash (or at least hasn't yet!). All train/validation/test files are in plain text format.
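The failing line is src_ids = self.batches[idx]["src"], so it looks as though the sampler for the second dataloader requests a batch index that was never built into the dataset's internal batches mapping. For illustration only (this is my own toy code, not NeMo's; the class and the length mismatch are assumptions), the following reproduces the same worker-side KeyError when a map-style dataset reports a length larger than the number of pre-built batches it actually holds:

import torch
from torch.utils.data import Dataset, DataLoader

class PrebatchedDataset(Dataset):
    # Toy stand-in for a dataset that pre-builds batches into a dict keyed by batch index.
    def __init__(self, num_built_batches, reported_len):
        # Only num_built_batches entries actually exist in self.batches ...
        self.batches = {i: {"src": torch.zeros(4, dtype=torch.long)} for i in range(num_built_batches)}
        self.reported_len = reported_len

    def __len__(self):
        # ... but the dataset claims to be longer, so the sampler will
        # eventually request an index that has no entry.
        return self.reported_len

    def __getitem__(self, idx):
        return self.batches[idx]["src"]  # KeyError for any idx >= num_built_batches

if __name__ == "__main__":
    loader = DataLoader(PrebatchedDataset(num_built_batches=100, reported_len=150),
                        batch_size=None, num_workers=2)
    for _ in loader:  # fails with "KeyError: Caught KeyError in DataLoader worker process ..."
        pass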
Steps/Code to reproduce bug
This is my full YAML config.
name: nemo_ende_6x6_0.1_2k
do_training: true # set to false if only preprocessing data
do_testing: true # set to true to run evaluation on test data after training

model:
  beam_size: 1
  len_pen: 0.6
  multilingual: false
  max_generation_delta: -1
  label_smoothing: 0.1
  shared_tokenizer: true # train tokenizer model across src and tgt train data
  preproc_out_dir: data_output # path to store data preprocessing outputs
  src_language: 'en'
  tgt_language: 'de'
  shared_embeddings: false

  train_ds:
    src_file_name: data/train/train_5m.en
    tgt_file_name: data/train/train_5m.de
    use_tarred_dataset: true # if true tar_file_name and meta_file_name will be used (or created automatically)
    # config for preprocessing training data and creating a tarred dataset automatically
    tar_file_prefix: train_5m # prefix for tar file names
    tar_files: null # if data has already been preprocessed (rest of config ignored)
    metadata_file: data_output/metadata.tokens.2000.json # metadata for tarred dataset
    lines_per_dataset_fragment: 1000000 # Number of lines to consider for bucketing and padding
    num_batches_per_tarfile: 100 # Number of batches (pickle files) within each tarfile
    tar_shuffle_n: 100 # How many samples to look ahead and load to be shuffled
    shard_strategy: scatter # tarred dataset shard distribution strategy
    n_preproc_jobs: -2 # number of processes to use for data preprocessing (-2 means all but 2)
    tokens_in_batch: 2000
    clean: true
    max_seq_length: 512
    shuffle: true
    num_samples: -1
    drop_last: false
    pin_memory: false
    num_workers: 16
    concat_sampling_technique: temperature # only used with ConcatTranslationDataset
    concat_sampling_temperature: 5 # only used with ConcatTranslationDataset
    concat_sampling_probabilities: null # only used with ConcatTranslationDataset

  validation_ds:
    src_file_name: [data/dev/dev.ac_dev.en, data/dev/newstest2012.en, data/dev/newstest2016.en, data/dev/test.IWSLT2017.en, data/dev/dev.opus-100-v1.en, data/dev/newstest2014.en, data/dev/test.IWSLT2016.en]
    tgt_file_name: [data/dev/dev.ac_dev.de, data/dev/newstest2012.de, data/dev/newstest2016.de, data/dev/test.IWSLT2017.de, data/dev/dev.opus-100-v1.de, data/dev/newstest2014.de, data/dev/test.IWSLT2016.de]
    tokens_in_batch: 2000
    clean: false
    max_seq_length: 512
    shuffle: false
    num_samples: -1
    drop_last: false
    pin_memory: false
    num_workers: 16

  test_ds:
    src_file_name: [data/test/iwslt.tst2013.en-de.en, data/test/iwslt.tst2014.en-de.en, data/test/newstest2013.en, data/test/newstest2015.en, data/test/test.ac_test.en, data/test/test.opus-100-v1.en]
    tgt_file_name: [data/test/iwslt.tst2013.en-de.de, data/test/iwslt.tst2014.en-de.de, data/test/newstest2013.de, data/test/newstest2015.de, data/test/test.ac_test.de, data/test/test.opus-100-v1.de]
    tokens_in_batch: 2000
    clean: false
    max_seq_length: 512
    shuffle: false
    num_samples: -1
    drop_last: false
    pin_memory: false
    num_workers: 16

  optim:
    name: adam
    lr: 0.001
    betas:
      - 0.9
      - 0.98
    weight_decay: 0.0
    sched:
      name: InverseSquareRootAnnealing
      min_lr: 0.0
      last_epoch: -1
      warmup_ratio: 0.1

  encoder_tokenizer:
    library: yttm
    tokenizer_model: data/train/bpe.32k.model
    vocab_size: 32000 # vocab size for training bpe
    bpe_dropout: 0.0
    vocab_file: null
    special_tokens: null
    training_sample_size: null # valid for sentencepiece tokenizer
    r2l: false

  decoder_tokenizer:
    library: yttm
    tokenizer_model: data/train/bpe.32k.model
    vocab_size: 32000 # vocab size for training bpe
    bpe_dropout: 0.0
    vocab_file: null
    special_tokens: null
    training_sample_size: null # valid for sentencepiece tokenizer
    r2l: false

  encoder:
    library: nemo
    model_name: null
    pretrained: false
    max_sequence_length: 512
    num_token_types: 0
    embedding_dropout: 0.1
    learn_positional_encodings: false
    hidden_size: 1024
    num_layers: 6
    inner_size: 4096
    num_attention_heads: 16
    ffn_dropout: 0.1
    attn_score_dropout: 0.1
    attn_layer_dropout: 0.1
    hidden_act: relu
    mask_future: false
    pre_ln: false
    pre_ln_final_layer_norm: true

  decoder:
    library: nemo
    model_name: null
    pretrained: false
    max_sequence_length: 512
    num_token_types: 0
    embedding_dropout: 0.1
    learn_positional_encodings: false
    hidden_size: 1024
    inner_size: 4096
    num_layers: 6
    num_attention_heads: 16
    ffn_dropout: 0.1
    attn_score_dropout: 0.1
    attn_layer_dropout: 0.1
    hidden_act: relu
    pre_ln: false
    pre_ln_final_layer_norm: true

  head:
    num_layers: 1
    activation: relu
    log_softmax: true
    dropout: 0.0
    use_transformer_init: true

trainer:
  devices: 4
  num_nodes: 1
  max_epochs: 200
  precision: 32 # Should be set to 16 for O1 and O2, default is 16 as PT ignores it when amp_level is O0
  accelerator: gpu
  enable_checkpointing: False
  logger: false
  log_every_n_steps: 50 # Interval of logging.
  val_check_interval: 1000
  benchmark: false
  num_sanity_val_steps: 0

exp_manager:
  name: nemo_ende_6x6_0.1_2k
  files_to_copy: []
  exp_dir: exp
  create_wandb_logger: True
  wandb_logger_kwargs:
    project: nemo
    name: ende-6x6-0.1-2k
  create_checkpoint_callback: True
  checkpoint_callback_params:
    monitor: val_sacreBLEU
    mode: max
    save_top_k: 5
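To surface the error outside of trainer.fit, this is roughly the standalone check I would run (it is not part of the failing run above): it builds the model from the same config, the way the enc_dec_nmt.py example does, and indexes each validation dataset in the main process so the KeyError is not re-raised from a worker. The config path is illustrative, and the assumption that val_dataloader() returns a list of DataLoaders when several validation sets are configured is mine and may not match the exact NeMo internals.

from omegaconf import OmegaConf
from pytorch_lightning import Trainer
from nemo.collections.nlp.models import MTEncDecModel

# Illustrative path; point this at the YAML config above.
cfg = OmegaConf.load("conf/aayn_base.yaml")

trainer = Trainer(devices=1, accelerator="gpu", logger=False, enable_checkpointing=False)
model = MTEncDecModel(cfg.model, trainer=trainer)

# Assumption: with multiple validation sets, val_dataloader() returns a list of DataLoaders.
val_dls = model.val_dataloader()
if not isinstance(val_dls, (list, tuple)):
    val_dls = [val_dls]

for dl_idx, dl in enumerate(val_dls):
    ds = dl.dataset
    # Index the dataset directly (no worker processes) to find the first bad batch index.
    for idx in range(len(ds)):
        try:
            ds[idx]
        except KeyError as err:
            print(f"validation set {dl_idx}: KeyError at batch index {idx}: {err}")
            break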
Environment details
- OS version: CentOS
- PyTorch version: 1.12.1
- Python version: 3.9.0
- NeMo version: installed with pip from source from the main branch very recently (not sure of the exact commit), following the instructions in the GitHub README.
Additional context
Training on 4 NVIDIA T4 GPUs.