Error while training with translation_multi_simple_epoch using custom datasets

What is your question?

Hello! Some months ago, I successfully used the previous version for multilingual translation with custom datasets. I recently noticed there's a new one, so I wanted to test it with the same datasets. Unfortunately, I've come across an error that I hope you can help me with. I'm not sure whether it's a bug or whether I'm doing something wrong.


For my purposes, I'm trying to train a one-to-many model for language pairs: "orig-simp,orig-para,orig-split,orig-comp". These are not really "languages", but monolingual English data for different text-to-text generation tasks.

This is the preprocessing step:

fairseq-preprocess --source-lang orig --target-lang simp \
                   --trainpref "${data_dir}/train.bpe.orig-simp" \
                   --validpref "${data_dir}/valid.bpe.orig-simp" \
                   --testpref "${data_dir}/test.bpe.orig-simp" \
                   --joined-dictionary \
                   --destdir "${data_dir}/bin" \
                   --workers 10

# Binarize the remaining pairs, reusing the dictionary built for orig-simp
for tgt in "para" "split" "comp"; do
  fairseq-preprocess --source-lang orig --target-lang "${tgt}" \
                     --trainpref "${data_dir}/train.bpe.orig-${tgt}" \
                     --validpref "${data_dir}/valid.bpe.orig-${tgt}" \
                     --testpref "${data_dir}/test.bpe.orig-${tgt}" \
                     --joined-dictionary --srcdict "${data_dir}/bin/dict.orig.txt" \
                     --destdir "${data_dir}/bin" \
                     --workers 10
done

This is the training step (exactly the same as in the example in the repo):


CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train "${data_dir}/bin" \
  --encoder-normalize-before --decoder-normalize-before \
  --arch transformer --layernorm-embedding \
  --task translation_multi_simple_epoch \
  --sampling-method "temperature" \
  --sampling-temperature 1.5 \
  --encoder-langtok "src" \
  --decoder-langtok \
  --lang-dict "${lang_list}" \
  --lang-pairs "${lang_pairs}" \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.2 \
  --optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' \
  --lr-scheduler inverse_sqrt --lr 3e-05 --min-lr -1 --warmup-updates 2500 --max-update 40000 \
  --dropout 0.3 --attention-dropout 0.1 --weight-decay 0.0 \
  --max-tokens 1024 --update-freq 2 \
  --save-interval 1 --save-interval-updates 5000 --keep-interval-updates 10 --no-epoch-checkpoints \
  --seed 222 --log-format simple --log-interval 2

In case it's necessary, this is the content of langs.txt:


When I ran the training command, I got the following error message before training began:

2020-09-14 20:37:36 | INFO | fairseq.trainer | begin training epoch 1
Traceback (most recent call last):
  File "/home/falva/anaconda3/envs/mtl4ts/bin/fairseq-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
  File "/experiments/falva/tools/fairseq/fairseq_cli/train.py", line 350, in cli_main
    distributed_utils.call_main(args, main)
  File "/experiments/falva/tools/fairseq/fairseq/distributed_utils.py", line 240, in call_main
  File "/home/falva/anaconda3/envs/mtl4ts/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/falva/anaconda3/envs/mtl4ts/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "/home/falva/anaconda3/envs/mtl4ts/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/experiments/falva/tools/fairseq/fairseq/distributed_utils.py", line 224, in distributed_main
    main(args, **kwargs)
  File "/experiments/falva/tools/fairseq/fairseq_cli/train.py", line 125, in main
    valid_losses, should_stop = train(args, trainer, task, epoch_itr)
  File "/home/falva/anaconda3/envs/mtl4ts/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/experiments/falva/tools/fairseq/fairseq_cli/train.py", line 203, in train
    for i, samples in enumerate(progress):
  File "/experiments/falva/tools/fairseq/fairseq/logging/progress_bar.py", line 245, in __iter__
    for i, obj in enumerate(self.iterable, start=self.n):
  File "/experiments/falva/tools/fairseq/fairseq/data/iterators.py", line 60, in __iter__
    for x in self.iterable:
  File "/experiments/falva/tools/fairseq/fairseq/data/iterators.py", line 425, in _chunk_iterator
    for x in itr:
  File "/experiments/falva/tools/fairseq/fairseq/data/iterators.py", line 60, in __iter__
    for x in self.iterable:
  File "/home/falva/anaconda3/envs/mtl4ts/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/falva/anaconda3/envs/mtl4ts/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/home/falva/anaconda3/envs/mtl4ts/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
  File "/home/falva/anaconda3/envs/mtl4ts/lib/python3.6/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/falva/anaconda3/envs/mtl4ts/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/falva/anaconda3/envs/mtl4ts/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/falva/anaconda3/envs/mtl4ts/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/experiments/falva/tools/fairseq/fairseq/data/multilingual/sampled_multi_epoch_dataset.py", line 103, in __getitem__
    return super().__getitem__(i)
  File "/experiments/falva/tools/fairseq/fairseq/data/multilingual/sampled_multi_dataset.py", line 223, in __getitem__
    ret = (ds_idx, self.datasets[ds_idx][ds_sample_idx])
  File "/experiments/falva/tools/fairseq/fairseq/data/language_pair_dataset.py", line 264, in __getitem__
    tgt_item = self.tgt[index] if self.tgt is not None else None
  File "/experiments/falva/tools/fairseq/fairseq/data/prepend_token_dataset.py", line 23, in __getitem__
    item = self.dataset[idx]
  File "/experiments/falva/tools/fairseq/fairseq/data/indexed_dataset.py", line 230, in __getitem__
    ptx = self.cache_index[i]
KeyError: 984451

Any idea what could be the problem? Thanks!

What have you tried?

The error message is too general for a Google search to turn up anything useful. I also searched the issues here, without success. So I haven't been able to try anything in particular.

What's your environment?

  • fairseq Version: 0.9.0
  • PyTorch Version: 1.4.0
  • OS: Linux
  • How you installed fairseq: source
  • Build command you used: as in the instructions in README
  • Python version: 3.6
  • CUDA/cuDNN version: 10.0
  • GPU models and configuration:
  • Any other relevant information:

This bug is usually caused by preprocessing the data with an old version of preprocess.py and then training with version 0.10.
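For context, the last frame of the traceback fails on a plain dictionary lookup, `self.cache_index[i]`. Below is a toy sketch of that failure mode, assuming the cache only holds indices that were passed to a prefetch step beforehand. `ToyCachedDataset` and its methods are hypothetical stand-ins written for illustration, not fairseq's actual code, and whether the version mismatch leads to exactly this path is an assumption.

```python
class ToyCachedDataset:
    """Hypothetical stand-in for a dataset that serves items from a prefetch cache."""

    def __init__(self, items):
        self.items = list(items)
        self.cache = []
        self.cache_index = {}  # maps item index -> position in self.cache

    def prefetch(self, indices):
        # Only the indices listed here become readable afterwards.
        wanted = sorted(set(indices))
        self.cache = [self.items[i] for i in wanted]
        self.cache_index = {i: pos for pos, i in enumerate(wanted)}

    def __getitem__(self, i):
        ptx = self.cache_index[i]  # raises KeyError if i was never prefetched
        return self.cache[ptx]


ds = ToyCachedDataset(["a", "b", "c", "d"])
ds.prefetch([0, 2])
print(ds[2])  # index 2 was prefetched, so this is served from the cache

try:
    ds[3]  # index 3 was never prefetched
except KeyError as err:
    print("KeyError:", err)
```

In this sketch the lookup fails for any index that the prefetch step did not cover, which matches the shape of `KeyError: 984451` above.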

@NonvolatileMemory Do you have any suggestions other than re-processing the data with the v0.10 preprocess.py?

