
Error while training with translation_multi_simple_epoch using custom datasets

Open feralvam opened this issue 3 years ago • 3 comments

What is your question?

Hello! Some months ago, I successfully used the previous version for multilingual translation with custom datasets. I recently noticed there's a new one, so I wanted to test it with the same datasets. Unfortunately, I've come across an error that I hope you can help me with. I'm not sure if it's a bug or if I'm doing something wrong.

Code

For my purposes, I'm trying to train a one-to-many model for the language pairs "orig-simp,orig-para,orig-split,orig-comp". These are not really "languages", but monolingual English data for different text-to-text generation tasks.

This is the preprocessing step:

fairseq-preprocess --source-lang orig --target-lang simp \
                   --trainpref "${data_dir}/train.bpe.orig-simp" \
                   --validpref "${data_dir}/valid.bpe.orig-simp" \
                   --testpref "${data_dir}/test.bpe.orig-simp" \
                   --joined-dictionary \
                   --destdir "${data_dir}/bin" \
                   --workers 10

for tgt in "para" "split" "comp"; do
  fairseq-preprocess --source-lang orig --target-lang "${tgt}" \
                   --trainpref "${data_dir}/train.bpe.orig-${tgt}" \
                   --validpref "${data_dir}/valid.bpe.orig-${tgt}" \
                   --testpref "${data_dir}/test.bpe.orig-${tgt}" \
                   --joined-dictionary --srcdict "${data_dir}/bin/dict.orig.txt" \
                   --destdir "${data_dir}/bin" \
                   --workers 10
done

This is the training step (exactly the same as the example in the repo):

lang_pairs="orig-simp,orig-para,orig-split,orig-comp"
lang_list="${experiment_dir}/langs.txt"

CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train "${data_dir}/bin" \
  --encoder-normalize-before --decoder-normalize-before \
  --arch transformer --layernorm-embedding \
  --task translation_multi_simple_epoch \
  --sampling-method "temperature" \
  --sampling-temperature 1.5 \
  --encoder-langtok "src" \
  --decoder-langtok \
  --lang-dict "${lang_list}" \
  --lang-pairs "${lang_pairs}" \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.2 \
  --optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' \
  --lr-scheduler inverse_sqrt --lr 3e-05 --min-lr -1 --warmup-updates 2500 --max-update 40000 \
  --dropout 0.3 --attention-dropout 0.1 --weight-decay 0.0 \
  --max-tokens 1024 --update-freq 2 \
  --save-interval 1 --save-interval-updates 5000 --keep-interval-updates 10 --no-epoch-checkpoints \
  --seed 222 --log-format simple --log-interval 2

In case it's necessary, this is the content of langs.txt:

orig
simp
para
split
comp

When I ran the training command, I got the following error message before training began:

2020-09-14 20:37:36 | INFO | fairseq.trainer | begin training epoch 1
Traceback (most recent call last):
  File "/home/falva/anaconda3/envs/mtl4ts/bin/fairseq-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
  File "/experiments/falva/tools/fairseq/fairseq_cli/train.py", line 350, in cli_main
    distributed_utils.call_main(args, main)
  File "/experiments/falva/tools/fairseq/fairseq/distributed_utils.py", line 240, in call_main
    nprocs=args.distributed_num_procs,
  File "/home/falva/anaconda3/envs/mtl4ts/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/falva/anaconda3/envs/mtl4ts/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception: 

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "/home/falva/anaconda3/envs/mtl4ts/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/experiments/falva/tools/fairseq/fairseq/distributed_utils.py", line 224, in distributed_main
    main(args, **kwargs)
  File "/experiments/falva/tools/fairseq/fairseq_cli/train.py", line 125, in main
    valid_losses, should_stop = train(args, trainer, task, epoch_itr)
  File "/home/falva/anaconda3/envs/mtl4ts/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/experiments/falva/tools/fairseq/fairseq_cli/train.py", line 203, in train
    for i, samples in enumerate(progress):
  File "/experiments/falva/tools/fairseq/fairseq/logging/progress_bar.py", line 245, in __iter__
    for i, obj in enumerate(self.iterable, start=self.n):
  File "/experiments/falva/tools/fairseq/fairseq/data/iterators.py", line 60, in __iter__
    for x in self.iterable:
  File "/experiments/falva/tools/fairseq/fairseq/data/iterators.py", line 425, in _chunk_iterator
    for x in itr:
  File "/experiments/falva/tools/fairseq/fairseq/data/iterators.py", line 60, in __iter__
    for x in self.iterable:
  File "/home/falva/anaconda3/envs/mtl4ts/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/falva/anaconda3/envs/mtl4ts/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/home/falva/anaconda3/envs/mtl4ts/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/home/falva/anaconda3/envs/mtl4ts/lib/python3.6/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/falva/anaconda3/envs/mtl4ts/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/falva/anaconda3/envs/mtl4ts/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/falva/anaconda3/envs/mtl4ts/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/experiments/falva/tools/fairseq/fairseq/data/multilingual/sampled_multi_epoch_dataset.py", line 103, in __getitem__
    return super().__getitem__(i)
  File "/experiments/falva/tools/fairseq/fairseq/data/multilingual/sampled_multi_dataset.py", line 223, in __getitem__
    ret = (ds_idx, self.datasets[ds_idx][ds_sample_idx])
  File "/experiments/falva/tools/fairseq/fairseq/data/language_pair_dataset.py", line 264, in __getitem__
    tgt_item = self.tgt[index] if self.tgt is not None else None
  File "/experiments/falva/tools/fairseq/fairseq/data/prepend_token_dataset.py", line 23, in __getitem__
    item = self.dataset[idx]
  File "/experiments/falva/tools/fairseq/fairseq/data/indexed_dataset.py", line 230, in __getitem__
    ptx = self.cache_index[i]
KeyError: 984451

Any idea what could be the problem? Thanks!

What have you tried?

The error message is too generic to find anything useful about it on Google. I also searched the issues here, but without success. So I haven't been able to try anything in particular.

What's your environment?

  • fairseq Version: 0.9.0
  • PyTorch Version: 1.4.0
  • OS: Linux
  • How you installed fairseq: source
  • Build command you used: as in the instructions in the README
  • Python version: 3.6
  • CUDA/cuDNN version: 10.0
  • GPU models and configuration:
  • Any other relevant information:

feralvam avatar Sep 14 '20 20:09 feralvam

This bug is usually caused by preprocessing the data with an old version of preprocess.py and then training with the 0.10 version.
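In case it helps, here is a minimal sketch of what re-binarizing with the same fairseq checkout used for training could look like. It just mirrors the original preprocessing commands; `bin-v010` is a hypothetical fresh output directory so old .bin/.idx files are not mixed with the new ones:

```bash
# Assumed fresh destination directory for the re-binarized data.
new_bin="${data_dir}/bin-v010"

# First pair builds the joined dictionary.
fairseq-preprocess --source-lang orig --target-lang simp \
                   --trainpref "${data_dir}/train.bpe.orig-simp" \
                   --validpref "${data_dir}/valid.bpe.orig-simp" \
                   --testpref "${data_dir}/test.bpe.orig-simp" \
                   --joined-dictionary \
                   --destdir "${new_bin}" \
                   --workers 10

# Remaining pairs reuse that dictionary, as in the original commands.
for tgt in "para" "split" "comp"; do
  fairseq-preprocess --source-lang orig --target-lang "${tgt}" \
                   --trainpref "${data_dir}/train.bpe.orig-${tgt}" \
                   --validpref "${data_dir}/valid.bpe.orig-${tgt}" \
                   --testpref "${data_dir}/test.bpe.orig-${tgt}" \
                   --joined-dictionary --srcdict "${new_bin}/dict.orig.txt" \
                   --destdir "${new_bin}" \
                   --workers 10
done
```

Then point `fairseq-train` at "${new_bin}" instead of the old "${data_dir}/bin".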

NonvolatileMemory avatar Mar 04 '21 15:03 NonvolatileMemory

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

stale[bot] avatar Jun 16 '21 23:06 stale[bot]

@NonvolatileMemory Do you have any suggestions other than re-processing the data with the v0.10 preprocess.py?

bigapple716 avatar Jul 05 '22 23:07 bigapple716