fairseq
Error while training with translation_multi_simple_epoch using custom datasets
What is your question?
Hello! Some months ago, I successfully used the previous version of fairseq for multilingual translation with custom datasets. I recently noticed there's a new version, so I wanted to test it with the same datasets. Unfortunately, I've run into an error that I hope you can help me with. I'm not sure whether it's a bug or whether I'm doing something wrong.
Code
For my purposes, I'm trying to train a one-to-many model for the language pairs "orig-simp,orig-para,orig-split,orig-comp". These are not really "languages", but monolingual English data for different text-to-text generation tasks.
This is the preprocessing step:
fairseq-preprocess --source-lang orig --target-lang simp \
--trainpref "${data_dir}/train.bpe.orig-simp" \
--validpref "${data_dir}/valid.bpe.orig-simp" \
--testpref "${data_dir}/test.bpe.orig-simp" \
--joined-dictionary \
--destdir "${data_dir}/bin" \
--workers 10
for tgt in "para" "split" "comp"; do
fairseq-preprocess --source-lang orig --target-lang "${tgt}" \
--trainpref "${data_dir}/train.bpe.orig-${tgt}" \
--validpref "${data_dir}/valid.bpe.orig-${tgt}" \
--testpref "${data_dir}/test.bpe.orig-${tgt}" \
--joined-dictionary --srcdict "${data_dir}/bin/dict.orig.txt" \
--destdir "${data_dir}/bin" \
--workers 10
done
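For anyone debugging along: this is roughly the file layout those four preprocessing runs should leave in ${data_dir}/bin. It's an illustration only (the names follow fairseq-preprocess's {split}.{src}-{tgt}.{lang}.{bin,idx} convention), which may help when checking whether the multilingual task can find every shard:

```python
# Sketch of the expected binarized file names under ${data_dir}/bin,
# assuming the four fairseq-preprocess runs shown above.
from itertools import product

lang_pairs = [("orig", "simp"), ("orig", "para"), ("orig", "split"), ("orig", "comp")]
splits = ["train", "valid", "test"]

expected = []
for (src, tgt), split in product(lang_pairs, splits):
    for lang in (src, tgt):
        for ext in ("bin", "idx"):
            expected.append(f"{split}.{src}-{tgt}.{lang}.{ext}")

# One dictionary file per language; all identical here because of
# --joined-dictionary and the shared --srcdict.
expected += [f"dict.{lang}.txt" for lang in ("orig", "simp", "para", "split", "comp")]
```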
This is the training step (exactly the same as the example in the repo):
lang_pairs="orig-simp,orig-para,orig-split,orig-comp"
lang_list="${experiment_dir}/langs.txt"
CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train "${data_dir}/bin" \
--encoder-normalize-before --decoder-normalize-before \
--arch transformer --layernorm-embedding \
--task translation_multi_simple_epoch \
--sampling-method "temperature" \
--sampling-temperature 1.5 \
--encoder-langtok "src" \
--decoder-langtok \
--lang-dict "${lang_list}" \
--lang-pairs "${lang_pairs}" \
--criterion label_smoothed_cross_entropy --label-smoothing 0.2 \
--optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' \
--lr-scheduler inverse_sqrt --lr 3e-05 --min-lr -1 --warmup-updates 2500 --max-update 40000 \
--dropout 0.3 --attention-dropout 0.1 --weight-decay 0.0 \
--max-tokens 1024 --update-freq 2 \
--save-interval 1 --save-interval-updates 5000 --keep-interval-updates 10 --no-epoch-checkpoints \
--seed 222 --log-format simple --log-interval 2
In case it's necessary, this is the content of langs.txt:
orig
simp
para
split
comp
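For readers unfamiliar with the task: the entries in langs.txt become language tokens, and --encoder-langtok "src" / --decoder-langtok prepend them to each side of a pair. A minimal illustration, not fairseq code; the "__lang__" token format is assumed from fairseq's default "multilingual" langtok style:

```python
# Illustration of what the langtok flags do to a single example of the
# orig-simp pair: the source-language token goes on the encoder input
# (--encoder-langtok "src"), the target-language token on the decoder
# side (--decoder-langtok).
def add_lang_tokens(src_tokens, tgt_tokens, src_lang, tgt_lang):
    return ([f"__{src_lang}__"] + src_tokens,
            [f"__{tgt_lang}__"] + tgt_tokens)

src, tgt = add_lang_tokens(["the", "cat", "sat"], ["the", "cat", "sat"],
                           "orig", "simp")
```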
When I ran the training command, I got the following error message right after training started:
2020-09-14 20:37:36 | INFO | fairseq.trainer | begin training epoch 1
Traceback (most recent call last):
File "/home/falva/anaconda3/envs/mtl4ts/bin/fairseq-train", line 33, in <module>
sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
File "/experiments/falva/tools/fairseq/fairseq_cli/train.py", line 350, in cli_main
distributed_utils.call_main(args, main)
File "/experiments/falva/tools/fairseq/fairseq/distributed_utils.py", line 240, in call_main
nprocs=args.distributed_num_procs,
File "/home/falva/anaconda3/envs/mtl4ts/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/home/falva/anaconda3/envs/mtl4ts/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:
-- Process 3 terminated with the following error:
Traceback (most recent call last):
File "/home/falva/anaconda3/envs/mtl4ts/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/experiments/falva/tools/fairseq/fairseq/distributed_utils.py", line 224, in distributed_main
main(args, **kwargs)
File "/experiments/falva/tools/fairseq/fairseq_cli/train.py", line 125, in main
valid_losses, should_stop = train(args, trainer, task, epoch_itr)
File "/home/falva/anaconda3/envs/mtl4ts/lib/python3.6/contextlib.py", line 52, in inner
return func(*args, **kwds)
File "/experiments/falva/tools/fairseq/fairseq_cli/train.py", line 203, in train
for i, samples in enumerate(progress):
File "/experiments/falva/tools/fairseq/fairseq/logging/progress_bar.py", line 245, in __iter__
for i, obj in enumerate(self.iterable, start=self.n):
File "/experiments/falva/tools/fairseq/fairseq/data/iterators.py", line 60, in __iter__
for x in self.iterable:
File "/experiments/falva/tools/fairseq/fairseq/data/iterators.py", line 425, in _chunk_iterator
for x in itr:
File "/experiments/falva/tools/fairseq/fairseq/data/iterators.py", line 60, in __iter__
for x in self.iterable:
File "/home/falva/anaconda3/envs/mtl4ts/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
data = self._next_data()
File "/home/falva/anaconda3/envs/mtl4ts/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
return self._process_data(data)
File "/home/falva/anaconda3/envs/mtl4ts/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
data.reraise()
File "/home/falva/anaconda3/envs/mtl4ts/lib/python3.6/site-packages/torch/_utils.py", line 394, in reraise
raise self.exc_type(msg)
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/falva/anaconda3/envs/mtl4ts/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/home/falva/anaconda3/envs/mtl4ts/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/falva/anaconda3/envs/mtl4ts/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/experiments/falva/tools/fairseq/fairseq/data/multilingual/sampled_multi_epoch_dataset.py", line 103, in __getitem__
return super().__getitem__(i)
File "/experiments/falva/tools/fairseq/fairseq/data/multilingual/sampled_multi_dataset.py", line 223, in __getitem__
ret = (ds_idx, self.datasets[ds_idx][ds_sample_idx])
File "/experiments/falva/tools/fairseq/fairseq/data/language_pair_dataset.py", line 264, in __getitem__
tgt_item = self.tgt[index] if self.tgt is not None else None
File "/experiments/falva/tools/fairseq/fairseq/data/prepend_token_dataset.py", line 23, in __getitem__
item = self.dataset[idx]
File "/experiments/falva/tools/fairseq/fairseq/data/indexed_dataset.py", line 230, in __getitem__
ptx = self.cache_index[i]
KeyError: 984451
Any idea what could be the problem? Thanks!
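Editorially, it may help to see why the traceback ends in a bare KeyError with a large integer. fairseq's IndexedCachedDataset keeps a dict (cache_index) mapping example index to a cache offset; if the batching code believes the dataset is larger than the on-disk .idx file actually describes (for instance because the binarized files were written by a different fairseq version), any lookup past the real size fails exactly like this. A minimal dict-based illustration of that failure mode:

```python
# Pretend the on-disk index only describes 1,000 examples, but the sampler
# was built from metadata claiming many more.
cache_index = {i: i * 8 for i in range(1000)}

def fetch(i):
    # Mirrors fairseq's `ptx = self.cache_index[i]`: a plain dict lookup,
    # so an out-of-range index raises a bare KeyError with the index value.
    return cache_index[i]

try:
    fetch(984451)
except KeyError as e:
    msg = str(e)
```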
What have you tried?
The error message is too generic to find anything useful about it on Google. I also searched the issues here, but without success. So I haven't been able to try anything in particular.
What's your environment?
- fairseq Version: 0.9.0
- PyTorch Version: 1.4.0
- OS: Linux
- How you installed fairseq: source
- Build command you used: as in the instructions in README
- Python version: 3.6
- CUDA/cuDNN version: 10.0
- GPU models and configuration:
- Any other relevant information:
This bug is usually caused by preprocessing the data with an old version of preprocess.py and then training with the 0.10 version.
@NonvolatileMemory Do you have any suggestions, other than re-processing the data with the v0.10 preprocess.py?