how to fix the KeyError: 'prev_output_tokens'

tianshuailu opened this issue 1 year ago · 3 comments

❓ Questions and Help

I was using fairseq-generate to generate en-ml back-translation pairs but got a KeyError. Is it because of the preprocessing of the data?

The commands for preprocessing and generating:

fairseq-preprocess \
    --source-lang en_XX \
    --target-lang ml_XX \
    --only-source \
    --srcdict /srv/scratch3/ltian/new/bt_nmt2/en4/dict.txt \
    --tgtdict /srv/scratch3/ltian/new/bt_nmt2/en4/dict.txt \
    --testpref /srv/scratch3/ltian/new/bt_nmt2/en4/train4.spm \
    --destdir /srv/scratch3/ltian/new/bt_nmt2/data_bt4 \
    --workers 70

fairseq-generate /srv/scratch3/ltian/new/bt_nmt2/data_bt1 \
    --path /srv/scratch3/ltian/new/nmt1.pt \
    --results-path /srv/scratch3/ltian/new/bt_nmt2/result1 \
    --task translation_from_pretrained_bart \
    --gen-subset test \
    -t ml_XX -s en_XX \
    --scoring sacrebleu \
    --bpe 'sentencepiece' \
    --sentencepiece-model /srv/scratch3/ltian/sentence.bpe.model \
    --batch-size 32 \
    --langs ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,ml_XX,no_ML,no_HI

The error message:

Traceback (most recent call last):
  File "/home/user/ltian/anaconda3/bin/fairseq-generate", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-generate')())
  File "/srv/scratch3/ltian/NMT-Adapt/NMTAdapt1/fairseq_cli/generate.py", line 320, in cli_main
    main(args)
  File "/srv/scratch3/ltian/NMT-Adapt/NMTAdapt1/fairseq_cli/generate.py", line 38, in main
    return _main(args, h)
  File "/srv/scratch3/ltian/NMT-Adapt/NMTAdapt1/fairseq_cli/generate.py", line 176, in _main
    for sample in progress:
  File "/home/user/ltian/anaconda3/lib/python3.7/site-packages/tqdm/_tqdm.py", line 1022, in __iter__
    for obj in iterable:
  File "/srv/scratch3/ltian/NMT-Adapt/NMTAdapt1/fairseq/data/iterators.py", line 60, in __iter__
    for x in self.iterable:
  File "/srv/scratch3/ltian/NMT-Adapt/NMTAdapt1/fairseq/data/iterators.py", line 559, in __next__
    raise item
  File "/srv/scratch3/ltian/NMT-Adapt/NMTAdapt1/fairseq/data/iterators.py", line 493, in run
    for item in self._source:
  File "/home/user/ltian/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/user/ltian/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/home/user/ltian/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/home/user/ltian/anaconda3/lib/python3.7/site-packages/torch/_utils.py", line 425, in reraise
    raise self.exc_type(msg)
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/user/ltian/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/user/ltian/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/srv/scratch3/ltian/NMT-Adapt/NMTAdapt1/fairseq/data/language_pair_dataset.py", line 388, in collater
    pad_to_multiple=self.pad_to_multiple,
  File "/srv/scratch3/ltian/NMT-Adapt/NMTAdapt1/fairseq/data/language_pair_dataset.py", line 155, in collate
    batch['net_input']['prev_output_tokens'][:,:-1] = batch['net_input']['prev_output_tokens'][:,1:].clone().detach()
KeyError: 'prev_output_tokens'

Any comments would be much appreciated!

tianshuailu, Aug 27 '22

prev_output_tokens is created from the target, so I guess --only-source prevents it from being built. (I mean, generate writes out the reference target as well, so it needs a target.)

By the way, prev_output_tokens is only used when you force decoding with --prefix-size N, which defaults to 0. The error is raised anyway because the dataset collates it regardless of whether it will actually be used.
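
For reference, the relevant collate logic looks roughly like this. It is a simplified sketch, not the actual fairseq code; pad_batch and collate_sketch are made-up names for illustration:

import torch

def pad_batch(seqs, pad_idx):
    # Right-pad a list of 1-D LongTensors to a common length.
    out = torch.full((len(seqs), max(s.numel() for s in seqs)), pad_idx, dtype=torch.long)
    for i, s in enumerate(seqs):
        out[i, : s.numel()] = s
    return out

def collate_sketch(samples, pad_idx):
    # 'prev_output_tokens' is only built when the samples carry a target.
    batch = {"net_input": {"src_tokens": pad_batch([s["source"] for s in samples], pad_idx)}}
    if samples[0].get("target") is not None:
        tgt = [s["target"] for s in samples]
        batch["target"] = pad_batch(tgt, pad_idx)
        # Decoder input: each target shifted right by one step, with the
        # final EOS rotated to the front (teacher forcing / forced prefix).
        batch["net_input"]["prev_output_tokens"] = pad_batch(
            [torch.cat([t[-1:], t[:-1]]) for t in tgt], pad_idx
        )
    # Binarizing with --only-source leaves the target as None, so a later
    # lookup of batch['net_input']['prev_output_tokens'] raises this KeyError.
    return batch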

gmryu, Aug 29 '22

Hi,

I get the same error, but I don't specify --only-source or --prefix-size N anywhere. I checked the dictionary passed to forward, and 'target' is None, so I guess that's what it's complaining about. Any chance you could spot whether I'm passing something that could cause this? I use these arguments:

python3 $FAIRSEQ/fairseq_cli/train.py data-bin \
    --langs $langs \
    --source-lang $SRC --target-lang $TGT \
    --log-format simple \
    --log-interval 20 \
    --seed 222 \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.2 \
    --optimizer adam \
    --adam-eps 1e-06 \
    --adam-betas "(0.9, 0.98)" \
    --weight-decay 0.0 \
    --lr-scheduler polynomial_decay \
    --task translation_from_pretrained_bart \
    --eval-bleu --eval-bleu-detok moses \
    --num-workers 8 \
    --max-tokens 512 \
    --validate-interval 1 \
    --arch mbart_large \
    --max-update 150000 \
    --update-freq 8 \
    --lr 3e-05 \
    --min-lr -1 \
    --restore-file checkpoint_last.pt \
    --save-interval 1 \
    --save-interval-updates 500 \
    --keep-interval-updates 1 \
    --no-epoch-checkpoints \
    --warmup-updates 2500 \
    --dropout 0.3 \
    --attention-dropout 0.1 \
    --relu-dropout 0.0 \
    --layernorm-embedding \
    --encoder-learned-pos \
    --decoder-learned-pos \
    --encoder-normalize-before \
    --decoder-normalize-before \
    --skip-invalid-size-inputs-valid-test \
    --share-all-embeddings \
    --finetune-from-mbart-at $MBART \
    --only-finetune-cross-attn \
    --patience 25

theamato, Nov 18 '22

@theamato It is not "train" that is the problem; it happens in "preprocess". You need to check how you prepared your data.
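
You can quickly check what fairseq-preprocess actually wrote, for example with something like this (the directory and language codes below are placeholders, so adjust them to yours; it assumes fairseq's usual {split}.{src}-{tgt}.{lang}.{bin,idx} naming):

import os

destdir = "data-bin"         # the directory you pass to train.py / generate
src, tgt = "en_XX", "ml_XX"  # your --source-lang / --target-lang

# fairseq-preprocess normally writes test.<src>-<tgt>.<lang>.{bin,idx} for
# BOTH languages. If the target-side files are missing (e.g. because of
# --only-source), the dataset has no target and the collater cannot build
# 'prev_output_tokens'.
for lang in (src, tgt):
    for ext in ("bin", "idx"):
        path = os.path.join(destdir, f"test.{src}-{tgt}.{lang}.{ext}")
        print(path, "->", "found" if os.path.exists(path) else "MISSING")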

gmryu, Nov 21 '22