
The use of the --bpe parameter in fairseq-preprocess

Open · Old-Young233 opened this issue 2 years ago · 2 comments

We find that the vocabulary produced by fairseq-preprocess is the same no matter what --bpe is set to (characters / subword_nmt / fastbpe, etc.) when we use the following command.

fairseq-preprocess --source-lang mo --target-lang ch \
    --trainpref data/bpe-dropout/train \
    --validpref data/bpe-dropout/valid \
    --testpref data/bpe-dropout/test \
    --destdir data-bin/bpe-dropout/autovocab \
    --bpe subword-nmt \
    --nwordssrc 22000 \
    --nwordstgt 30000

Here the Mongolian side is not processed in any way and the Chinese side is tokenized. Is the vocabulary that fairseq builds from the corpus word-level in this case? How is the --bpe parameter used?

We also found that when BPE is first learned on the corpus and applied to it, the number of unknown-token (<unk>) replacements reported during binarization drops greatly compared to the previous processing, yet the BLEU score is worse than before. I would like to know why this is.

Old-Young233 · Jun 28 '22, 11:06

fairseq-preprocess does not use --bpe. That is why examples/translation runs a subword-nmt script and the Moses tokenizer beforehand. --bpe and --tokenizer only take effect in fairseq-generate and fairseq-interactive: generate only uses decode, while interactive uses both encode and decode.
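As a rough sketch of that workflow (the directory layout and file names such as data/tok, data/bpe and codes.mo are illustrative, not from the original post): BPE is learned and applied with subword-nmt before binarization, and fairseq-preprocess is then run on the already-segmented text, so it never needs --bpe at all.

# learn BPE codes on the tokenized Mongolian training data
subword-nmt learn-bpe -s 22000 < data/tok/train.mo > data/bpe/codes.mo
# apply the codes to every split
subword-nmt apply-bpe -c data/bpe/codes.mo < data/tok/train.mo > data/bpe/train.bpe.mo
subword-nmt apply-bpe -c data/bpe/codes.mo < data/tok/valid.mo > data/bpe/valid.bpe.mo
subword-nmt apply-bpe -c data/bpe/codes.mo < data/tok/test.mo  > data/bpe/test.bpe.mo
# ... repeat for the .ch side, then binarize the BPE-segmented files
fairseq-preprocess --source-lang mo --target-lang ch \
    --trainpref data/bpe/train.bpe \
    --validpref data/bpe/valid.bpe \
    --testpref data/bpe/test.bpe \
    --destdir data-bin/bpe-dropout

With this setup the dictionary that fairseq-preprocess builds is over BPE subwords, because the input files are already segmented; the --nwordssrc/--nwordstgt caps are optional at that point.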

If you doubt this, search for bpe in fairseq_cli/preprocess.py and you will find nothing, while you will find it in generate.py and interactive.py and can see how it is used there.
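To show where --bpe does take effect, here is a hedged sketch of an interactive call (the checkpoint path and codes file are placeholders, and --tokenizer moses merely mirrors the examples/translation setup): interactive uses these flags to encode raw input into subwords before translation and to decode the output back, which preprocess never does.

fairseq-interactive data-bin/bpe-dropout \
    --path checkpoints/checkpoint_best.pt \
    --source-lang mo --target-lang ch \
    --tokenizer moses \
    --bpe subword_nmt --bpe-codes data/bpe/codes.mo \
    --beam 5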

gmryu · Jun 30 '22, 14:06

Thank you very much for your answer

Old-Young233 · Jul 02 '22, 01:07