The use of the --bpe parameter in fairseq-preprocess
We find that the vocabulary obtained with --bpe in fairseq-preprocess is the same regardless of whether it is specified as characters, subword_nmt, fastbpe, etc., when we use the following command:
fairseq-preprocess --source-lang mo --target-lang ch \
    --trainpref data/bpe-dropout/train \
    --validpref data/bpe-dropout/valid \
    --testpref data/bpe-dropout/test \
    --destdir data-bin/bpe-dropout/autovocab \
    --bpe subword_nmt \
    --nwordssrc 22000 \
    --nwordstgt 30000
where Mongolian is not preprocessed in any way and Chinese is tokenized. Is the vocabulary that fairseq generates from the corpus word-level in this case? And how is the --bpe parameter used?
We also found that when the corpus is learned and processed with BPE, the number of
fairseq-preprocess does not utilize --bpe.
That is why examples/translation uses a subword-nmt script and the Moses tokenizer internally.
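In other words, tokenization and BPE have to be applied before fairseq-preprocess ever sees the data. A minimal sketch of that workflow with subword-nmt, shown for the source side only; the file paths, the codes.mo name, and the 22000-merge count are illustrative, and the target side would be prepared the same way after tokenization:

# Learn BPE merges on the raw training data (merge count is illustrative)
subword-nmt learn-bpe -s 22000 \
    < data/bpe-dropout/train.mo > codes.mo

# Apply the learned codes to every split before binarization
for split in train valid test; do
    subword-nmt apply-bpe -c codes.mo \
        < data/bpe-dropout/${split}.mo \
        > data/bpe-dropout/${split}.bpe.mo
done

# Only now run fairseq-preprocess, which simply binarizes the
# already-segmented text; the --bpe flag plays no role at this step
# (assuming the .ch side has been prepared analogously)
fairseq-preprocess --source-lang mo --target-lang ch \
    --trainpref data/bpe-dropout/train.bpe \
    --destdir data-bin/bpe-dropout/autovocab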
The bpe and tokenizer options only take effect in fairseq-generate and fairseq-interactive. generate only uses decode, while interactive uses both encode and decode.
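So --bpe and --tokenizer matter only at inference time. A hedged sketch of both commands, assuming a trained checkpoint at checkpoints/checkpoint_best.pt and the codes.mo file from the sketch above (both paths are illustrative):

# interactive encodes raw input (tokenizer, then BPE) and decodes the output
fairseq-interactive data-bin/bpe-dropout/autovocab \
    --path checkpoints/checkpoint_best.pt \
    --source-lang mo --target-lang ch \
    --tokenizer moses \
    --bpe subword_nmt --bpe-codes codes.mo

# generate reads the already-binarized test set, so it only needs to
# strip BPE from the hypotheses on the way out
fairseq-generate data-bin/bpe-dropout/autovocab \
    --path checkpoints/checkpoint_best.pt \
    --source-lang mo --target-lang ch \
    --remove-bpe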
If you doubt this, you can search for bpe in fairseq_cli/preprocess.py and you will find nothing, while you will find it in generate.py and interactive.py, where you can see how they are used.
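For instance, from the repository root (a grep invocation is just one way to run that search):

# per the explanation above, the first command should print nothing
grep -n "bpe" fairseq_cli/preprocess.py
grep -n "bpe" fairseq_cli/generate.py fairseq_cli/interactive.py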
Thank you very much for your answer