How does providing a vocabulary affect training BPE models?
I am currently training a transformer model and have followed the MTM labs to apply BPE to my own corpus. However, I'm unsure of the effect that providing a pre-determined vocabulary has. Does it impact the training process, or does it just limit the tokens that can be produced at the decoding step (e.g. for validation or translation)?
Subword segmentation helps translating rare and out-of-vocabulary words. See, for example, https://www.aclweb.org/anthology/P16-1162
Hi @snukky, thanks, I understand how BPE works and its purpose. I was just wondering: if you provide a limited vocabulary (of full words, using the -v flag) to Marian during the training process, whether and how that changes the training process?
All out-of-vocabulary words become <unk> tokens, so using a vocabulary of the top-K most frequent words instead of subword units increases the frequency of <unk>s. There is no change in how the training works.
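A toy illustration (not from the thread; the BPE segmentation shown uses subword-nmt's @@ continuation marker and the actual split depends on the learned merges): with a word-level vocabulary of frequent words, a rare word is replaced wholesale, while BPE keeps it representable:

the transfiguration failed          (input)
the <unk> failed                    (word-level vocab: rare word lost)
the trans@@ figu@@ ration failed    (BPE: rare word kept as subwords)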
If no existing vocabulary is provided with -v, vocabularies of 32k tokens are automatically created from the training corpora.
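For example, a training invocation with explicit vocabularies might look roughly like this (a minimal sketch; file paths are hypothetical, flag names follow the Marian command-line documentation):

marian --type transformer -m model.npz \
  --train-sets corpus.src corpus.trg \
  --vocabs vocab.src.yml vocab.trg.yml

Leaving out --vocabs / -v triggers the automatic vocabulary creation described above.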
If Marian is compiled with SentencePiece and the vocab path ends with .spm, Marian will handle subword segmentation and de-segmentation internally, so raw files can be provided for training/validation/decoding (more details: https://github.com/marian-nmt/marian-examples/tree/master/training-basics-sentencepiece).
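With a SentencePiece-enabled build, the sketch above changes only in the vocab paths, e.g. (hypothetical file names; passing the same .spm file for both streams gives a shared vocabulary):

marian --type transformer -m model.npz \
  --train-sets corpus.src corpus.trg \
  --vocabs corpus.src-trg.spm corpus.src-trg.spm

Here corpus.src and corpus.trg can be raw, untokenized text; the linked example walks through the full setup.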
To add to this, if you don't provide a vocabulary, Marian will create the file for you. The created file is the equivalent of this:
cat data | tr ' ' '\n' | LC_ALL=C sort -u
plus the added </s> and <unk> tokens, written out as a YAML file. Nothing else.
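Based on that description, the resulting YAML file is just a token-to-id mapping, something along these lines (a rough sketch; the concrete tokens, ids, and ordering will differ per run):

"</s>": 0
"<unk>": 1
",": 2
"the": 3
...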
I do not recommend using this feature. As of the last time I ran it, the generated vocabularies are non-deterministic: running it twice can produce differently sorted vocabularies (for words with the same frequency, the assigned word ids are permuted randomly). Also, the vocabs are written as UTF-8 YAML files without validating the UTF-8, so if your corpus contains invalid UTF-8, you end up with malformed YAML that cannot, e.g., be loaded by Python. I always create my own vocabs and write them out as plain-text word lists (one token per line, no syntax, no counts). Marian supports this.
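A minimal sketch of building such a plain-text vocab yourself, with a deterministic tie-break on the token so repeated runs give the same order (file names are hypothetical; whether </s> and <unk> must be listed explicitly may depend on your Marian version):

cat corpus.bpe.src corpus.bpe.trg \
  | tr ' ' '\n' | grep -v '^$' \
  | LC_ALL=C sort | uniq -c \
  | LC_ALL=C sort -k1,1nr -k2,2 \
  | awk '{print $2}' > tokens.freq
{ printf '</s>\n<unk>\n'; grep -v -x -e '</s>' -e '<unk>' tokens.freq; } > vocab.txt

This sorts by frequency (descending) and breaks ties alphabetically, which is one way to get the stable ordering suggested below.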
Can we add the word itself as a second sort criterion (or even just make the sort stable) to make it deterministic?
That should solve it.