
How does providing a vocabulary affect training BPE models?

hc09141 opened this issue 5 years ago • 6 comments

I am currently training a transformer model and have followed the MTM labs to apply BPE to my own corpus. However, I'm unsure what effect providing a predetermined vocabulary has. Does it affect the training process, or does it only limit the tokens that can be produced at the decoding step (e.g. for validation or translation)?

hc09141 · May 26 '19 17:05

Subword segmentation helps translating rare and out-of-vocabulary words. See, for example, https://www.aclweb.org/anthology/P16-1162

snukky · May 26 '19 19:05

Hi @snukky, thanks. I understand how BPE works and its purpose; I was just wondering: if you provide a limited vocabulary (of full words, using the -v flag) to Marian during training, how (if at all) does that change the training process?

hc09141 · May 27 '19 07:05

All out-of-vocabulary words become <unk> tokens, so using a vocabulary of the top K most frequent words instead of subword units increases the frequency of <unk>s. There is no change in how the training itself works.

If no existing vocabulary is provided with -v, vocabularies of 32k tokens are automatically created from the training corpora.
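
For illustration, a minimal sketch of such a training invocation (file names are placeholders and --dim-vocabs is shown only to make the 32k size explicit; adjust to your setup):

marian \
  --train-sets corpus.bpe.en corpus.bpe.de \
  --vocabs vocab.en.yml vocab.de.yml \
  --dim-vocabs 32000 32000 \
  --model model.npz
# If vocab.en.yml / vocab.de.yml do not exist yet, Marian creates them
# from the training corpora before training starts.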

If Marian is compiled with SentencePiece and the vocab path ends with .spm, Marian will handle subword segmentation and de-segmentation internally so raw files can be provided for training/validation/decoding (more details: https://github.com/marian-nmt/marian-examples/tree/master/training-basics-sentencepiece).
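
A corresponding sketch for a SentencePiece-enabled build (again, paths are placeholders):

marian \
  --train-sets corpus.raw.en corpus.raw.de \
  --vocabs vocab.en.spm vocab.de.spm \
  --model model.npz
# Because the vocab paths end in .spm, Marian segments and de-segments
# internally, so the training/validation/decoding files can be raw,
# untokenized text.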

snukky · May 27 '19 08:05

To add to this, if you don't provide a vocabulary, Marian will create the file for you. The created file is the equivalent of this:

cat data | tr ' ' '\n' | LC_ALL=C sort -u

plus added </s> and <unk>, and written out as a YAML file. Nothing else.

I do not recommend using this feature: as of the last time I ran it, the generated vocabularies were non-deterministic, so running it twice can produce differently sorted vocabularies (words with the same frequency get their word ids permuted randomly). Also, the vocabs are written out as UTF-8 YAML files without validating the UTF-8, so if your corpus contains invalid UTF-8 you end up with malformed YAML that cannot, for example, be loaded by Python. I always create my own vocabs and write them out as plain-text word lists (one token per line, no syntax, no counts), which Marian supports; a deterministic way to build such a list is sketched below.
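
For illustration, one way to build such a plain-text word list deterministically (a sketch, not Marian's internal code path; the corpus name and the special-token order are assumptions):

# Hypothetical helper, not Marian's internal vocab creation: build a
# deterministic, frequency-sorted plain-text vocab (one token per line,
# no counts). Ties in frequency are broken alphabetically, so repeated
# runs over the same corpus produce identical files.
cat corpus.bpe.en \
  | tr -s ' ' '\n' \
  | sort | uniq -c \
  | LC_ALL=C sort -k1,1nr -k2,2 \
  | awk '{ print $2 }' \
  > words.txt

# Prepend the special tokens; their presence matches the thread, but the
# order shown here is an assumption about Marian's defaults.
{ printf '</s>\n<unk>\n'; cat words.txt; } > vocab.en.txt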

frankseide · May 28 '19 15:05

Can we add the word itself as a second sort criterion (or even just make the sort stable) to make it deterministic?

kpu · May 28 '19 15:05

That should solve it.

frankseide · May 28 '19 15:05