Missing keys for some low-resource language pairs
On the ber-es transformer, if I run:
spm_encode --model source.spm <<< "Bessif kanay."
I get:
▁Be ssif ▁kan ▁ay .
But ▁Be is not in opus.spm32k-spm32k.vocab.yml, so my python tokenizer raises a KeyError when it encounters these tokens.
This doesn't change if I run preprocess.sh first.
When I run the pieced sequence through marian-decoder I get a good translation and no error.
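Since marian-decoder translates these sequences without complaint, it presumably maps unseen pieces to its unknown token internally. A minimal sketch of the same fallback on the Python side, assuming the vocab.yml maps pieces to ids and contains an "<unk>" entry (id 1 is typical for OPUS-MT vocab files, but worth verifying against your own file):

```python
def pieces_to_ids(pieces, vocab, unk_id=1):
    # dict.get with a default avoids the KeyError on pieces such as "▁Be";
    # unk_id=1 is an assumption based on typical OPUS-MT vocab.yml layouts
    return [vocab.get(p, unk_id) for p in pieces]

# Toy stand-in for the mapping loaded from opus.spm32k-spm32k.vocab.yml
vocab = {"</s>": 0, "<unk>": 1, "ssif": 42, "▁kan": 43, "▁ay": 44, ".": 45}
print(pieces_to_ids(["▁Be", "ssif", "▁kan", "▁ay", "."], vocab))
# → [1, 42, 43, 44, 45]
```

In practice the vocab dict would come from parsing the vocab.yml (e.g. with yaml.safe_load) rather than being written inline.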
This happens for other model/token combinations as well; here is a list of (pair, missing key) entries from a random sample of models I tested:
{'ha-en': '|',
'ber-es': '▁Be',
'pis-fi': '▁|',
'es-mt': '|',
'fr-he': '₫',
'niu-sv': 'OGI',
'fi-fse': '▁rentou',
'fi-mh': '|',
'hr-es': '|',
'fr-ber': '▁devr',
'ase-en': 'olos',
'sv-uk': '|'}
Is this expected? Should my encoder use the
This is indeed strange; I don't really know what is happening in these cases and need to investigate. The only explanation I can think of is that the sentencepiece model was trained on different data than the data I used for creating the vocabulary. That can indeed happen, as I keep the sentencepiece model constant even when I augment the data, but the basic data set should always be included. I have no immediate answer on that ...
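One way to check that hypothesis would be to encode the vocabulary-creation corpus with the sentencepiece model and diff the resulting pieces against the vocab.yml keys. A small sketch of the core set difference (the corpus path and the surrounding loading code are assumptions, not from the repository):

```python
def missing_pieces(piece_seqs, vocab):
    # Pieces emitted by the sentencepiece model but absent from the
    # vocab.yml mapping; a non-empty result confirms the data mismatch
    return {p for seq in piece_seqs for p in seq if p not in vocab}

# Toy check with the pieces from the ber-es example above
demo = missing_pieces([["▁Be", "ssif", "▁kan", "▁ay", "."]],
                      {"ssif", "▁kan", "▁ay", "."})
print(demo)
# → {'▁Be'}
```

In a real audit, piece_seqs would be something like sp.encode(line, out_type=str) for each corpus line (via sentencepiece's SentencePieceProcessor), and vocab would be the set of keys parsed from opus.spm32k-spm32k.vocab.yml.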