Missing keys for some low-resource language pairs
On the ber-es transformer, if I run:
spm_encode --model source.spm <<< "Bessif kanay."
I get:
▁Be ssif ▁kan ▁ay .
But ▁Be is not in opus.spm32k-spm32k.vocab.yml, so my python tokenizer raises a KeyError when it encounters these tokens.
This doesn't change if I run preprocess.sh first.
When I run the pieced sequence through marian-decoder I get a good translation and no error.
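Since marian-decoder translates these sequences without complaint, it presumably maps unseen pieces to its unknown token internally. A minimal sketch of the same fallback on the Python side, assuming the vocab.yml maps pieces to ids and contains an "<unk>" entry (id 1 is typical for OPUS-MT vocab files, but worth verifying against your own file):

```python
def pieces_to_ids(pieces, vocab, unk_id=1):
    # dict.get with a default avoids the KeyError on pieces such as "▁Be";
    # unk_id=1 is an assumption based on typical OPUS-MT vocab.yml layouts
    return [vocab.get(p, unk_id) for p in pieces]

# Toy stand-in for the mapping loaded from opus.spm32k-spm32k.vocab.yml
vocab = {"</s>": 0, "<unk>": 1, "ssif": 42, "▁kan": 43, "▁ay": 44, ".": 45}
print(pieces_to_ids(["▁Be", "ssif", "▁kan", "▁ay", "."], vocab))
# → [1, 42, 43, 44, 45]
```

In practice the vocab dict would come from parsing the vocab.yml (e.g. with yaml.safe_load) rather than being written inline.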
This happens for other model/token combinations as well; here is a list of (pair, missing key) entries from a random sample of models I tested:
{'ha-en': '|',
'ber-es': '▁Be',
'pis-fi': '▁|',
'es-mt': '|',
'fr-he': '₫',
'niu-sv': 'OGI',
'fi-fse': '▁rentou',
'fi-mh': '|',
'hr-es': '|',
'fr-ber': '▁devr',
'ase-en': 'olos',
'sv-uk': '|'}
Is this expected? Should my encoder use the
This is indeed strange; I don't really know what is happening in these cases and need to investigate. The only explanation I can think of is that the sentencepiece model was trained on different data than the data I used for creating the vocabulary. That can indeed happen, as I keep the sentencepiece model constant even when I augment the data, but the basic data set should always be included. I have no immediate answer on that ...
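One way to check that hypothesis would be to encode the vocabulary-creation corpus with the sentencepiece model and diff the resulting pieces against the vocab.yml keys. A small sketch of the core set difference (the corpus path and the surrounding loading code are assumptions, not from the repository):

```python
def missing_pieces(piece_seqs, vocab):
    # Pieces emitted by the sentencepiece model but absent from the
    # vocab.yml mapping; a non-empty result confirms the data mismatch
    return {p for seq in piece_seqs for p in seq if p not in vocab}

# Toy check with the pieces from the ber-es example above
demo = missing_pieces([["▁Be", "ssif", "▁kan", "▁ay", "."]],
                      {"ssif", "▁kan", "▁ay", "."})
print(demo)
# → {'▁Be'}
```

In a real audit, piece_seqs would be something like sp.encode(line, out_type=str) for each corpus line (via sentencepiece's SentencePieceProcessor), and vocab would be the set of keys parsed from opus.spm32k-spm32k.vocab.yml.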