
NLLB vocabulary missing common Chinese characters/tokens

Open pluiez opened this issue 1 year ago • 6 comments

Hi, I used the released NLLB checkpoint to decode the flores Chinese test set, and overall the results look good. However, I found that a lot of very common Chinese characters/tokens are missing from the dictionary, so those words are never generated when translating from other languages into Chinese, and they become OOV tokens when translating from Chinese into other languages.

For example, "The eagle catches the chickens" translates to "老鹰捉小鸡" in Chinese, but the NLLB model generates "▁ <unk> 抓 住 了 <unk>" since the tokens for the two species are absent from the dictionary. This is a practical problem, because the missing tokens are extremely common in real-world text.

The following are some of the high-frequency tokens that are missing from the dictionary (a quick check is sketched after the list):

饱
畅
湍
滩
岭
舱
诩
阔
荫
鸽
勋
鸡
鹰
裙
艳
哦
毋庸
稻
蔗
熔
亥
裤
氢
《
》
...
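A minimal sketch of this check, assuming the released spm_200 SentencePiece model and the dictionary.txt linked further down (the file names here are placeholders, not the actual release paths):

```python
# Sketch: encode a sentence with the NLLB SentencePiece model and report which
# pieces are absent from dictionary.txt (fairseq will map those to <unk>).
# "spm_200.model" and "dictionary.txt" are placeholder paths for the released files.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm_200.model")

# fairseq's dictionary.txt has one "<token> <count>" pair per line
with open("dictionary.txt", encoding="utf-8") as f:
    vocab = {line.split()[0] for line in f if line.strip()}

sentence = "老鹰捉小鸡"
pieces = sp.encode(sentence, out_type=str)
print(pieces)                                  # e.g. ['▁老', '鹰', '捉', '小', '鸡']
print([p for p in pieces if p not in vocab])   # pieces that will become <unk>
```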

pluiez avatar Jul 09 '22 16:07 pluiez

I thought NLLB uses a byte-level sentencepiece. Am I wrong? Is the dict you are talking about this one: https://dl.fbaipublicfiles.com/large_objects/nllb/models/spm_200/dictionary.txt ?

Since it is a byte-level dictionary, there are no actual words/characters inside. The entries are meant to be decoded back to normal strings later, so I think those pieces are generated just by chance.

gmryu avatar Jul 09 '22 17:07 gmryu

> I thought NLLB uses a byte-level sentencepiece. Am I wrong? Is the dict you are talking about this one: https://dl.fbaipublicfiles.com/large_objects/nllb/models/spm_200/dictionary.txt ?
>
> Since it is a byte-level dictionary, there are no actual words/characters inside. The entries are meant to be decoded back to normal strings later, so I think those pieces are generated just by chance.

Yes, I did use this dict. The sentencepiece model and translation model dictionary were downloaded from https://github.com/facebookresearch/fairseq/tree/nllb#preparing-datasets-for-training

Here is the translation output from Chinese to English on flores devtest; there are a total of 447 <unk>s in the source language across 1012 sentences.

log.flores-test.checkpoint.NLLB-200-Distilled-600M.zh2en.txt

pluiez avatar Jul 09 '22 17:07 pluiez

Confirmed. The downloaded dictionary.txt does not contain all byte chars, so a lot of words/characters are actually treated as <unk>.

I inspected the original dictionary by adding more logger.info calls inside fairseq/data/dictionary.py (I also wrote an extension of this dictionary to convert unknown tokens to byte chars):

original: 老鹰捉小鸡
after spm: ▁老 鹰 捉 小 鸡
2022-07-10 13:45:57 | INFO | fairseq_cli.preprocess | ▁老 鹰 捉 小 鸡
2022-07-10 13:45:57 | INFO | fairseq_cli.preprocess | 鹰 not found in self.indices
# Since it is not found, I transfer 鹰 to its equivalent byte char string.
2022-07-10 13:45:57 | INFO | fairseq_cli.preprocess | 鹰 := é¹°
2022-07-10 13:45:57 | INFO | fairseq_cli.preprocess | 鸡 not found in self.indices
# Likewise, 鸡 to its equivalent byte char string.
2022-07-10 13:45:57 | INFO | fairseq_cli.preprocess | 鸡 := é¸¡
2022-07-10 13:45:57 | INFO | fairseq_cli.preprocess | tensor([230393, 248132,      3, 250174, 252996, 250014, 248132,      3, 249934,      2], dtype=torch.int32)

decoded: ▁老 é <unk> ° 捉 小 é <unk> ¡
bchar converted: ▁老<unk>捉小<unk>

There are two "3"s in the tensor, which means the byte chars "¹" and "¸" do not exist in the downloaded dictionary.txt either. In total, there are 36 byte chars missing from dictionary.txt.
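A small sketch, assuming the same dictionary.txt as above and assuming byte chars appear in it as the plain latin-1 character for each byte (which matches the "é¹°" example), for enumerating which byte chars are missing:

```python
# Sketch: check which of the 256 byte chars (one latin-1 character per possible
# byte value) are absent from the downloaded dictionary.txt.
with open("dictionary.txt", encoding="utf-8") as f:
    vocab = {line.split()[0] for line in f if line.strip()}

missing = [chr(b) for b in range(256) if chr(b) not in vocab]
print(len(missing), missing)  # reported above as 36 missing byte chars
```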

p.s. You can use the tensor to find the corresponding tokens inside dictionary.txt (open dictionary.txt as utf-8). ▁老 is the first element = 230393 = (line number) 230390 - 1 (ids start from 0) + 4 (the dictionary starts with bos, pad, eos, unk) = 230390 + 3, so go to line 230390 of dictionary.txt. é is 248132, so go to line 248132 - 3 = 248129. (This applies to almost any fairseq dictionary and its dict.txt.)
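The same arithmetic as a sketch, assuming the standard fairseq Dictionary layout (ids 0-3 reserved for <s>, <pad>, </s>, <unk>):

```python
# Sketch of the id <-> dictionary.txt line mapping described above.
def dict_line_to_id(line_number: int) -> int:
    """1-indexed line in dictionary.txt -> fairseq token id."""
    return line_number - 1 + 4   # i.e. line_number + 3

def id_to_dict_line(token_id: int) -> int:
    """fairseq token id -> 1-indexed line in dictionary.txt."""
    return token_id - 3

print(dict_line_to_id(230390))  # 230393, the id of '▁老' in the tensor above
print(id_to_dict_line(248132))  # 248129, the line holding 'é'
```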

--

Well, adding those 36 byte chars to dictionary.txt does not instantly fix the problem, since the pretrained model's input dimension is already fixed; you also need to write a new fairseq dictionary.py that converts unknown words/chars to byte char strings before finishing the dictionary encoding.

gmryu avatar Jul 10 '22 04:07 gmryu

Thank you for your nice explanation! Does this mean that the model may need fine-tuning on an extended vocabulary including the missing byte chars to fix this problem?

pluiez avatar Jul 10 '22 06:07 pluiez

I do wonder how the authors dealt with those unknown words. It feels like a huge hole, and they would not have overlooked this.


In my case, I expanded fairseq/data/dictionary.py to overwrite the least-used tokens in dict.txt with the missing byte chars and to turn unknown words into byte char strings. With these two changes, there are no more <unk>s, and the vocabulary size stays the same, so this is probably the minimal change to the pretrained model. It would be a disaster if those least-used tokens were the ones you need most, but this time it is only 36 tokens, which can probably be ignored. Then, yes, you can fine-tune the model for your case without worrying about unknown words/symbols.
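A rough sketch of what those two changes amount to, assuming unknown pieces are decomposed into one latin-1 character per UTF-8 byte (matching the "é¹°" example above); this is not the actual patch:

```python
# Sketch: byte-char fallback for pieces missing from the fairseq dictionary.
def to_byte_chars(piece):
    """Decompose a string into one latin-1 character per UTF-8 byte (鹰 -> 'é¹°')."""
    return [bytes([b]).decode("latin-1") for b in piece.encode("utf-8")]

def encode_with_byte_fallback(pieces, indices, unk_id=3):
    """indices: the Dictionary's token -> id mapping (self.indices in dictionary.py)."""
    ids = []
    for piece in pieces:
        if piece in indices:
            ids.append(indices[piece])
        else:
            # Fall back to byte chars; any byte char itself missing from the
            # dictionary (the 36 discussed above, unless written over the
            # least-used tokens) would still come out as <unk>.
            ids.extend(indices.get(bc, unk_id) for bc in to_byte_chars(piece))
    return ids

print("".join(to_byte_chars("鹰")))  # é¹°
```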

gmryu avatar Jul 11 '22 11:07 gmryu

Other character-based languages, such as Japanese, may have similar problems too.

BrightXiaoHan avatar Jul 18 '22 10:07 BrightXiaoHan

@huihuifan @edunov do you know whether it would be possible to "update" the vocab for CJK and continue training so that these issues might be fixed?

https://discuss.huggingface.co/t/nllb-3-3b-poor-translations-from-chinese-to-english/27695

vince62s avatar Jan 17 '23 09:01 vince62s

@pluiez I am working on a curated version of NLLB-200 to include these 26 symbols. Are you sure there are no other missing symbols for Chinese/Japanese/Korean?

vince62s avatar Mar 22 '23 17:03 vince62s