seamless_communication Outputs too many <unk> symbols with Mandarin Chinese (cmn & cmn_Hant) and Cantonese (yue)

For example, "Oh, Peter." is translated to "<unk>,彼得." and "Oh, my god" is translated to "<unk>,我的上帝".

Almost every "Oh" is translated into <unk>, which makes this project almost unusable for Mandarin Chinese and Cantonese.

$ m4t_predict "Oh, Peter."  t2tt cmn --src_lang eng
2023-09-20 03:22:20,215 INFO -- m4t_scripts.predict.predict: Running inference on the GPU in torch.float16.
Using the cached checkpoint of the model 'seamlessM4T_large'. Set `force=True` to download again.
Using the cached tokenizer of the model 'seamlessM4T_large'. Set `force=True` to download again.
Using the cached checkpoint of the model 'vocoder_36langs'. Set `force=True` to download again.
2023-09-20 03:22:24,949 INFO -- m4t_scripts.predict.predict: Translated text in cmn: <unk>,彼得.

$ m4t_predict "Oh, Peter."  t2tt cmn_Hant --src_lang eng
2023-09-20 03:22:48,454 INFO -- m4t_scripts.predict.predict: Running inference on the GPU in torch.float16.
Using the cached checkpoint of the model 'seamlessM4T_large'. Set `force=True` to download again.
Using the cached tokenizer of the model 'seamlessM4T_large'. Set `force=True` to download again.
Using the cached checkpoint of the model 'vocoder_36langs'. Set `force=True` to download again.
2023-09-20 03:22:53,404 INFO -- m4t_scripts.predict.predict: Translated text in cmn_Hant: <unk>, 彼得.

$ m4t_predict "Oh, Peter."  t2tt yue --src_lang eng
2023-09-20 03:21:16,073 INFO -- m4t_scripts.predict.predict: Running inference on the GPU in torch.float16.
Using the cached checkpoint of the model 'seamlessM4T_large'. Set `force=True` to download again.
Using the cached tokenizer of the model 'seamlessM4T_large'. Set `force=True` to download again.
Using the cached checkpoint of the model 'vocoder_36langs'. Set `force=True` to download again.
2023-09-20 03:21:20,886 INFO -- m4t_scripts.predict.predict: Translated text in yue: <unk>,彼得.
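
To compare runs like the three above across target languages, a small helper can pull the translated text out of each final log line and count the `<unk>` tokens. This is only a sketch: it assumes the `Translated text in <lang>: ...` line format shown in the logs above, and the names `extract_translation` and `unk_report` are made up for illustration, not part of seamless_communication.

```python
import re

# Matches the final log line from the runs above, e.g.
# "... m4t_scripts.predict.predict: Translated text in cmn: <unk>,彼得."
# Assumption: the log format stays exactly as shown in this issue.
TRANSLATED_RE = re.compile(r"Translated text in (?P<lang>\w+): (?P<text>.*)$")

UNK = "<unk>"


def extract_translation(log_line):
    """Return (lang, text) from a 'Translated text in ...' line, or None."""
    match = TRANSLATED_RE.search(log_line)
    if match is None:
        return None
    return match.group("lang"), match.group("text")


def unk_report(log_lines):
    """Map each target language to the number of <unk> tokens in its output."""
    report = {}
    for line in log_lines:
        parsed = extract_translation(line)
        if parsed is not None:
            lang, text = parsed
            report[lang] = report.get(lang, 0) + text.count(UNK)
    return report


if __name__ == "__main__":
    # The three result lines copied from the logs in this issue.
    logs = [
        "2023-09-20 03:22:24,949 INFO -- m4t_scripts.predict.predict: Translated text in cmn: <unk>,彼得.",
        "2023-09-20 03:22:53,404 INFO -- m4t_scripts.predict.predict: Translated text in cmn_Hant: <unk>, 彼得.",
        "2023-09-20 03:21:20,886 INFO -- m4t_scripts.predict.predict: Translated text in yue: <unk>,彼得.",
    ]
    print(unk_report(logs))
```

Feeding it the captured output of many `m4t_predict` runs would give a rough per-language count of how often `<unk>` appears.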

#152 #64

Sep 19 '23 19:09 tanshuai

Also confirmed on my end with the SeamlessM4TLarge model.

English input:

A witch can fly with a broom.

Chinese output:

一个女巫可以用扫<unk>飞.

Seamless T2TT is based on NLLB, but NLLB does not have this issue.

Jan 26 '24 06:01 Qubitium

@cndn you pushed a README update at https://github.com/facebookresearch/seamless_communication/commit/df2816adf3574016ffa99eb947ec3bff23310413 but the diff is related to audio alignment, while this issue is about text-to-text translation. Can you provide an example so we can fix this for T2TT? Thanks.

Jan 26 '24 06:01 Qubitium

Same issue here with T2TT on the v2 model. How can the model be predicting a special token in the first place? This does not happen on the HF space demo, and I am using the same code as the demo. For me it only happens for some instances of "Oh".

Mar 29 '24 09:03 aliencaocao

Me too.

Mar 30 '24 15:03 asulada

How can I fix this problem?

Jun 20 '24 08:06 skywindy

@skywindy I believe the tokenizer mapping in the open-sourced repo is completely wrong, which is what creates these UNK issues. Either that, or they trained with a broken tokenizer. There is no way for us to fix this without knowing whether the model or the tokenizer is broken.
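
To illustrate why the two hypotheses are hard to tell apart from the output alone, here is a toy sketch. Everything in it is hypothetical: the five-entry character vocabulary, the ids, and the `encode`/`decode` helpers are invented for illustration and are not the actual SeamlessM4T tokenizer. It only shows that (a) training targets encoded with a vocabulary missing "哦" would already contain the unknown id, so a model trained on them could learn to emit it, and (b) a model emitting an id that the shipped id-to-token table lacks would decode to the same string.

```python
UNK_ID = 0

# Hypothetical toy vocabulary; "哦" ("Oh") is deliberately left out.
TOKEN_TO_ID = {"<unk>": UNK_ID, ",": 1, "彼": 2, "得": 3, ".": 4}
ID_TO_TOKEN = {i: t for t, i in TOKEN_TO_ID.items()}


def encode(text):
    """Character-level encode; anything outside the vocabulary becomes UNK_ID."""
    return [TOKEN_TO_ID.get(ch, UNK_ID) for ch in text]


def decode(ids):
    """Map ids back to text; ids missing from the table render as <unk>."""
    return "".join(ID_TO_TOKEN.get(i, "<unk>") for i in ids)


# Hypothesis (a): broken tokenizer at training time.
# The reference translation is encoded with the unknown id in first position,
# so the training target itself teaches the model to produce <unk>.
training_target = encode("哦,彼得.")
print(training_target)          # [0, 1, 2, 3, 4]
print(decode(training_target))  # <unk>,彼得.

# Hypothesis (b): wrong id-to-token mapping at inference time.
# The model emits id 5 (which it may have learned as "哦"), but the shipped
# table has no entry for 5, so decoding falls back to <unk>.
model_output = [5, 1, 2, 3, 4]
print(decode(model_output))     # <unk>,彼得.
```

Both paths print the same `<unk>,彼得.`, so distinguishing them would need the raw output ids and the vocabulary used in training, not just the decoded text.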

Jun 20 '24 08:06 Qubitium
