Outputs too many <unk> symbols with Mandarin Chinese (cmn & cmn_Hant) and Cantonese (yue)
For example "Oh, Peter." translated to "<unk>,彼得.", "Oh, my god" translated to "<unk>,我的上帝"
Almost all of the "Oh" are translated into <unk>, making this project almost unusable for Chinese and Cantonese..
$ m4t_predict "Oh, Peter." t2tt cmn --src_lang eng
2023-09-20 03:22:20,215 INFO -- m4t_scripts.predict.predict: Running inference on the GPU in torch.float16.
Using the cached checkpoint of the model 'seamlessM4T_large'. Set `force=True` to download again.
Using the cached tokenizer of the model 'seamlessM4T_large'. Set `force=True` to download again.
Using the cached checkpoint of the model 'vocoder_36langs'. Set `force=True` to download again.
2023-09-20 03:22:24,949 INFO -- m4t_scripts.predict.predict: Translated text in cmn: <unk>,彼得.
$ m4t_predict "Oh, Peter." t2tt cmn_Hant --src_lang eng
2023-09-20 03:22:48,454 INFO -- m4t_scripts.predict.predict: Running inference on the GPU in torch.float16.
Using the cached checkpoint of the model 'seamlessM4T_large'. Set `force=True` to download again.
Using the cached tokenizer of the model 'seamlessM4T_large'. Set `force=True` to download again.
Using the cached checkpoint of the model 'vocoder_36langs'. Set `force=True` to download again.
2023-09-20 03:22:53,404 INFO -- m4t_scripts.predict.predict: Translated text in cmn_Hant: <unk>, 彼得.
$ m4t_predict "Oh, Peter." t2tt yue --src_lang eng
2023-09-20 03:21:16,073 INFO -- m4t_scripts.predict.predict: Running inference on the GPU in torch.float16.
Using the cached checkpoint of the model 'seamlessM4T_large'. Set `force=True` to download again.
Using the cached tokenizer of the model 'seamlessM4T_large'. Set `force=True` to download again.
Using the cached checkpoint of the model 'vocoder_36langs'. Set `force=True` to download again.
2023-09-20 03:21:20,886 INFO -- m4t_scripts.predict.predict: Translated text in yue: <unk>,彼得.
#152 #64
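In case it helps triage, the same behaviour should be reproducible through the Python API as well. A minimal sketch, following the Translator usage from the README of the current release; the import path and signature may change in later versions:

import torch
from seamless_communication.models.inference import Translator

# Same checkpoint and vocoder as the CLI runs above; the vocoder is not
# used for t2tt, but the constructor expects one.
translator = Translator(
    "seamlessM4T_large",
    "vocoder_36langs",
    torch.device("cuda:0"),
    torch.float16,
)

for tgt_lang in ("cmn", "cmn_Hant", "yue"):
    translated_text, _, _ = translator.predict(
        "Oh, Peter.", "t2tt", tgt_lang, src_lang="eng"
    )
    print(tgt_lang, translated_text)  # "<unk>,彼得." in all three cases, as above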
Also confirmed on my end with the SeamlessM4T Large model.
English input:
A witch can fly with a broom.
Chinese output:
一个女巫可以用扫<unk>飞.
Seamless T2TT is based on NLLB, but NLLB does not have this issue.
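For anyone who wants to verify that, here is a minimal sketch running the same sentence through a public NLLB-200 checkpoint on Hugging Face (the distilled 600M variant is my choice here; any NLLB-200 size should behave the same way):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Same sentence through NLLB-200, English -> Simplified Chinese.
tok = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

inputs = tok("Oh, Peter.", return_tensors="pt")
out = model.generate(
    **inputs,
    forced_bos_token_id=tok.convert_tokens_to_ids("zho_Hans"),  # target language tag
    max_length=64,
)
print(tok.batch_decode(out, skip_special_tokens=True)[0])  # no <unk> reported with NLLB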
@cndn you pushed a readme update at https://github.com/facebookresearch/seamless_communication/commit/df2816adf3574016ffa99eb947ec3bff23310413, but that diff is related to audio alignment, while this issue is about text-to-text translation. Can you provide an example of how we can fix this for T2TT? Thanks.
Same issue here with the v2 model on T2TT... How can the model be predicting a special token in the first place?
This does not happen on the HF space demo, and I am using the same code as the demo...
It only happens for some instances of "Oh" for me.
Me too.
How can I fix this problem?
@skywindy I believe the tokenizer mapping in the open-sourced repo is completely wrong, which is what creates these UNK issues. Either that, or they trained with a broken tokenizer. There is no way for us to fix this without knowing whether the model or the tokenizer is broken.
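One way to narrow it down: the <unk> shows up in the output, so check whether the target-side characters (e.g. 帚 in the broom example above) survive a round trip through the raw SentencePiece model. A minimal sketch, assuming sentencepiece is installed; the model path is a placeholder for the tokenizer.model that m4t_predict downloads into its cache:

import sentencepiece as spm

# Placeholder path: point this at the tokenizer.model cached alongside
# the seamlessM4T_large checkpoint.
sp = spm.SentencePieceProcessor(model_file="/path/to/seamlessM4T_large/tokenizer.model")

for text in ["Oh, Peter.", "一个女巫可以用扫帚飞.", "帚"]:
    pieces = sp.encode(text, out_type=str)
    ids = sp.encode(text, out_type=int)
    # If any id equals sp.unk_id(), that piece is missing from the
    # vocabulary, and the decoder can only ever emit <unk> for it.
    print(text, pieces, "contains unk:", sp.unk_id() in ids)

If the encoding is clean, the tokenizer itself is probably fine and the unk is being generated by the model; if it is not, the vocabulary shipped with the repo really is broken.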