argos-translate

Doesn't understand Chinese

Open NRHGDW opened this issue 2 years ago • 9 comments

Hello. image

How are you? image

I'm too/very tired image

NRHGDW avatar Jun 16 '22 00:06 NRHGDW

Yes, you are right. idiot

DSPerson avatar Jun 22 '22 03:06 DSPerson

Yes, the Chinese translations aren't very good. I think the root cause is that there isn't much training data available for Chinese.

PJ-Finlay avatar Jul 03 '22 16:07 PJ-Finlay

Looks like there's a lot of data for Chinese: https://github.com/Helsinki-NLP/Tatoeba-Challenge/blob/master/data/README-v2021-08-07.md

Can someone train an argos package with it, please? I really need good Chinese to English translation.

rafael3382 avatar Apr 01 '23 17:04 rafael3382

Maybe you can help us train a better Chinese model, @rafael3382? See https://github.com/argosopentech/argos-train

pierotofy avatar Apr 01 '23 18:04 pierotofy

The Chinese model was updated recently. Hopefully the new one is better.

https://community.libretranslate.com/t/improving-chinese-translations/364/

If we can find more data we could retrain again too.

PJ-Finlay avatar Apr 02 '23 16:04 PJ-Finlay

Still bad. How many GPUs would I need if I want to train it?

BackMountainDevil avatar Aug 03 '23 03:08 BackMountainDevil

https://huggingface.co/Helsinki-NLP/opus-mt-zh-en does a pretty good job. I wonder if we can use that.

```python
# pip install transformers
# pip install torch
# pip install sentencepiece
# pip install sacremoses

from transformers import MarianMTModel, MarianTokenizer

def chinese_to_english(text):
    # Note: the model and tokenizer are reloaded on every call;
    # load them once outside the function if translating many strings.
    model_name = 'Helsinki-NLP/opus-mt-zh-en'
    model = MarianMTModel.from_pretrained(model_name)
    tokenizer = MarianTokenizer.from_pretrained(model_name)

    # Tokenize the text into a PyTorch tensor of token IDs
    tokenized_text = tokenizer.encode(text, return_tensors="pt")

    # Translate the tokenized text
    translated_tokens = model.generate(tokenized_text)

    # Decode the translated tokens to a string
    translated_text = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)
    return translated_text

if __name__ == "__main__":
    chinese_text = input("Enter Chinese text: ")
    translated_text = chinese_to_english(chinese_text)
    print(f"Translated Text: {translated_text}")
```

mkunz7 avatar Oct 04 '23 00:10 mkunz7

New Chinese simplified/traditional models (from OPUS-MT) are up:

https://libretranslate.com/?source=zh&target=en&q=%E4%BD%A0%E5%A5%BD

How do they score?

Link to models thread: https://community.libretranslate.com/t/opus-mt-language-models-port-thread/757/2

pierotofy avatar Oct 19 '23 18:10 pierotofy
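
For anyone who wants to try other test phrases against the hosted demo, here is a minimal Python sketch that builds the same kind of link as the one above. It assumes the `source`, `target`, and `q` query parameters keep behaving as they do in that link; `demo_link` is a hypothetical helper name, not part of LibreTranslate.

```python
from urllib.parse import urlencode

# Base address taken from the demo link posted in this thread.
BASE_URL = "https://libretranslate.com/"


def demo_link(text, source="zh", target="en"):
    """Build a shareable demo URL with the text percent-encoded as UTF-8."""
    query = urlencode({"source": source, "target": target, "q": text})
    return f"{BASE_URL}?{query}"
```

Calling `demo_link("你好")` reproduces the link above, so the phrases from the original report can be checked the same way.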

Thanks @pierotofy! After reinstalling not only zh but also pl is now working great : )

gkielian avatar Oct 27 '23 22:10 gkielian
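
For readers using the argostranslate Python package directly, reinstalling a language pair might look roughly like the sketch below. It follows the usage shown in the argos-translate README (`update_package_index`, `get_available_packages`, `install_from_path`, `translate`); `pick_package` and `reinstall_and_translate` are hypothetical helper names, and whether this matches the reinstall described above (which may have been done through LibreTranslate) is an assumption.

```python
def pick_package(packages, from_code="zh", to_code="en"):
    """Return the first package matching the language pair, or None."""
    for pkg in packages:
        if pkg.from_code == from_code and pkg.to_code == to_code:
            return pkg
    return None


def reinstall_and_translate(text, from_code="zh", to_code="en"):
    """Download the current package for the pair, install it, and translate text."""
    # Imported lazily so pick_package can be used without argostranslate installed.
    import argostranslate.package
    import argostranslate.translate

    # Refresh the remote package index so the newest models are listed.
    argostranslate.package.update_package_index()
    available = argostranslate.package.get_available_packages()

    pkg = pick_package(available, from_code, to_code)
    if pkg is None:
        raise LookupError(f"No package found for {from_code} -> {to_code}")

    # Download the package file and install it over any existing copy.
    argostranslate.package.install_from_path(pkg.download())
    return argostranslate.translate.translate(text, from_code, to_code)
```

Something like `reinstall_and_translate("你好")` would then fetch the current zh to en package and translate with it. This needs network access and the `argostranslate` package installed.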