Cannot translate whole paragraphs/sentences
🐛 Bug
When translating eng_Latn to zho_Hant, parts of the input are always missing from the translation. This doesn't happen with zho_Hans. Even yue_Hant does better than zho_Hant.
To Reproduce
Steps to reproduce the behavior (always include the command you ran):
import ctranslate2
import transformers

src_lang = "eng_Latn"
tgt_lang = "zho_Hant"

translator = ctranslate2.Translator("nllb-200-distilled-600M")
tokenizer = transformers.AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang=src_lang)

content = """
A database of Chinese surnames and Chinese given names (1930-2008). This database contains nationwide frequency statistics of 1,806 Chinese surnames and 2,614 Chinese characters used in given names, covering about 1.2 billion Han Chinese population (96.8% of the Han Chinese household-registered population born from 1930 to 2008 and still alive in 2008). This package also contains a function for computing multiple features of Chinese surnames and Chinese given names for scientific research (e.g., name uniqueness, name gender, name valence, and name warmth/competence).
"""

source = tokenizer.convert_ids_to_tokens(tokenizer.encode(content))
target_prefix = [tgt_lang]
results = translator.translate_batch([source], target_prefix=[target_prefix])
target = results[0].hypotheses[0][1:]

print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target)))
In this case, "A database of Chinese surnames and Chinese given names (1930-2008)." is not translated. The same issue happens when using transformers alone.
It happens with nllb-200-distilled-1.3B too.
I have the same problem: the model just drops sentences. I tried nllb-200-1.3B and nllb-200-distilled-600M with the same result. Sometimes, if I delete the spaces between sentences and change capital letters to lowercase, it starts to translate the whole text, but that is not a solution.
All the NLLB models were trained mostly on single-sentence translation, and they are by no means guaranteed to correctly translate multiple-sentence texts.
Thus, the safest recommendation is to split the input text into sentences, translate them independently from each other, and concatenate the translations into the resulting text.
Here is the code that was used during text preprocessing for NLLB to split documents into sentences for various languages: https://github.com/facebookresearch/stopes/blob/main/stopes/pipelines/monolingual/utils/sentence_split.py.
An example of its usage is below:
!pip install stopes[mono] botok khmer-nltk laonlp
# botok is a dependency of stopes used for splitting Tibetan languages
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from stopes.pipelines.monolingual.utils.sentence_split import get_split_algo
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
content = """
A database of Chinese surnames and Chinese given names (1930-2008). This database contains nationwide frequency statistics of 1,806 Chinese surnames and 2,614 Chinese characters used in given names, covering about 1.2 billion Han Chinese population (96.8% of the Han Chinese household-registered population born from 1930 to 2008 and still alive in 2008). This package also contains a function for computing multiple features of Chinese surnames and Chinese given names for scientific research (e.g., name uniqueness, name gender, name valence, and name warmth/competence).
"""
inputs = tokenizer(content, return_tensors="pt")
translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("zho_Hant"),
    num_beams=4,
)
for translated in tokenizer.batch_decode(translated_tokens, skip_special_tokens=True):
    print(translated)
# the output is significantly shortened
# 這個資料庫包含1,806個中國姓氏和 2,614個中文字符在姓氏中使用的全國頻率統計,覆蓋約120億漢族中文人口 (96.8%的漢族家庭注冊人口出生於 1930 年至 2008 年,仍活在 2008 年). 這個包還包含計算中文姓氏和中文姓氏的多個特征的功能,用于科學研究 (例如,姓氏獨特性,姓氏性別,姓氏溫度/能力).
# now split the content into individual sentences, just as NLLB was supposed to work!
splitter = get_split_algo("eng", "default")
input_sentences = list(splitter(content))
print(len(input_sentences)) # 3
inputs = tokenizer(input_sentences, return_tensors="pt", padding=True)
translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("zho_Hant"),
    num_beams=4,
)
for translated in tokenizer.batch_decode(translated_tokens, skip_special_tokens=True):
    print(translated)
# 數據庫中文姓氏和中文姓氏 (1930-2008).
# 這個數據庫包含全國 1,806 個中國姓氏和 2,614 個中文字符的頻率統計,覆蓋約120 億漢族中文人口 (96.8% 的漢族中家庭登記人口出生於 1930 年至 2008 年,仍活在 2008 年).
# 這套裝還包含計算中文姓氏和科學研究中文姓氏的多個特征的功能 (例如,名稱獨特性,名稱性別,名稱值和名稱溫度/能力).
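To complete the recommendation above (translate each sentence separately and concatenate the results), here is a minimal sketch of the same split / translate / join approach adapted to the reporter's ctranslate2 setup. It assumes the converted model directory "nllb-200-distilled-600M" from the original report and reuses the `content` string defined above; the final joining step is a plain string concatenation, which you may want to adjust for languages written without spaces.

import ctranslate2
import transformers
from stopes.pipelines.monolingual.utils.sentence_split import get_split_algo

src_lang = "eng_Latn"
tgt_lang = "zho_Hant"

# "nllb-200-distilled-600M" is assumed to be the local directory converted for
# ctranslate2, as in the original report.
translator = ctranslate2.Translator("nllb-200-distilled-600M")
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang=src_lang
)

# Split the paragraph with the same splitter used for NLLB preprocessing.
splitter = get_split_algo("eng", "default")
sentences = list(splitter(content))  # `content` is the paragraph defined above

# Tokenize every sentence separately and translate them as a single batch.
sources = [tokenizer.convert_ids_to_tokens(tokenizer.encode(s)) for s in sentences]
results = translator.translate_batch(sources, target_prefix=[[tgt_lang]] * len(sources))

# Decode each hypothesis (dropping the leading language token) and join the pieces.
translations = [
    tokenizer.decode(tokenizer.convert_tokens_to_ids(r.hypotheses[0][1:]))
    for r in results
]
print(" ".join(translations))  # for zho_Hant output, "".join(translations) may read better

Translating the sentences as one batch also keeps each input short, which is the single-sentence regime the NLLB models were trained on.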
See also similar discussions in the Seamless Communication repo.