parsinlu
Machine translation - long sentences cause incomplete translation
I'm translating English sentences into Farsi with mt5-base-parsinlu-translation_en_fa (from Hugging Face). For sentences longer than around eight words, only the first part of the sentence gets translated; the rest is ignored. For example:
English sentences:
Terry's side fell to their second Premier League loss of the season at Loftus Road
Following a four-day hiatus, UN envoy Ismail Ould Cheikh Ahmed on Thursday will resume mediation efforts in the second round of Kuwait-hosted peace talks between Yemen’s warring rivals.
Mark Woods is a writer and broadcaster who has covered the NBA, and British basketball, for over a decade.
Translations:
طرفدار تری در فوتبال دوم فصل در لئوپوس رود به
پس از چهار روز توقف، سفیر سازمان ملل، ایمیل اولد شیخ
مارک ولز نویسنده و پخش کننده ای است که بیش از یک دهه
which according to Google Translate translates back to this:
More fans in the second football season in Leopard
After a four-day hiatus, the ambassador to the United Nations, Old Sheikh Sheikh
Mark Wells has been a writer and broadcaster for over a decade
I can't find any configuration settings that would be limiting the number of tokens being translated. Here is my code:
#!/usr/bin/python3
import sys
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

device = "cuda:0"
model_dir = sys.argv[1] + "persiannlp"
size = "base"
mname = f'{model_dir}/data/mt5-{size}-parsinlu-translation_en_fa'
tokenizer = MT5Tokenizer.from_pretrained(mname)
model = MT5ForConditionalGeneration.from_pretrained(mname)
model = model.to(device)

lines = []
# Read sentences from stdin, batching them until an 'EOD' marker,
# then translate the batch; an 'EOF' marker ends the run.
for line in sys.stdin:
    line = line.strip()
    if line == 'EOD':
        inputs = tokenizer(lines, return_tensors="pt", padding=True).to(device)
        translated = model.generate(**inputs)
        for t in translated:
            print(tokenizer.decode(t, skip_special_tokens=True))
        print('EOL')
        sys.stdout.flush()
        lines.clear()
    elif line.startswith('EOF'):
        sys.exit(0)
    else:
        lines.append(line)
sys.exit(0)
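For reference, one place such a limit could hide is in the model's generation defaults. A minimal sketch for inspecting them (using the Hugging Face hub copy of the same model) would be:

from transformers import MT5ForConditionalGeneration

model = MT5ForConditionalGeneration.from_pretrained(
    "persiannlp/mt5-base-parsinlu-translation_en_fa"
)
# Fields that generate() falls back to when no explicit arguments are passed:
print(model.config.max_length)
print(model.config.num_beams)
print(model.config.early_stopping)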
@gregorybrooks sorry for the delayed response! 👋 I am not sure what the root of the issue is, unfortunately. I tried the online demo here and it seems to match your observations. I am honestly not sure why this is happening. Just sharing my 2 cents:
- Data: I think the data has longer sentences (you should be able to verify this; a rough sketch is after this list).
- Decoding: it's possible that Huggingface has some decoding strategy config/behavior that we don't quite understand.
- Training: it might be that we made a mistake in training these models. If so, maybe it's worth training your own model from scratch and monitoring its behavior.
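A minimal sketch for the data check, assuming the Farsi side of the training data is available one sentence per line (the train.fa path here is hypothetical):

import statistics
from transformers import MT5Tokenizer

tokenizer = MT5Tokenizer.from_pretrained("persiannlp/mt5-base-parsinlu-translation_en_fa")

# Hypothetical file: target-side training sentences, one per line.
with open("train.fa", encoding="utf-8") as f:
    lengths = [len(tokenizer.encode(line.strip())) for line in f if line.strip()]

print("mean tokens:", statistics.mean(lengths))
print("max tokens:", max(lengths))
print("share over 20 tokens:", sum(n > 20 for n in lengths) / len(lengths))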
Same problem when I try it on Google Colab.
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

model_size = "small"
model_name = f"persiannlp/mt5-{model_size}-parsinlu-translation_en_fa"
tokenizer = MT5Tokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

def run_model(input_string, **generator_args):
    input_ids = tokenizer.encode(input_string, return_tensors="pt")
    print(input_ids)
    print(len(input_ids[0]))  # number of input tokens
    res = model.generate(input_ids, **generator_args)
    print(res)
    print(len(res[0]))  # number of generated tokens
    output = tokenizer.batch_decode(res, skip_special_tokens=True)
    print(output)

sent = "The Iran–Iraq War was a protracted armed conflict that began on 22 September 1980 with a full-scale invasion of Iran by neighbouring Iraq. The war lasted for almost eight years, and ended in a stalemate on 20 August 1988, when Iran accepted Resolution 598 of the United Nations Security Council."
run_model(sent)
Result:
tensor([[ 486, 19255, 1326, 36986, 4576, 639, 259, 262, 731,
99155, 345, 259, 178869, 31320, 533, 390, 2739, 351,
1024, 3258, 17522, 514, 259, 262, 3622, 264, 31749,
259, 154171, 304, 19255, 455, 259, 134309, 347, 259,
36986, 260, 486, 2381, 3167, 345, 332, 259, 262,
28746, 49889, 3127, 261, 305, 259, 57830, 281, 259,
262, 28604, 79328, 351, 628, 3155, 18494, 261, 259,
1909, 19255, 12004, 345, 259, 91698, 147677, 304, 287,
4248, 259, 35577, 19004, 28996, 260, 1]])
79
tensor([[ 0, 10948, 4379, 341, 259, 35125, 343, 2665, 259, 11783,
376, 259, 22838, 7244, 85200, 33040, 376, 3418, 934, 509]])
20
['جنگ ایران و عراق، یک حمله طولانی مسلحانه بود که در']
You can see that the tokenizer is doing a good job, but the model is really limiting the output length. A workaround is to add max_length to the generate() arguments so it generates more tokens:
def run_model(input_string, **generator_args):
    input_ids = tokenizer.encode(input_string, return_tensors="pt")
    print(input_ids)
    print(len(input_ids[0]))
    # Raise the generation length cap explicitly:
    res = model.generate(input_ids, max_length=100, **generator_args)
    print(res)
    print(len(res[0]))
    output = tokenizer.batch_decode(res, skip_special_tokens=True)
    print(output)
Result:
tensor([[ 486, 19255, 1326, 36986, 4576, 639, 259, 262, 731,
99155, 345, 259, 178869, 31320, 533, 390, 2739, 351,
1024, 3258, 17522, 514, 259, 262, 3622, 264, 31749,
259, 154171, 304, 19255, 455, 259, 134309, 347, 259,
36986, 260, 486, 2381, 3167, 345, 332, 259, 262,
28746, 49889, 3127, 261, 305, 259, 57830, 281, 259,
262, 28604, 79328, 351, 628, 3155, 18494, 261, 259,
1909, 19255, 12004, 345, 259, 91698, 147677, 304, 287,
4248, 259, 35577, 19004, 28996, 260, 1]])
79
tensor([[ 0, 10948, 4379, 341, 259, 35125, 343, 2665, 259,
11783, 376, 259, 22838, 7244, 85200, 33040, 376, 3418,
934, 509, 1024, 15140, 636, 68820, 18430, 122748, 768,
2741, 130744, 8878, 572, 695, 4379, 554, 259, 13361,
259, 35125, 259, 17213, 3164, 260, 10948, 22625, 59491,
259, 37033, 3037, 259, 22838, 20275, 1555, 341, 509,
3939, 2408, 259, 27895, 48129, 153840, 259, 26598, 259,
14594, 343, 259, 5143, 406, 4379, 259, 9898, 259,
13727, 1845, 14727, 6916, 572, 916, 259, 30887, 3716,
260, 1]])
83
['جنگ ایران و عراق، یک حمله طولانی مسلحانه بود که در 22 سپتامبر ۲۰۸۰ با تهاجم کامل از ایران به توسط عراق شروع شد. جنگ تقریبا هشت سال طول کشید و در بیست اوت ۱۹۸۸ پایان یافت، وقتی ایران مجلس امنیت سازمان ملل را قبول کرد.']
max_length is None by default in the generate() signature, so at first glance there should not be any limit on how many tokens the model generates. However, when max_length is not passed explicitly, generation falls back to model.config.max_length, and that config field defaults to 20 in transformers. That matches the 20-token output above and would explain why the problem exists in the first place.
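A quick way to confirm this, as a sketch against the small model used above:

from transformers import MT5ForConditionalGeneration

model = MT5ForConditionalGeneration.from_pretrained(
    "persiannlp/mt5-small-parsinlu-translation_en_fa"
)
# generate() falls back to model.config.max_length when max_length isn't passed.
# If this prints 20 (the transformers default for the field), it matches the
# 20-token output above and pins the truncation on the config, not the weights.
print(model.config.max_length)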