parsinlu

Machine translation - long sentences cause incomplete translation

Open gregorybrooks opened this issue 3 years ago • 2 comments

I'm translating English sentences into Farsi with mt5-base-parsinlu-translation_en_fa (from Huggingface). For sentences longer than around eight words, only the first part of the sentence is translated; the rest is ignored. For example:

English sentences:

Terry's side fell to their second Premier League loss of the season at Loftus Road

Following a four-day hiatus, UN envoy Ismail Ould Cheikh Ahmed on Thursday will resume mediation efforts in the second round of Kuwait-hosted peace talks between Yemen’s warring rivals.

Mark Woods is a writer and broadcaster who has covered the NBA, and British basketball, for over a decade.

Translations:

طرفدار تری در فوتبال دوم فصل در لئوپوس رود به

پس از چهار روز توقف، سفیر سازمان ملل، ایمیل اولد شیخ

مارک ولز نویسنده و پخش کننده ای است که بیش از یک دهه

which, according to Google Translate, translate back to this:

More fans in the second football season in Leopard

After a four-day hiatus, the ambassador to the United Nations, Old Sheikh Sheikh

Mark Wells has been a writer and broadcaster for over a decade

I can't find any configuration setting that would be limiting the number of tokens being translated. Here is my code:

#!/usr/bin/python3
import sys

import torch
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

device = "cuda:0"

model_dir = sys.argv[1] + "persiannlp"
size = "base"
mname = f'{model_dir}/data/mt5-{size}-parsinlu-translation_en_fa'

tokenizer = MT5Tokenizer.from_pretrained(mname)
model = MT5ForConditionalGeneration.from_pretrained(mname).to(device)

# Read sentences from stdin; 'EOD' flushes the current batch, 'EOF' exits.
lines = []
for line in sys.stdin:
    line = line.strip()
    if line == 'EOD':
        inputs = tokenizer(lines, return_tensors="pt", padding=True).to(device)
        translated = model.generate(**inputs)
        for t in translated:
            print(tokenizer.decode(t, skip_special_tokens=True))
        print('EOL')
        sys.stdout.flush()
        lines.clear()
    elif line.startswith('EOF'):
        sys.exit(0)
    else:
        lines.append(line)
sys.exit(0)
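
A quick diagnostic sketch (not part of the original report; it reuses the tokenizer, model, and device above) to rule out tokenizer truncation and confirm the limit is on the generation side:

# Compare input and output lengths for one long sentence from the examples above.
sample = ["Mark Woods is a writer and broadcaster who has covered the NBA, and British basketball, for over a decade."]
inputs = tokenizer(sample, return_tensors="pt", padding=True).to(device)
print(inputs["input_ids"].shape)  # covers the full sentence, so no truncation on the input side
out = model.generate(**inputs)
print(out.shape)                  # if the second dimension sits near 20, generation length is capped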

gregorybrooks · Nov 08 '21

@gregorybrooks sorry for the delayed response! 👋 I am not sure what the root of the issue is, unfortunately. I tried the online demo, and it seems to match your observations. I am honestly not sure why this is happening. Just sharing my 2 cents:

  • Data: I believe the training data does contain longer sentences (you should be able to verify this), so truncated training data is probably not the cause.
  • Decoding: it's possible that Huggingface applies some default decoding configuration/behavior that we don't fully understand (see the snippet below for one way to inspect it).
  • Training: it might be that we made a mistake when training these models. If so, it may be worth training your own model from scratch and monitoring its behavior.
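
One way to check the Decoding hypothesis (a sketch added for illustration, not from the original comment) is to print the generation defaults stored in the model config, which generate() falls back to when no explicit arguments are passed:

from transformers import AutoConfig

# Load only the config; the model weights aren't needed for this check.
config = AutoConfig.from_pretrained("persiannlp/mt5-base-parsinlu-translation_en_fa")
print(config.max_length)      # default output-length cap (20 unless the checkpoint overrides it)
print(config.num_beams)       # default beam count
print(config.length_penalty)  # default length penalty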

danyaljj · Nov 15 '21

Same problem when I try it on Google Colab.

from transformers import MT5ForConditionalGeneration, MT5Tokenizer
model_size = "small"
model_name = f"persiannlp/mt5-{model_size}-parsinlu-translation_en_fa"
tokenizer = MT5Tokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

def run_model(input_string, **generator_args):
    input_ids = tokenizer.encode(input_string, return_tensors="pt")
    print(input_ids)
    print(len(input_ids[0]))  # number of input tokens
    res = model.generate(input_ids, **generator_args)
    print(res)
    print(len(res[0]))  # number of generated tokens
    output = tokenizer.batch_decode(res, skip_special_tokens=True)
    print(output)

sent = "The Iran–Iraq War was a protracted armed conflict that began on 22 September 1980 with a full-scale invasion of Iran by neighbouring Iraq. The war lasted for almost eight years, and ended in a stalemate on 20 August 1988, when Iran accepted Resolution 598 of the United Nations Security Council."
run_model(sent)

Result:

tensor([[   486,  19255,   1326,  36986,   4576,    639,    259,    262,    731,
          99155,    345,    259, 178869,  31320,    533,    390,   2739,    351,
           1024,   3258,  17522,    514,    259,    262,   3622,    264,  31749,
            259, 154171,    304,  19255,    455,    259, 134309,    347,    259,
          36986,    260,    486,   2381,   3167,    345,    332,    259,    262,
          28746,  49889,   3127,    261,    305,    259,  57830,    281,    259,
            262,  28604,  79328,    351,    628,   3155,  18494,    261,    259,
           1909,  19255,  12004,    345,    259,  91698, 147677,    304,    287,
           4248,    259,  35577,  19004,  28996,    260,      1]])
79
tensor([[    0, 10948,  4379,   341,   259, 35125,   343,  2665,   259, 11783,
           376,   259, 22838,  7244, 85200, 33040,   376,  3418,   934,   509]])
20
['جنگ ایران و عراق، یک حمله طولانی مسلحانه بود که در']

You can see that the tokenizer is doing a good job, but the model is really limiting the output length. A workaround is to pass max_length to generate so it produces more tokens:

def run_model(input_string, **generator_args):
    input_ids = tokenizer.encode(input_string, return_tensors="pt")
    print(input_ids)
    print(len(input_ids[0]))
    res = model.generate(input_ids, max_length=100, **generator_args)  # raise the output-length cap
    print(res)
    print(len(res[0]))
    output = tokenizer.batch_decode(res, skip_special_tokens=True)
    print(output)

Result:

tensor([[   486,  19255,   1326,  36986,   4576,    639,    259,    262,    731,
          99155,    345,    259, 178869,  31320,    533,    390,   2739,    351,
           1024,   3258,  17522,    514,    259,    262,   3622,    264,  31749,
            259, 154171,    304,  19255,    455,    259, 134309,    347,    259,
          36986,    260,    486,   2381,   3167,    345,    332,    259,    262,
          28746,  49889,   3127,    261,    305,    259,  57830,    281,    259,
            262,  28604,  79328,    351,    628,   3155,  18494,    261,    259,
           1909,  19255,  12004,    345,    259,  91698, 147677,    304,    287,
           4248,    259,  35577,  19004,  28996,    260,      1]])
79
tensor([[     0,  10948,   4379,    341,    259,  35125,    343,   2665,    259,
          11783,    376,    259,  22838,   7244,  85200,  33040,    376,   3418,
            934,    509,   1024,  15140,    636,  68820,  18430, 122748,    768,
           2741, 130744,   8878,    572,    695,   4379,    554,    259,  13361,
            259,  35125,    259,  17213,   3164,    260,  10948,  22625,  59491,
            259,  37033,   3037,    259,  22838,  20275,   1555,    341,    509,
           3939,   2408,    259,  27895,  48129, 153840,    259,  26598,    259,
          14594,    343,    259,   5143,    406,   4379,    259,   9898,    259,
          13727,   1845,  14727,   6916,    572,    916,    259,  30887,   3716,
            260,      1]])
83
['جنگ ایران و عراق، یک حمله طولانی مسلحانه بود که در 22 سپتامبر ۲۰۸۰ با تهاجم کامل از ایران به توسط عراق شروع شد. جنگ تقریبا هشت سال طول کشید و در بیست اوت ۱۹۸۸ پایان یافت، وقتی ایران مجلس امنیت سازمان ملل را قبول کرد.']

max_length is None by default in the signature of generate, but when it is None the method falls back to model.config.max_length, which defaults to 20 in Huggingface configs; that would explain why the unmodified call stopped at exactly 20 tokens. So the limit appears to come from the default generation config rather than from the model weights themselves.
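
For completeness, a sketch (assuming a transformers version recent enough to support it) using max_new_tokens, which states the cap on generated tokens explicitly; it reuses the tokenizer, model, and sent defined above:

def run_model(input_string, **generator_args):
    input_ids = tokenizer.encode(input_string, return_tensors="pt")
    # max_new_tokens caps the generated tokens; for this encoder-decoder model
    # it plays essentially the same role as max_length.
    res = model.generate(input_ids, max_new_tokens=128, **generator_args)
    return tokenizer.batch_decode(res, skip_special_tokens=True)

print(run_model(sent))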

ali-abz · Mar 26 '22