
Hallucination with numbers: NLLB English to Spanish translation


Incorrect translation when translating English to Spanish.

When the source input contains only numbers and no letters, the translated output is completely incorrect. Please see the examples below.

Model used: NLLB-200 distilled 600M (facebook/nllb-200-distilled-600M).

Predicted Output

For the inputs below:

print(predict('1'))
output: ['El 1']

print(predict('1 2 3 4 5 6 7 8 9 10'))
output: ['2 3 4 5 6 7 8 9 10']

print(predict('102'))
output: ['El número de personas']

From Google Translate (es -> en): El número de personas -> the number of people

print(predict('6171-1231-1311-1231'))
output: ['El número de personas que se encuentran en el mercado']

From Google Translate: El número de personas que se encuentran en el mercado -> The number of people in the market
### Slight improvement in output when words are provided as context

print(predict('it\'s 1'))
output: ['Es un 1']

print(predict('we count number : 1 2 3 4 5 6 7 8 9 10'))
output: ['Cuentan el número: 1 2 3 4 5 6 7 8 9 10']

print(predict('hey its a 102'))
output: ['Oye, es un 102']

print(predict('your code is 6171-1231-1311-1231'))
output: ['Su código es 6171-1231-1311-1231']

##### Output from facebook/nllb-200-1.3B

print(predict('1'))
output: ['El 1 de']

print(predict('1 2 3 4 5 6 7 8 9 10'))
output: ['1 2 3 4 5 6 7 8 9 10']

print(predict('102'))
output: ['102 y']

print(predict('6171-1231-1311-1231'))
output: ['6171-1231-1311-1231 El número de personas']

print(predict('it\'s 1'))
output: ['Es 1']

print(predict('we count number : 1 2 3 4 5 6 7 8 9 10'))
output: ['contamos el número: 1 2 3 4 5 6 7 8 9 10']

print(predict('hey its a 102'))
output: ['Es un 102.']

print(predict('your code is 6171-1231-1311-1231'))
output: ['Su código es 6171-1231-1311-1231']

This problem seems to persist when using the bigger model.

Steps to reproduce the behavior

Code sample

import time

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers import NllbTokenizerFast

# model_name = "facebook/nllb-200-1.3B"
model_name = "facebook/nllb-200-distilled-600M"

# tokenizer = AutoTokenizer.from_pretrained(model_name)
print('Loading Model')
t1 = time.time()
# Tokenizer configured for English (eng_Latn) -> Spanish (spa_Latn)
tokenizer = NllbTokenizerFast.from_pretrained(model_name, src_lang="eng_Latn", tgt_lang="spa_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
print('Time taken : ', time.time() - t1)


def predict(x):
    # Tokenize the source text
    inputs = tokenizer(x, return_tensors="pt", padding=True)

    # Force the first generated token to be the Spanish language code,
    # which is how NLLB selects the output language
    translated_tokens = model.generate(
        **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["spa_Latn"], max_length=100
    )

    return tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)

Environment

  • PyTorch version: 1.12.0+cpu
  • OS: Ubuntu
  • Python version: 3.9.12
  • transformers: 4.21.1

Possible corrections

  1. Convert all numbers to words; this seems to help (see the sketch below).
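
A minimal sketch of that workaround, assuming the third-party num2words package is installed; the replace_numbers helper is hypothetical and only illustrates spelling out digit runs before calling the predict function above:

import re

from num2words import num2words  # pip install num2words

def replace_numbers(text):
    # Spell out each run of digits in English, e.g. '102' -> 'one hundred and two'
    return re.sub(r'\d+', lambda m: num2words(int(m.group())), text)

print(predict(replace_numbers('102')))

Note that this also spells out identifiers such as 6171-1231-1311-1231, which may not be desirable when codes should pass through unchanged.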

evilc3 commented Nov 09 '22

anyone?

evilc3 commented Nov 11 '22

Can confirm a similar issue. Here's our code:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M",
                                          use_auth_token=False, local_files_only=True, src_lang="spa_Latn")

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M",
                                              use_auth_token=False, local_files_only=True)

article = "123"
inputs = tokenizer(article, return_tensors="pt")

# forced_bos_token_id makes the first generated token the target-language code
# so the model decodes into English
translated_tokens = model.generate(
    **inputs, forced_bos_token_id=tokenizer.encode("eng_Latn")[1], max_length=30)

output = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]

print(output)

Output:

The Commission shall adopt implementing acts.

When changing to src_lang="tgk_Cyrl", the output is "123 What is the meaning of life?".

jrobble commented Jul 16 '24