fairseq
Hallucination with numbers: NLLB English-to-Spanish translation
Incorrect translations when translating English to Spanish.
When the source input contains only numbers and no letters, the translated output is completely incorrect. Please see the examples below.
Model used: NLLB-200 distilled 600M.
Predicted Output
For input:
print(predict('1'))
output: ['El 1']
print(predict('1 2 3 4 5 6 7 8 9 10'))
output: ['2 3 4 5 6 7 8 9 10']
print(predict('102'))
output: ['El número de personas']
(Google es-en translation: El número de personas -> the number of people)
print(predict('6171-1231-1311-1231'))
output: ['El número de personas que se encuentran en el mercado']
(Google translation: El número de personas que se encuentran en el mercado -> The number of people in the market)
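One way to flag this failure mode automatically (an illustrative heuristic of ours, not part of NLLB or fairseq) is to check whether the digit runs in the source survive into the hypothesis:

```python
import re

def digits_preserved(src: str, hyp: str) -> bool:
    """Hallucination heuristic: a faithful translation of a numeric input
    should contain the same digit runs as the source (compared here as
    sorted multisets, so reordering is tolerated)."""
    return sorted(re.findall(r"\d+", src)) == sorted(re.findall(r"\d+", hyp))

print(digits_preserved("102", "El número de personas"))  # False -> likely hallucination
print(digits_preserved("102", "Oye, es un 102"))         # True
```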
### Slight improvement in output when words are provided as context
print(predict('it\'s 1'))
output: ['Es un 1']
print(predict('we count number : 1 2 3 4 5 6 7 8 9 10'))
output: ['Cuentan el número: 1 2 3 4 5 6 7 8 9 10']
print(predict('hey its a 102'))
output: ['Oye, es un 102']
print(predict('your code is 6171-1231-1311-1231'))
output: ['Su código es 6171-1231-1311-1231']
### Output from facebook/nllb-200-1.3B
print(predict('1'))
output : ['El 1 de']
print(predict('1 2 3 4 5 6 7 8 9 10'))
output: ['1 2 3 4 5 6 7 8 9 10']
print(predict('102'))
output: ['102 y']
print(predict('6171-1231-1311-1231'))
output: ['6171-1231-1311-1231 El número de personas']
print(predict('it\'s 1'))
output: ['Es 1']
print(predict('we count number : 1 2 3 4 5 6 7 8 9 10'))
output: ['contamos el número: 1 2 3 4 5 6 7 8 9 10']
print(predict('hey its a 102'))
output: ['Es un 102.']
print(predict('your code is 6171-1231-1311-1231'))
output: ['Su código es 6171-1231-1311-1231']
This problem seems to persist when using the bigger model.
Steps to reproduce the behavior
Code sample
```python
import time

from transformers import AutoModelForSeq2SeqLM, NllbTokenizerFast

# model_name = "facebook/nllb-200-1.3B"
model_name = "facebook/nllb-200-distilled-600M"

print('Loading model')
t1 = time.time()
tokenizer = NllbTokenizerFast.from_pretrained(model_name, src_lang="eng_Latn", tgt_lang="spa_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
print('Time taken:', time.time() - t1)

def predict(x):
    inputs = tokenizer(x, return_tensors="pt", padding=True)
    translated_tokens = model.generate(
        **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["spa_Latn"], max_length=100
    )
    return tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)
```
Environment
- PyTorch version: 1.12.0+cpu
- OS : Ubuntu
- Python version: 3.9.12
- transformers : 4.21.1
Possible corrections
- Convert all numbers to words; this seems to help.
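A minimal sketch of that correction, assuming a simple digit-by-digit spelling is acceptable (the `DIGIT_WORDS` table and the regex are illustrative; a real system might use a dedicated number-to-words library instead):

```python
import re

# Illustrative digit-to-word table; spells numbers out one digit at a time.
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def spell_out_digits(text: str) -> str:
    """Replace each run of digits with its digits spelled out, so the
    model sees words rather than a bare numeric string."""
    def repl(match: re.Match) -> str:
        return " ".join(DIGIT_WORDS[d] for d in match.group(0))
    return re.sub(r"\d+", repl, text)

print(spell_out_digits("102"))  # one zero two
```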
anyone?
Can confirm a similar issue. Here's our code:
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M",
                                          use_auth_token=False, local_files_only=True,
                                          src_lang="spa_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M",
                                              use_auth_token=False, local_files_only=True)

article = "123"
inputs = tokenizer(article, return_tensors="pt")
translated_tokens = model.generate(
    **inputs, forced_bos_token_id=tokenizer.encode("eng_Latn")[1], max_length=30)
output = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
print(output)
```
Output:
The Commission shall adopt implementing acts.
When changing to src_lang="tgk_Cyrl", the output is "123 What is the meaning of life?"
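Until the model-side issue is fixed, a pragmatic guard (our assumption, not an official workaround) is to skip translation entirely when the input contains no letters, since a purely numeric string should pass through unchanged:

```python
import re

def needs_translation(text: str) -> bool:
    """True if the text contains at least one alphabetic character
    ([^\W\d_] matches letters but not digits, punctuation, or underscore)."""
    return bool(re.search(r"[^\W\d_]", text))

def safe_predict(text: str, predict):
    # `predict` is the model call from the code sample earlier in this issue.
    if not needs_translation(text):
        return [text]  # pass numeric-only input through verbatim
    return predict(text)

print(needs_translation("6171-1231-1311-1231"))  # False
print(needs_translation("it's 1"))               # True
```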