
NllbTokenizer/NllbTokenizerFast inserts language code incorrectly when tokenizing target text

Open ddaspit opened this issue 3 years ago • 6 comments

System Info

  • transformers version: 4.23.1
  • Platform: Windows-10-10.0.19044-SP0
  • Python version: 3.8.10
  • Huggingface_hub version: 0.10.1
  • PyTorch version (GPU?): 1.10.1+cu111 (True)
  • Tensorflow version (GPU?): 2.7.3 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help?

@LysandreJik @sgugger @SaulLu

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

According to the documentation for NllbTokenizer,

The tokenization method is <tokens> <eos> <language code> for source language documents, and <language code> <tokens> <eos> for target language documents.

When you tokenize target text, it incorrectly inserts the language code at the end of the sentence instead of the beginning.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
tokenizer.tgt_lang = "eng_Latn"
article = "UN Chief says there is no military solution in Syria"
tokens = tokenizer(text_target=article).tokens()

tokens has the value:

['▁UN', '▁Chief', '▁says', '▁there', '▁is', '▁no', '▁military', '▁solution', '▁in', '▁Syria', '</s>', 'eng_Latn']

Expected behavior

tokens should have the value:

['eng_Latn', '▁UN', '▁Chief', '▁says', '▁there', '▁is', '▁no', '▁military', '▁solution', '▁in', '▁Syria', '</s>']
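For illustration, the two layouts can be sketched in plain Python (token strings only; this is not the tokenizer's actual implementation):

```python
def suffix_layout(tokens, lang_code, eos="</s>"):
    # observed behavior: language code appended after EOS
    return tokens + [eos, lang_code]

def prefix_layout(tokens, lang_code, eos="</s>"):
    # documented/expected behavior for target text: language code first
    return [lang_code] + tokens + [eos]

toks = ["▁UN", "▁Chief", "▁says"]
print(suffix_layout(toks, "eng_Latn"))  # ['▁UN', '▁Chief', '▁says', '</s>', 'eng_Latn']
print(prefix_layout(toks, "eng_Latn"))  # ['eng_Latn', '▁UN', '▁Chief', '▁says', '</s>']
```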

ddaspit avatar Oct 28 '22 08:10 ddaspit

After further research, I believe it is intended that PretrainedConfig.decoder_start_token_id should be used to insert the lang code at the beginning of the sentence during fine-tuning of NLLB, and that the NllbTokenizer class is working as intended. If that is true, then the documentation for NllbTokenizer should be corrected, and the run_translation.py script should be fixed to properly set decoder_start_token_id when the NllbTokenizer is being used (similar to mBART).
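The decoder_start_token_id comes into play through the usual shift-right step that turns labels into decoder inputs. A minimal sketch of that mechanism, using hypothetical token ids (this is not the library's actual code):

```python
def shift_tokens_right(labels, decoder_start_token_id):
    # Labels of the form [lang, ..., eos] become decoder inputs
    # [decoder_start, lang, ...]: prepend the start id, drop the last token.
    return [decoder_start_token_id] + labels[:-1]

EOS, ENG_LATN = 2, 256047  # hypothetical token ids for illustration
labels = [ENG_LATN, 100, 200, EOS]
print(shift_tokens_right(labels, EOS))  # [2, 256047, 100, 200]
```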

ddaspit avatar Nov 03 '22 08:11 ddaspit

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Nov 27 '22 15:11 github-actions[bot]

@ddaspit where did you get the information "after further research"? When I read the mBART paper, the language token is indeed suffixed after the source sequence, and the target language token is prefixed. When reading the NLLB paper, it says the language tokens are prefixed to both SRC and TGT (page 48). What I don't understand is exactly which BOS/EOS tokens are used, and where. Usually we don't put any BOS/EOS in the source, and we put BOS/EOS at the beginning/end of the target sequence (at training).

Did you get different info?

vince62s avatar Mar 18 '23 09:03 vince62s

You are correct @vince62s, the paper clearly states that, contrary to other models, the src_lang token is placed at the beginning of the input sequence. When adding NLLB-200 to the library, I checked that the outputs are a tiny bit different if you change this behavior. The fix is to change the prefix token and suffix token attributes of the tokenizer class. Will open a PR after checking that this will not affect the current setup. The BOS token is never used in fairseq, while the EOS is used as the BOS token. Indeed, the decoder_input_ids passed to the model are [eos, tgt_lang, ...] (when generating) and [tgt_lang, ..., eos] when training.

ArthurZucker avatar Mar 22 '23 13:03 ArthurZucker

One slight correction: [eos, tgt_lang, ..., eos] on target side at training. Since I implemented NLLB-200 support in OpenNMT-py, I can confirm that prefixing instead of suffixing improves BLEU scores slightly. https://forum.opennmt.net/t/nllb-200-with-opennmt-py-the-good-the-bad-and-the-ugly/5151

vince62s avatar Mar 22 '23 13:03 vince62s

The "research" I was referring to was entirely about how to properly use HuggingFace Transformers to prefix the target sentences with the lang code during decoding. I was not aware that source sentences were supposed to be prefixed with the lang code. We use NLLB pretty heavily, so I would be very happy to see this fixed.

ddaspit avatar Mar 24 '23 10:03 ddaspit

Hi everyone! I was also concerned with the behavior of the NLLB tokenizer at HF, so, even before discovering this issue, I made two of my own "experiments" to verify that the behavior of the tokenizer should be fixed.

  1. I computed BLEU for one high-resource language pair (English-French) and one low-resource pair (Yoruba->Bashkir) from FLORES-200 with the [..., src_lang, eos] and [src_lang, ..., eos] templates. For both directions, a significant proportion of translations (20% and 66%, respectively) differs depending on the tokenization method. For eng-fra, the new tokenization leads to a small loss in BLEU (-0.09), whereas for yor-bak, there is a larger gain (+0.21). While thorough research would require investigating more translation directions, these results already hint that the [src_lang, ..., eos] tokenization may be beneficial.

  2. I tweaked the Fairseq code for running inference with the original NLLB implementation so that it prints the full tokenized source and translation during inference. The output confirms that the original implementation uses [src_lang, ..., eos] as the source and [tgt_lang, ..., eos] as the translation output (which implies using [eos, tgt_lang, ...] as the translation decoder input during training, because, as stated in the comments above, Fairseq uses eos instead of bos for translation).
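Putting these conventions together, the fairseq-style templates can be sketched in plain Python (illustrative only, with token strings standing in for ids):

```python
EOS = "</s>"

def encode_source(tokens, src_lang):
    # original implementation: [src_lang, ..., eos] on the source side
    return [src_lang] + tokens + [EOS]

def target_labels(tokens, tgt_lang):
    # target output: [tgt_lang, ..., eos]
    return [tgt_lang] + tokens + [EOS]

def decoder_inputs(labels):
    # fairseq uses eos in place of bos, so training decoder inputs
    # are [eos, tgt_lang, ...]
    return [EOS] + labels[:-1]

src = encode_source(["▁Hello"], "eng_Latn")
lbl = target_labels(["▁Bonjour"], "fra_Latn")
print(src)                  # ['eng_Latn', '▁Hello', '</s>']
print(decoder_inputs(lbl))  # ['</s>', 'fra_Latn', '▁Bonjour']
```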

The code and outputs for both experiments can be found in my Colab notebook.

These experiments confirm that #22313 is implementing exactly the right tokenization method; great thanks for that!

avidale avatar Mar 31 '23 21:03 avidale

Awesome, thanks for fixing this issue.

ddaspit avatar Apr 14 '23 04:04 ddaspit

Hi all, a slight continuation of the above, because I had assumed the inclusion of the language code was a bug. Now, seeing that it's intentionally included, my question is:

Q: Is it possible to have the language code NOT included in the output of the inference results?

In practice, I already know what the output language code is because I sent it to the model. I don't need it duplicated in the results; it only causes unnecessary post-process cleaning that adds more if-then special cases to my code.

I've tried removing forced_bos_token_id=bos_target_lang from the generate call, but that impairs the translation completely.

Is it possible to switch this behaviour off? the only output I want from inference is the translation of the input text.

Thanks

gidzr avatar Nov 23 '23 00:11 gidzr

Self-helped, answering my own question: pass skip_special_tokens=True when decoding, i.e. tokenizer.decode(generated_tokens[0], skip_special_tokens=True). (Note that skip_special_tokens is a decode-time argument, not an argument to tokenizer() itself.)

gidzr avatar Nov 23 '23 01:11 gidzr
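What skip_special_tokens does at decode time can be sketched as a simple filter (illustrative only; the real work happens inside tokenizer.decode(..., skip_special_tokens=True), and the set below is a hypothetical subset of the actual special tokens):

```python
SPECIAL_TOKENS = {"</s>", "<pad>", "<unk>", "eng_Latn", "fra_Latn"}  # illustrative subset

def strip_special(tokens):
    # drop special tokens (EOS, padding, language codes) from decoded output
    return [t for t in tokens if t not in SPECIAL_TOKENS]

out = ["fra_Latn", "▁Bonjour", ",", "▁ça", "▁va", "</s>"]
print(strip_special(out))  # ['▁Bonjour', ',', '▁ça', '▁va']
```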

I got the same issue, with the source language being set to Latin by default and not changing at all. This seems like a straight-up bug, because the code from the Hugging Face website returns the wrong result.

I can overwrite things manually, but this is not to spec.

nevakrien avatar Dec 24 '23 07:12 nevakrien

For anyone on a newer version of transformers who has issues with this, here is the easy (dirty) fix:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Setup added for completeness; the language codes are examples
tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="eng_Latn", tgt_lang="fra_Latn"
)
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

def translate_text(text):
    # Tokenize the text
    encoded_text = tokenizer(text, return_tensors="pt")
    # Manual fix for the tokenizer bug: overwrite the misplaced language
    # token at position 1 with the source language id
    encoded_text["input_ids"][:, 1] = tokenizer.lang_code_to_id[tokenizer.src_lang]

    generated_tokens = model.generate(
        **encoded_text,
        forced_bos_token_id=tokenizer.lang_code_to_id[tokenizer.tgt_lang],
    )

    # Decode and return the translated text
    return tokenizer.decode(generated_tokens[0])  # , skip_special_tokens=True


text_to_translate = "Hello, how are you?"

translated_text = translate_text(text_to_translate)
print(translated_text)

Notice I am overwriting wrong tokens generated by the tokenizer that shouldn't be there. This won't break versions with a working tokenizer, and since the current version causes these issues, I would recommend not relying on the tokenizer to do its job if you plan on upgrading versions.

nevakrien avatar Dec 24 '23 08:12 nevakrien

Could you share a repro to your issue and the transformers version you are using? I am not sure I understand your recommendation:

I would recommand not relying on the tokenizer to do its job if you plan on upgrading versions.

the bug is fixed and the language is properly handled, so could you elaborate?

ArthurZucker avatar Jan 02 '24 17:01 ArthurZucker