
Misalignment between documentation and implementation of mBART50 tokenisation for the decoder

Open devaansh100 opened this issue 1 year ago • 15 comments

System Info

  • transformers version: 4.23.1
  • Platform: Linux-5.10.133+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.14
  • Huggingface_hub version: 0.10.1
  • PyTorch version (GPU?): 1.12.1+cu113 (False)
  • Tensorflow version (GPU?): 2.8.2 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

@patil-suraj @SaulLu

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

The bug has been reproduced in the outputs of this colab notebook. The following are the steps to be followed:

  1. Make a copy of the notebook.
  2. Execute the first 2 cells.
  3. In the source file for mbart (/usr/local/bin/python3.7/dist-packages/transformers/models/mbart/modeling_mbart.py), on line 1352 (above outputs = self.model(..., after the if labels is not None block), add print(f'Decoder Input Ids: {decoder_input_ids}\nLabels: {labels}').
  4. Restart the runtime so the changes in the library take effect.
  5. Run the third cell (roughly reconstructed below the output). The output is:
Decoder Input Ids: tensor([[     2, 250020,  47711,   7844, 127666,      8,  18347,  18147,   1362,
            315,  42071,     36,  31563,   8454,  33796,    451,    346, 125577]])
Labels: tensor([[250020,  47711,   7844, 127666,      8,  18347,  18147,   1362,    315,
          42071,     36,  31563,   8454,  33796,    451,    346, 125577,      2]])
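
For completeness, the third cell is essentially the fine-tuning snippet from the docs; a rough reconstruction is given below (the actual notebook may differ slightly):

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50", src_lang="en_XX", tgt_lang="ro_RO"
)

src_text = " UN Chief Says There Is No Military Solution in Syria"
tgt_text = "Şeful ONU declară că nu există o soluţie militară în Siria"

# `labels` is produced alongside `input_ids`; passing it to the model triggers
# shift_tokens_right inside the forward pass, where the added print statement fires
model_inputs = tokenizer(src_text, text_target=tgt_text, return_tensors="pt")
model(**model_inputs)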

Expected behavior

I was looking into fine-tuning facebook/mbart-large-50 through this example in the documentation. As per the description, the expected input for the model is of the form [lang_id] tokens [eos] for both the encoder and the decoder.

While the MBart50Tokenizer produces outputs in the expected format, the decoder_input_ids get transformed into an incorrect one - [eos] [lang_id] tokens. Specifically, I believe the output should have been the following (do correct me if I am wrong here, though):

Decoder Input Ids: tensor([[   250020,  47711,   7844, 127666,      8,  18347,  18147,   1362,
            315,  42071,     36,  31563,   8454,  33796,    451,    346, 125577, 2]])
Labels: tensor([[47711,   7844, 127666,      8,  18347,  18147,   1362,    315,
          42071,     36,  31563,   8454,  33796,    451,    346, 125577,      2,  250020]])

This happens because the shift_tokens_right function does not seem to be adapted for mbart50. As per the docstring of this function,

wrap the last non pad token (the [LID] token)

however, for mbart50, the last non-pad token would be an eos.
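
For reference, my understanding of the rotation performed by shift_tokens_right is roughly the following (a paraphrased sketch, not the exact library code):

import torch

def shift_tokens_right_sketch(input_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    # Rough sketch: move the last non-pad token of each row to position 0
    # and shift everything else one step to the right.
    prev_output_tokens = input_ids.clone()
    prev_output_tokens.masked_fill_(prev_output_tokens == -100, pad_token_id)
    index_of_eos = (prev_output_tokens.ne(pad_token_id).sum(dim=1) - 1).unsqueeze(-1)
    decoder_start_tokens = prev_output_tokens.gather(1, index_of_eos).squeeze()
    prev_output_tokens[:, 1:] = prev_output_tokens[:, :-1].clone()
    prev_output_tokens[:, 0] = decoder_start_tokens
    return prev_output_tokens

With MBart50Tokenizer labels the last non-pad token is </s>, so that is what ends up at position 0.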

Additional question: Why should the [eos] token predict the [lang_id]? This happens in both mbart and mbart50. If not, should the last token in the labels be -100? If yes, there would be a subsequent issue, since the labels matrix from the tokenizer seems to be using 1 as the padding token instead of -100. Do let me know if I should open a separate issue for that!

If this bug seems legitimate, I would be glad to provide a fix! I believe the labels key from MBart50Tokenizer would have to be updated to give the same output as the MBartTokenizer.

devaansh100 avatar Oct 11 '22 21:10 devaansh100

@ArthurZucker, when you have bandwidth, would you like to take a look at this?

LysandreJik avatar Oct 12 '22 21:10 LysandreJik

Not stale, still looking forward to a response!

devaansh100 avatar Nov 11 '22 17:11 devaansh100

Hey! This is very similar to #18133. First, I was not really able to reproduce the bug as the output of

tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50", src_lang="en_XX", tgt_lang="ro_RO")

src_text = " UN Chief Says There Is No Military Solution in Syria"
tgt_text = "Şeful ONU declară că nu există o soluţie militară în Siria"

model_inputs = tokenizer(src_text, text_target=tgt_text, return_tensors="pt").input_ids

gave:

tensor([[250004,   8274, 127873,  25916,      7,   8622,   2071,    438,  67485,
             53, 187895,     23,  51712,      2]])

But in any case, I think that you are right: the documentation for both models is not aligned, as the input is not shifted in the tokenizer but rather in the model. This was already mentioned, so we might as well address it!

ArthurZucker avatar Nov 17 '22 17:11 ArthurZucker

Hey! While the output of the tokenizer is correct (both input_ids and labels are in the same format), the labels are going to pass through shift_tokens_right to create the decoder_input_ids.

shift_tokens_right expects the LID token at the end; however, MBart50Tokenizer will give an EOS token, so the input to the decoder will end up being wrong.
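
A quick toy check (ids 100/101/102 are placeholders, 1 = <pad>, 2 = </s>, 250020 = ro_RO):

import torch
from transformers.models.mbart.modeling_mbart import shift_tokens_right

# labels in MBart50 format: [LID] tokens [eos]
labels = torch.tensor([[250020, 100, 101, 102, 2]])
print(shift_tokens_right(labels, pad_token_id=1))
# tensor([[     2, 250020,    100,    101,    102]]) -> [eos] [LID] tokens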

Regarding the reproduction of the issue - do you mean reproducing the referenced issue? If yes, they are using the MBartTokenizer, while the code mentioned here uses the MBart50Tokenizer.

devaansh100 avatar Nov 17 '22 17:11 devaansh100

I'll come back to this soon 😉

ArthurZucker avatar Dec 12 '22 15:12 ArthurZucker

It seems that this has been confusing a lot of people (including me, see #20610).

Let's work with your example:

  • src_text : en_XX UN Chief Says There Is No Military Solution in Syria</s>
  • labels : ro_RO Şeful ONU declară că nu există o soluţie militară în Siria</s>
  • shifted labels : </s> ro_RO Şeful ONU declară că nu există o soluţie militară în Siria (= decoder_input_ids)

We are interested in supervised training, where you feed the model input_ids and labels. For most encoder-decoder models, the labels are shifted to the right, so that the model learns to predict the next token (teacher forcing).

This means that if the decoder_input_ids are

[ 2, 250020,  47711,   7844, 127666,      8,  18347,  18147,   1362, 315,  42071,     36,  31563,   8454,  33796,    451,    346, 125577]

Then the model (if it is a perfect model) predicts

[250020,  47711,   7844, 127666,      8,  18347,  18147,   1362, 315,  42071,     36,  31563,   8454,  33796,    451,    346, 125577, 2]

Which is then compared to the labels to compute the loss. That is also why, when you generate (inference), you force the beginning token to be </s>.

ArthurZucker avatar Feb 03 '23 14:02 ArthurZucker

Thanks for the clarification, @ArthurZucker! It still seems a bit wrong to expect the model to predict ro_RO given </s>. The comment "wrap the last non pad token (the <LID> token)" in the code is also somewhat confusing!

LoicGrobol avatar Feb 11 '23 23:02 LoicGrobol

I agree with @LoicGrobol.

I also want to clarify this example from the docs:

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

article_hi = "संयुक्त राष्ट्र के प्रमुख का कहना है कि सीरिया में कोई सैन्य समाधान नहीं है"
article_ar = "الأمين العام للأمم المتحدة يقول إنه لا يوجد حل عسكري في سوريا."

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

# translate Hindi to French
tokenizer.src_lang = "hi_IN"
encoded_hi = tokenizer(article_hi, return_tensors="pt")
generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"])
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "Le chef de l 'ONU affirme qu 'il n 'y a pas de solution militaire en Syria."

# translate Arabic to English
tokenizer.src_lang = "ar_AR"
encoded_ar = tokenizer(article_ar, return_tensors="pt")
generated_tokens = model.generate(**encoded_ar, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "The Secretary-General of the United Nations says there is no military solution in Syria."

That is also why when you generate (inference) you force the beginning token with </s>.

Is this the same forced_bos_token mentioned in the code? If yes, then should we force it to be the </s> token, rather than ro_RO as done in the code?

devaansh100 avatar Feb 12 '23 07:02 devaansh100

Okay, indeed @LoicGrobol, we should not compute the loss on all the forced decoder ids. This seems to apply to a few models, so I will open a PR to fix these and add some documentation to properly explain all of this.

@devaansh100, no: we have the bos_token_id (</s>) and the forced_decoder_ids (<LID>), which should ensure that we start with 2 tokens.
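
Concretely, for mBART-50 these end up being the decoder_start_token_id and the forced_bos_token_id. A quick check, assuming the mbart-large-50-many-to-many-mmt checkpoint (the expected values in the comments are from memory, worth double-checking):

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

# the token generation starts from (</s>)
print(model.config.decoder_start_token_id)   # expected: 2
# the language code that is passed as forced_bos_token_id at generation time
print(tokenizer.lang_code_to_id["ro_RO"])    # expected: 250020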

Thanks both of you for your feedback.

ArthurZucker avatar Feb 20 '23 16:02 ArthurZucker

Note that currently this training seems to be needed to work well with generate in e.g. mBART, which uses </s> LANGID as its forced prompt (loss on the LANGID could indeed be skipped though). I guess that would have to be changed too, but I don't know how to make that change work with existing pretrained models.

LoicGrobol avatar Feb 20 '23 16:02 LoicGrobol

Not sure I understand why we would need to change generate? We should not.

ArthurZucker avatar Feb 20 '23 17:02 ArthurZucker

The only logic that needs to be updated is computing the loss on the lang token. The rest of the training procedure etc. is still correct! It's just that we don't want the model to learn/update its distribution when predicting the second token, because it can be variable at training time but will always be fixed at inference.

ArthurZucker avatar Feb 20 '23 17:02 ArthurZucker

Got it! I guess the only change then would be in the "labels" from the tokenizer - where there was LANGID initially, we would have a 1/-100?
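
Something along these lines, I imagine (just a sketch with a made-up helper; the lang token is masked with -100, the default ignore_index of the cross-entropy loss, after decoder_input_ids have already been built from the unmasked labels):

import torch

def mask_lang_token_in_labels(labels: torch.Tensor, lang_token_id: int) -> torch.Tensor:
    # Sketch: replace the language-code token in each label row with -100
    # so that no loss is computed on it; all other positions are untouched.
    labels = labels.clone()
    labels[labels == lang_token_id] = -100
    return labels

# toy example, 250020 = ro_RO
labels = torch.tensor([[250020, 100, 101, 102, 2]])
print(mask_lang_token_in_labels(labels, lang_token_id=250020))
# tensor([[-100,  100,  101,  102,    2]])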

devaansh100 avatar Feb 20 '23 17:02 devaansh100

Yep! That should be it 😉

ArthurZucker avatar Feb 20 '23 17:02 ArthurZucker

Thank you for all the help!

devaansh100 avatar Feb 20 '23 17:02 devaansh100

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Apr 13 '23 15:04 github-actions[bot]