Deberta MaskedLM Corrections
What does this PR do?
The current implementations of DebertaForMaskedLM and DebertaV2ForMaskedLM do not load all of the weights from the checkpoints. After consulting the original repo, I modified the MaskedLM classes to load the weights correctly so they can be used for the fill-mask task out of the box (for v1 and v2; v3 wasn't trained for that).
I didn't know what to implement for `get_output_embeddings` and `set_output_embeddings`.
TODO:
- [ ] Implement `get_output_embeddings`
- [ ] Implement `set_output_embeddings`
- [ ] Implement `resize_token_embeddings`
Fixes: https://github.com/huggingface/transformers/issues/15216, https://github.com/huggingface/transformers/issues/15673, https://github.com/huggingface/transformers/issues/16456, https://github.com/huggingface/transformers/issues/18659
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [x] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Who can review?
@LysandreJik @sgugger
I'm sorry this took so long.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Thanks for working on this, @nbroad1881! This is a good improvement, but it will unfortunately break all existing models that have a head named `cls`. I'm trying to see if there is a non-backward-breaking approach that would enable loading the existing model head; it'll likely mean updating the weights in the repo rather than updating the code here.
I wonder what would be the most breaking. It would be better to have a non breaking approach, but I'm not entirely sure we can get away with it.
> I wonder what would be the most breaking. It would be better to have a non breaking approach, but I'm not entirely sure we can get away with it.
Could there be two versions? Anytime `AutoModelForMaskedLM` gets called in the future, it would default to the new implementation but also check the `config.json` file or state dict to see if the checkpoint uses the old implementation (rough sketch after the list below).
Scenarios
- `AutoModelForMaskedLM.from_pretrained`/`from_config`(canonical repo) --> use new implementation
- `AutoModelForMaskedLM.from_pretrained`/`from_config`(custom repo/local path) --> check `config.json`/state dict to decide between the new/old implementation
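To make the idea concrete, here is a rough, self-contained sketch of that kind of dispatch. The `legacy_mlm_head` flag, the head classes, and `build_mlm_head` are all hypothetical names for illustration, not the actual transformers API:

```python
import torch
import torch.nn as nn
from types import SimpleNamespace


class OldMaskedLMHead(nn.Module):
    """Current head layout: a single hidden_size -> vocab_size decoder."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.decoder = nn.Linear(hidden_size, vocab_size)


class NewMaskedLMHead(nn.Module):
    """Official-repo layout: hidden_size -> hidden_size projection; logits
    come from a matmul with the word embeddings plus a vocab-sized bias."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.bias = nn.Parameter(torch.zeros(vocab_size))


def build_mlm_head(config):
    # Hypothetical dispatch: configs that don't set the flag keep the old head,
    # so existing checkpoints keep loading; new/canonical configs opt in.
    if getattr(config, "legacy_mlm_head", True):
        return OldMaskedLMHead(config.hidden_size, config.vocab_size)
    return NewMaskedLMHead(config.hidden_size, config.vocab_size)


# Example: a canonical-style config opting in to the official-repo head.
head = build_mlm_head(SimpleNamespace(hidden_size=768, vocab_size=128100, legacy_mlm_head=False))
print(type(head).__name__)  # NewMaskedLMHead
```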
One other question:
What should the `get_output_embeddings` function do? BERT's implementation makes it look like it just returns the linear layer (the decoder) that maps output embeddings to token logits. This layer is slightly different for DeBERTa: instead of `Linear(hidden_size, vocab_size)`, it goes `Linear(hidden_size, hidden_size)`, and then there is another step where the output of that is multiplied by the word embeddings.
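For reference, a minimal, self-contained sketch of the BERT-style contract the question refers to; the class and attribute names here are illustrative, not the actual transformers code:

```python
import torch.nn as nn


class BertStyleLMHead(nn.Module):
    """Illustrative BERT-style head: one hidden_size -> vocab_size decoder."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.decoder = nn.Linear(hidden_size, vocab_size, bias=False)


class BertStyleForMaskedLM(nn.Module):
    def __init__(self, hidden_size: int = 768, vocab_size: int = 30522):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.lm_head = BertStyleLMHead(hidden_size, vocab_size)

    # The contract: hand back the vocab-sized projection so weight tying and
    # resize_token_embeddings know which layer to read or replace.
    def get_output_embeddings(self) -> nn.Linear:
        return self.lm_head.decoder

    def set_output_embeddings(self, new_embeddings: nn.Linear) -> None:
        self.lm_head.decoder = new_embeddings
```

For DeBERTa, the `Linear(hidden_size, hidden_size)` layer is not vocab-sized, which is exactly why it's unclear what this contract should return.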
@sgugger, do you have an opinion on this?
I am not sure I fully understand the problem here. It looks like the canonical repos have weights that mismatch our code. If this is the case, those weights should be updated to match the code in Transformers, not the opposite, to avoid breaking all other checkpoints.
It's not just a naming issue. The current code uses a different mechanism to make masked LM predictions.
Current way:
`hidden_states * linear layer -> logits for each token`
`[batch_size, sequence_length, hidden_size] * [hidden_size, vocab_size] -> [batch_size, sequence_length, vocab_size]`
The way it is done in the official DeBERTa repo:
`hidden_states * linear layer * word_embeddings.T -> logits for each token`
`[batch_size, sequence_length, hidden_size] * [hidden_size, hidden_size] * [hidden_size, vocab_size] -> [batch_size, sequence_length, vocab_size]`
I skipped some operations that don't change the size of the tensors, but I think this proves my point.
If it is done the second way, then the fill-mask pipeline will work (for DeBERTa v1 and v2) from the canonical weights.
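Here is a small, self-contained torch sketch of the two shape flows; it skips the intermediate operations mentioned above, and the variable names are illustrative:

```python
import torch
import torch.nn as nn

batch_size, seq_len, hidden_size, vocab_size = 2, 8, 16, 100
hidden_states = torch.randn(batch_size, seq_len, hidden_size)
word_embeddings = nn.Embedding(vocab_size, hidden_size)

# Current transformers head: one hidden_size -> vocab_size projection.
current_head = nn.Linear(hidden_size, vocab_size)
logits_current = current_head(hidden_states)  # [2, 8, 100]

# Official-repo style: hidden_size -> hidden_size projection, then a matmul
# with the transposed word-embedding matrix (plus a vocab-sized bias).
dense = nn.Linear(hidden_size, hidden_size)
bias = torch.zeros(vocab_size)
logits_official = dense(hidden_states) @ word_embeddings.weight.T + bias  # [2, 8, 100]

print(logits_current.shape, logits_official.shape)  # both torch.Size([2, 8, 100])
```

Both produce vocab-sized logits, but the parameters stored in the checkpoint are laid out differently, which is why the canonical weights don't load into the current class.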
Thanks for explaining, @nbroad1881, I now understand the problem a little bit better. I don't think we can avoid having two classes for masked LM (for instance `OldDebertaForMaskedLM` and `NewDebertaForMaskedLM`) along with a `DebertaForMaskedLM` that dispatches to the proper one depending on the config, to be able to maintain backward compatibility.
If you want to amend the PR to write a new class for masked LM for now with the proper changes (leaving the current masked LM class as is), I can then follow up with the rest and write this in a fully backward-compatible manner.
@sgugger, that sounds good to me. Do you know what I should put for the `get_output_embeddings` and `set_output_embeddings` functions?
It needs to be the weights/bias that have the `vocab_size` dim.
> It needs to be the weights/bias that have the `vocab_size` dim.
There are weights that are `[hidden_size, hidden_size]` and a bias of shape `[vocab_size]`. Which one do I use?
Leave those two as None for now, then. I'll add that in the follow-up PR.
@sgugger, I implemented both the Old and New Deberta(V1/V2)ForMaskedLM classes, and I'm wondering which should be used for `AutoModelForMaskedLM`. Since the other version doesn't have an associated Auto class, it will fail some tests.
The classes `OldDebertaForMaskedLM` and `NewDebertaForMaskedLM` are not meant to be public. They are an internal artifact to maintain backward compatibility; the user will only use the `DebertaForMaskedLM` class, and a config parameter will internally decide which of the two should be used.
For this PR, you should just add the `NewDebertaForMaskedLM` without any change to the doc/auto classes, and don't touch the current `DebertaForMaskedLM`.
Ah ok. Got it. Thanks
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@nbroad1881 Do you want me to fully take over on this?
@sgugger, I made the changes and then made the mistake of diving too deeply into checking whether the EMD (Enhanced Mask Decoder) is correctly implemented. I don't think it is, but I'll leave that for someone else or another time. Let me push the changes, and I'll ping you when I do.
Thanks for following up 🤗
On second thought, you should just take it over @sgugger. Let me know if you have questions
Ok, will have a look early next week!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
unstale, still planning to address this!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@ArthurZucker has taken over this as part of his refactor of the DeBERTa model.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.