Deberta MaskedLM Corrections
What does this PR do?
The current implementations of DebertaForMaskedLM and DebertaV2ForMaskedLM do not load all of the weights from the checkpoints. After consulting the original repo, I modified the MaskedLM classes to load the weights correctly so they can be used for the fill-mask task out of the box (for v1 and v2; v3 wasn't trained for that).
I didn't know what to implement for `get_output_embeddings` and `set_output_embeddings`.
TODO:
- [ ] Implement `get_output_embeddings`
- [ ] Implement `set_output_embeddings`
- [ ] Implement `resize_token_embeddings`
Fixes: https://github.com/huggingface/transformers/issues/15216, https://github.com/huggingface/transformers/issues/15673, https://github.com/huggingface/transformers/issues/16456, https://github.com/huggingface/transformers/issues/18659
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [x] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Who can review?
@LysandreJik @sgugger
I'm sorry this took so long.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Thanks for working on this, @nbroad1881! This is a good improvement, but it will unfortunately break all existing models that have a head named `cls`. I'm trying to see if there is a non-backward-breaking approach that would enable loading the existing model head; it'll likely mean updating the weights in the repo rather than updating the code here.
I wonder what would be the most breaking. It would be better to have a non breaking approach, but I'm not entirely sure we can get away with it.
> I wonder what would be the most breaking. It would be better to have a non breaking approach, but I'm not entirely sure we can get away with it.
Could there be two versions? Anytime `AutoModelForMaskedLM` gets called in the future, it would default to the new implementation but also check the `config.json` file or state dict to see if the checkpoint uses the old implementation (rough sketch after the list below).
Scenarios
- `AutoModelForMaskedLM.from_pretrained`/`from_config`(canonical repo) --> use new implementation
- `AutoModelForMaskedLM.from_pretrained`/`from_config`(custom repo/local path) --> check `config.json`/state dict to decide between the new/old implementation
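To make the idea concrete, here is a rough, self-contained sketch of that kind of dispatch. The `legacy_mlm_head` flag, the head classes, and `build_mlm_head` are all hypothetical names for illustration, not the actual transformers API:

```python
import torch
import torch.nn as nn
from types import SimpleNamespace


class OldMaskedLMHead(nn.Module):
    """Current head layout: a single hidden_size -> vocab_size decoder."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.decoder = nn.Linear(hidden_size, vocab_size)


class NewMaskedLMHead(nn.Module):
    """Official-repo layout: hidden_size -> hidden_size projection; logits
    come from a matmul with the word embeddings plus a vocab-sized bias."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.bias = nn.Parameter(torch.zeros(vocab_size))


def build_mlm_head(config):
    # Hypothetical dispatch: configs that don't set the flag keep the old head,
    # so existing checkpoints keep loading; new/canonical configs opt in.
    if getattr(config, "legacy_mlm_head", True):
        return OldMaskedLMHead(config.hidden_size, config.vocab_size)
    return NewMaskedLMHead(config.hidden_size, config.vocab_size)


# Example: a canonical-style config opting in to the official-repo head.
head = build_mlm_head(SimpleNamespace(hidden_size=768, vocab_size=128100, legacy_mlm_head=False))
print(type(head).__name__)  # NewMaskedLMHead
```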
One other question:
What should the `get_output_embeddings` function do? BERT's implementation makes it look like it just returns the linear layer (the decoder) that maps output embeddings to token logits. This layer is slightly different for DeBERTa: instead of `Linear(hidden_size, vocab_size)`, it goes `Linear(hidden_size, hidden_size)`, and then there is another step where the output of that is multiplied by the word embeddings.
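For reference, a minimal, self-contained sketch of the BERT-style contract the question refers to; the class and attribute names here are illustrative, not the actual transformers code:

```python
import torch.nn as nn


class BertStyleLMHead(nn.Module):
    """Illustrative BERT-style head: one hidden_size -> vocab_size decoder."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.decoder = nn.Linear(hidden_size, vocab_size, bias=False)


class BertStyleForMaskedLM(nn.Module):
    def __init__(self, hidden_size: int = 768, vocab_size: int = 30522):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.lm_head = BertStyleLMHead(hidden_size, vocab_size)

    # The contract: hand back the vocab-sized projection so weight tying and
    # resize_token_embeddings know which layer to read or replace.
    def get_output_embeddings(self) -> nn.Linear:
        return self.lm_head.decoder

    def set_output_embeddings(self, new_embeddings: nn.Linear) -> None:
        self.lm_head.decoder = new_embeddings
```

For DeBERTa, the `Linear(hidden_size, hidden_size)` layer is not vocab-sized, which is exactly why it's unclear what this contract should return.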
@sgugger, do you have an opinion on this?
I am not sure I fully understand the problem here. It looks like the canonical repos have weights that mismatch our code. If this is the case, those weights should be updated to match the code in Transformers, not the opposite, to avoid breaking all other checkpoints.
It's not just a naming issue. The current code uses a different mechanism to make masked LM predictions.
Current way:
`hidden_states * linear layer -> logits for each token`
`[batch_size, sequence_length, hidden_size] * [hidden_size, vocab_size] -> [batch_size, sequence_length, vocab_size]`
The way it is done in the official DeBERTa repo:
`hidden_states * linear layer * word_embeddings.T -> logits for each token`
`[batch_size, sequence_length, hidden_size] * [hidden_size, hidden_size] * [hidden_size, vocab_size] -> [batch_size, sequence_length, vocab_size]`
I skipped some operations that don't change the size of the tensors, but I think this proves my point.
If it is done the second way, then the fill-mask pipeline will work (for DeBERTa v1 and v2) from the canonical weights.
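Here is a small, self-contained torch sketch of the two shape flows; it skips the intermediate operations mentioned above, and the variable names are illustrative:

```python
import torch
import torch.nn as nn

batch_size, seq_len, hidden_size, vocab_size = 2, 8, 16, 100
hidden_states = torch.randn(batch_size, seq_len, hidden_size)
word_embeddings = nn.Embedding(vocab_size, hidden_size)

# Current transformers head: one hidden_size -> vocab_size projection.
current_head = nn.Linear(hidden_size, vocab_size)
logits_current = current_head(hidden_states)  # [2, 8, 100]

# Official-repo style: hidden_size -> hidden_size projection, then a matmul
# with the transposed word-embedding matrix (plus a vocab-sized bias).
dense = nn.Linear(hidden_size, hidden_size)
bias = torch.zeros(vocab_size)
logits_official = dense(hidden_states) @ word_embeddings.weight.T + bias  # [2, 8, 100]

print(logits_current.shape, logits_official.shape)  # both torch.Size([2, 8, 100])
```

Both produce vocab-sized logits, but the parameters stored in the checkpoint are laid out differently, which is why the canonical weights don't load into the current class.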
Thanks for explaining, @nbroad1881, I now understand the problem a little bit better. I don't think we can avoid having two classes for masked LM (for instance `OldDebertaForMaskedLM` and `NewDebertaForMaskedLM`) along with a `DebertaForMaskedLM` that dispatches to the proper one depending on the config, to be able to maintain backward compatibility.
If you want to amend the PR to write a new class for masked LM for now with the proper changes (leaving the current masked LM class as is), I can then follow up with the rest and write this in a fully backward-compatible manner.
@sgugger, that sounds good to me. Do you know what I should put for the `get_output_embeddings` and `set_output_embeddings` functions?
It needs to be the weights/bias that have the `vocab_size` dim.
> It needs to be the weights/bias that have the `vocab_size` dim.
There are weights that are `[hidden_size, hidden_size]` and a bias of shape `[vocab_size]`. Which one do I use?
Leave those two as None for now, then. I'll add that in the follow-up PR.
@sgugger, I implemented both the Old and New Deberta(V1/V2)ForMaskedLM classes, and I'm wondering which should be used for `AutoModelForMaskedLM`. Since the other version doesn't have an associated Auto class, it will fail some tests.
The classes `OldDebertaForMaskedLM` and `NewDebertaForMaskedLM` are not meant to be public. They are an internal artifact to maintain backward compatibility; the user will only use the `DebertaForMaskedLM` class, and a config parameter will internally decide which of the two should be used.
For this PR, you should just add the `NewDebertaForMaskedLM` without any change to the doc/auto classes, and don't touch the current `DebertaForMaskedLM`.
Ah ok. Got it. Thanks
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@nbroad1881 Do you want me to fully take over on this?
@sgugger, I made the changes and then made the mistake of diving too deeply into checking whether the EMD (Enhanced Mask Decoder) is correctly implemented. I don't think it is, but I'll leave that for someone else or another time. Let me push the changes, and I'll ping you when I do.
Thanks for following up 🤗
On second thought, you should just take it over @sgugger. Let me know if you have questions
Ok, will have a look early next week!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
unstale, still planning to address this!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@ArthurZucker has taken over this as part of his refactor of the DeBERTa model.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.