Add position_ids to the forward pass of the OPT model
What does this PR do?
This pull request adds position_ids to the forward pass of OPT, in a similar fashion to Gemma and Llama. #32937
Some models didn't have an argument for position_ids in their forward pass.
There are two main reasons we would like all LM models to accept position_ids:
- To keep the API consistent across all models.
- position_ids are very important if you want to use flash-attention without padding during training. If I want to pack two or more sentences into the same sequence, I need the model to handle them accordingly and treat each sentence as its own independent sentence. The flash-attention code uses position_ids to check whether sequences are packed and, if so, runs an appropriate function to prevent cross-example contamination; without this argument, the model can't use that feature. The code always checks whether position_ids is not None (a small illustration follows the link below):
https://github.com/huggingface/transformers/blob/v4.44.1/src/transformers/modeling_flash_attention_utils.py#L270
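Here is a small illustration (my own example, not code from this PR) of what packed position_ids look like, and why a plain arange would hide the sentence boundaries:

```python
# Two sentences of lengths 3 and 4 packed into one sequence of length 7.
# position_ids restart at 0 for each sentence, which is the signal the
# flash-attention utilities use to detect packing and avoid cross-example attention.
import torch

input_ids = torch.tensor([[11, 12, 13, 21, 22, 23, 24]])  # packed tokens from two sentences
position_ids = torch.tensor([[0, 1, 2, 0, 1, 2, 3]])      # restarts at 0 for the second sentence

# Without explicit position_ids, positions would be a plain arange and the two
# sentences would be treated as one long sequence:
naive_position_ids = torch.arange(input_ids.shape[1]).unsqueeze(0)  # [[0, 1, 2, 3, 4, 5, 6]]
```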
This PR handles only OPT, so I can start small and get some feedback.
Changes:
- Changed the OPTLearnedPositionalEmbedding forward method to take position_ids instead of attention_mask.
- Changed all forward passes in the file to take position_ids and pass them through.
- Updated prepare_inputs_for_generation to also produce position_ids.
- If position_ids is None, create position_ids from the attention_mask (similar to the original version, so behavior is unchanged when position_ids is not given); a rough sketch of this fallback is right after this list.
- Updated the relevant docs.
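For reference, a rough sketch of the fallback I mean (my own simplification of what the old OPTLearnedPositionalEmbedding.forward computed from the attention mask; the actual code in the PR may differ slightly):

```python
import torch

def make_position_ids(attention_mask: torch.Tensor, past_key_values_length: int = 0) -> torch.Tensor:
    """Derive position_ids from an attention mask, leaving padding positions at -1."""
    mask = attention_mask.long()
    # cumsum gives 1-based positions for real tokens; multiplying by the mask and
    # subtracting 1 makes real tokens 0-based and padding tokens -1
    position_ids = torch.cumsum(mask, dim=1) * mask - 1
    return position_ids[:, past_key_values_length:]

attention_mask = torch.tensor([[0, 0, 1, 1, 1]])  # left-padded example
print(make_position_ids(attention_mask))          # tensor([[-1, -1,  0,  1,  2]])
```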
A few notes:
- Because of the offset used in OPTLearnedPositionalEmbedding, this model needs to represent the position of its padding tokens as -1. I think this is fine and keeps compatibility with the existing weights (a tiny illustration follows this list).
- BioGPT copies OPT's positional embedding class via a "Copied from" comment. I had to remove that comment to disable the copy check (since the class no longer matches); this is needed for make fixup.
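A tiny illustration of the -1 convention (simplified, my own example): the embedding adds a fixed offset before the lookup, so, as far as I can tell, padding positions of -1 land on the same embedding index the old attention-mask-based code produced.

```python
import torch

offset = 2                                        # OPT's learned positional embedding offset
position_ids = torch.tensor([[-1, -1, 0, 1, 2]])  # -1 marks the padding positions
embedding_indices = position_ids + offset         # tensor([[1, 1, 2, 3, 4]])
```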
feature-request #32937
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [x] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
@ArthurZucker would love some feedback.
I tried to keep the API as similar as possible to Llama and Gemma; they both put position_ids third. Isn't it somewhat of a problem for AutoModel to have this argument at different positions across models? It means you can't load a model with AutoModel and call it without keyword arguments.
Thanks for the feedback!
About the forward call for the embedding layer: I think it has to take position_ids as an argument, otherwise it will not work with packed sentences.
@ArthurZucker I am thinking that maybe the best solution for the embedding layer is to add position_ids as an argument to its forward pass with a default of None. This is probably backward compatible, and it will still help with packed sentences. The downside is that the code probably won't be very clean.
@gante, thanks! Happy to contribute.
I would love some guidance on the two open comments: what should I do with position_ids in the embedding module? In my opinion it should be able to take position_ids so it works with packed sentences. Maybe a last argument with a default of None and a check? A rough sketch of what I mean follows below.
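Something like this is what I have in mind (an illustrative sketch only, not the final code; the class name is made up):

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbeddingSketch(nn.Embedding):
    """Sketch of a backward-compatible OPT-style positional embedding."""

    def __init__(self, num_embeddings: int, embedding_dim: int):
        # OPT offsets the embedding ids by 2, so the table is allocated with extra room
        self.offset = 2
        super().__init__(num_embeddings + self.offset, embedding_dim)

    def forward(self, attention_mask, past_key_values_length=0, position_ids=None):
        if position_ids is None:
            # old behavior: derive positions from the attention mask (padding -> -1)
            mask = attention_mask.long()
            position_ids = (torch.cumsum(mask, dim=1) * mask - 1)[:, past_key_values_length:]
        # new behavior: callers that pack sentences can pass position_ids directly
        return super().forward(position_ids + self.offset)
```

The check keeps old callers working while letting packed-sequence users pass position_ids explicitly.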
And about the one-liners, I would love some guidance on that as well.
@ArthurZucker I would love some guidance here so I can finish and move on to other models.
@ArthurZucker I made the changes you suggested and refactored the embedding class to be backward compatible. Would love some feedback.