Add the transformer layer class to wrap for FSDP
What does this PR do?
This PR defines the fsdp_transformer_layer_cls_to_wrap value in the Mistral config. This way, users can easily load the config to figure out which value to use for FSDP, e.g.
```python
from transformers import AutoConfig

c = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
c.fsdp_transformer_layer_cls_to_wrap
# [out]: MistralDecoderLayer
```
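For context, here is a minimal sketch (not part of this PR) of how the looked-up class name could then be fed into the Trainer's FSDP setup. The `fsdp_config` key name is an assumption and has changed between transformers versions, so treat it as illustrative only:

```python
from transformers import AutoConfig, TrainingArguments

# Read the wrapping hint straight from the hub config; no model weights are loaded.
config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
layer_cls = config.fsdp_transformer_layer_cls_to_wrap  # e.g. "MistralDecoderLayer"

# Hypothetical usage: forward the class name to the Trainer's FSDP config.
# The key name ("fsdp_transformer_layer_cls_to_wrap" vs. "transformer_layer_cls_to_wrap")
# differs across transformers versions.
args = TrainingArguments(
    output_dir="out",
    fsdp="full_shard auto_wrap",
    fsdp_config={"fsdp_transformer_layer_cls_to_wrap": layer_cls},
)
```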
Context: Users have been asking which layers to wrap; there shouldn't be a need to load the model and dig through the state_dict or model summary to figure it out.
Fixes: https://discuss.huggingface.co/t/accelerate-fsdp-config-prompts/21262/3
Currently, this information is also available through https://github.com/huggingface/transformers/blob/main/src/transformers/models/mistral/modeling_mistral.py#L809
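Outside the Trainer, the same class is what a plain PyTorch FSDP auto-wrap policy expects. A rough sketch, assuming the process group has already been initialized (e.g. launched via torchrun):

```python
import functools

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModelForCausalLM
from transformers.models.mistral.modeling_mistral import MistralDecoderLayer

# Wrap every MistralDecoderLayer instance in its own FSDP unit.
policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={MistralDecoderLayer},
)

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
# Requires torch.distributed to be initialized (e.g. via torchrun).
fsdp_model = FSDP(model, auto_wrap_policy=policy)
```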
Who can review?
Models:
- text models: @ArthurZucker and @younesbelkada
Integrations:
- deepspeed: HF Trainer/Accelerate: @pacman100
Documentation: @stevhliu and @MKhalusova