Set FSDP `transformer_layer_cls_to_wrap` to `model._no_split_modules` ?
Feature request
Currently, when training with FSDP, the Trainer expects to receive an `fsdp_config` argument specifying `fsdp_transformer_layer_cls_to_wrap`.
https://github.com/huggingface/transformers/blob/66954ea25e342fd451c26ec1c295da0b8692086b/src/transformers/trainer.py#L1394-L1406
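For reference, this is roughly how it has to be passed today (a sketch; the exact argument and key names may vary slightly across versions):

```python
# Sketch of the current, model-specific configuration (key names may vary by version).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    fsdp="full_shard auto_wrap",
    # Model-specific: must name the decoder block class of the model being trained.
    fsdp_config={"fsdp_transformer_layer_cls_to_wrap": "OPTDecoderLayer"},
)
```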
I am wondering if we can set this automatically when the model has a `_no_split_modules` attribute, e.g.
https://github.com/huggingface/transformers/blob/66954ea25e342fd451c26ec1c295da0b8692086b/src/transformers/models/opt/modeling_opt.py#L401
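Roughly what I have in mind (purely a hypothetical sketch, not actual Trainer code; the helper name is made up):

```python
# Hypothetical fallback logic: prefer the user's setting, otherwise use the
# model's _no_split_modules attribute (e.g. ["OPTDecoderLayer"] for OPT).
def resolve_layer_cls_to_wrap(model, fsdp_config):
    user_value = fsdp_config.get("fsdp_transformer_layer_cls_to_wrap")
    if user_value is not None:
        return [user_value] if isinstance(user_value, str) else list(user_value)
    return list(getattr(model, "_no_split_modules", None) or [])
```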
Motivation
It would be a convenient feature to set this automatically. This argument is model-specific, but it might be nice to define training arguments independently of a specific model type.
Your contribution
Happy to help make a PR. Would be great if you can confirm whether this would be desirable or if I am misunderstanding something. Thanks!
cc @pacman100
Any thoughts about this? Maybe also cc @stas00?
Unfortunately I don't have experience with FSDP to contribute to this discussion.
@pacman100 Friendly ping
Hello @apoorvkh, the code path you highlighted is now used only with FSDP+XLA. For general FSDP, everything is handled internally by Accelerate. It happens here: https://github.com/huggingface/transformers/blob/66954ea25e342fd451c26ec1c295da0b8692086b/src/transformers/training_args.py#L1533-L1556
`fsdp_transformer_layer_cls_to_wrap` supports specifying multiple modules, but most of the time it is enough to specify the `_no_split_modules`. So, we can use `_no_split_modules` as a default in case the user doesn't specify it when passing `--fsdp full_shard auto_wrap`.
PRs https://github.com/huggingface/accelerate/pull/1753 and https://github.com/huggingface/transformers/pull/24980 should add this capability: they fall back to `model._no_split_modules` if `fsdp_transformer_layer_cls_to_wrap` isn't specified. Can you try it out?
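(With those PRs, something like the following should be enough; no wrap class needs to be spelled out, since it can fall back to `model._no_split_modules`.)

```python
# Sketch: after the linked PRs, the model-specific key can be omitted.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    fsdp="full_shard auto_wrap",  # no fsdp_transformer_layer_cls_to_wrap needed
)
```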
Very cool, thanks a ton! I will try it out and let you know.
Just circling back, works on my end -- thanks again!
@pacman100 I want to better understand the mechanism of FSDP's wrapping.
Do you know why `transformer_layer_cls_to_wrap` can be automatically assigned to `_no_split_modules` by default?
My understanding of the latter comes from this post:
> Actually using this device map later on won't work, because the layers composing this model have residual connections (where the input of the block is added to the output of the block) so all of a given layer should be on the same device. We can indicate this to Accelerate by passing a list of module names that shouldn't be split with the `no_split_module_classes` keyword argument:
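That kind of usage looks roughly like the sketch below (the OPT checkpoint is my own illustrative choice, not from the post):

```python
# Sketch of _no_split_modules being used for big-model inference via Accelerate.
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("facebook/opt-1.3b")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Each class listed here is kept on a single device, so residual connections
# inside a decoder block are never split across devices.
device_map = infer_auto_device_map(
    model, no_split_module_classes=model._no_split_modules  # ["OPTDecoderLayer"]
)
```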
I understand this means that the module should not be split across devices during the forward pass. However, I don't see the connection with `transformer_layer_cls_to_wrap`, which seems to be a way to indicate which class should be wrapped by FSDP (based on my limited understanding of FSDP). Is there a connection between those two variables, or is this simply a quick way to find the name of the transformer layers (since they usually follow the `{model_name}DecoderLayer` naming convention, but not always consistently)?
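For reference, here is roughly what I understand `transformer_layer_cls_to_wrap` translates to on the raw PyTorch side (the class name is just for illustration, and my understanding may be off):

```python
# Sketch: the named class becomes the unit that FSDP's auto-wrap policy shards.
import functools

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.opt.modeling_opt import OPTDecoderLayer

auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={OPTDecoderLayer},  # each decoder block -> one FSDP unit
)
# Requires an initialized process group:
# fsdp_model = FSDP(model, auto_wrap_policy=auto_wrap_policy)
```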