Set FSDP `transformer_layer_cls_to_wrap` to `model._no_split_modules` ?
Feature request
Currently, when training with FSDP, the Trainer expects to receive an `fsdp_config` argument specifying `fsdp_transformer_layer_cls_to_wrap`.
https://github.com/huggingface/transformers/blob/66954ea25e342fd451c26ec1c295da0b8692086b/src/transformers/trainer.py#L1394-L1406
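For reference, this is roughly how it has to be passed today (a sketch; the exact argument and key names may vary slightly across versions):

```python
# Sketch of the current, model-specific configuration (key names may vary by version).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    fsdp="full_shard auto_wrap",
    # Model-specific: must name the decoder block class of the model being trained.
    fsdp_config={"fsdp_transformer_layer_cls_to_wrap": "OPTDecoderLayer"},
)
```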
I am wondering if we can set this automatically when the model has a `_no_split_modules` attribute, e.g.
https://github.com/huggingface/transformers/blob/66954ea25e342fd451c26ec1c295da0b8692086b/src/transformers/models/opt/modeling_opt.py#L401
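Roughly what I have in mind (purely a hypothetical sketch, not actual Trainer code; the helper name is made up):

```python
# Hypothetical fallback logic: prefer the user's setting, otherwise use the
# model's _no_split_modules attribute (e.g. ["OPTDecoderLayer"] for OPT).
def resolve_layer_cls_to_wrap(model, fsdp_config):
    user_value = fsdp_config.get("fsdp_transformer_layer_cls_to_wrap")
    if user_value is not None:
        return [user_value] if isinstance(user_value, str) else list(user_value)
    return list(getattr(model, "_no_split_modules", None) or [])
```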
Motivation
It would be a convenient feature to set this automatically. This argument is model-specific, but it might be nice to define training arguments independently of a specific model type.
Your contribution
Happy to help make a PR. Would be great if you can confirm whether this would be desirable or if I am misunderstanding something. Thanks!
cc @pacman100
Any thoughts about this? Maybe also cc @stas00?
Unfortunately I don't have experience with FSDP to contribute to this discussion.
@pacman100 Friendly ping
Hello @apoorvkh, the code path you highlighted is now used only with FSDP+XLA. For general FSDP, everything is handled internally by Accelerate. It happens here: https://github.com/huggingface/transformers/blob/66954ea25e342fd451c26ec1c295da0b8692086b/src/transformers/training_args.py#L1533-L1556
`fsdp_transformer_layer_cls_to_wrap` supports specifying multiple modules, but most of the time it is enough to specify the `_no_split_modules`. So, we can use `_no_split_modules` as a default in case the user doesn't specify it when passing `--fsdp full_shard auto_wrap`.
PRs https://github.com/huggingface/accelerate/pull/1753 and https://github.com/huggingface/transformers/pull/24980 should add this capability: they fall back to `model._no_split_modules` if `fsdp_transformer_layer_cls_to_wrap` isn't specified. Can you try it out?
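(With those PRs, something like the following should be enough; no wrap class needs to be spelled out, since it can fall back to `model._no_split_modules`.)

```python
# Sketch: after the linked PRs, the model-specific key can be omitted.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    fsdp="full_shard auto_wrap",  # no fsdp_transformer_layer_cls_to_wrap needed
)
```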
Very cool, thanks a ton! I will try it out and let you know.
Just circling back, works on my end -- thanks again!
@pacman100 I want to better understand the mechanism of FSDP's wrapping.
Do you know why `transformer_layer_cls_to_wrap` can be automatically assigned to `_no_split_modules` by default?
My understanding of the latter comes from this post:
> Actually using this device map later on won't work, because the layers composing this model have residual connections (where the input of the block is added to the output of the block) so all of a given layer should be on the same device. We can indicate this to Accelerate by passing a list of module names that shouldn't be split with the `no_split_module_classes` keyword argument:
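That kind of usage looks roughly like the sketch below (the OPT checkpoint is my own illustrative choice, not from the post):

```python
# Sketch of _no_split_modules being used for big-model inference via Accelerate.
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("facebook/opt-1.3b")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Each class listed here is kept on a single device, so residual connections
# inside a decoder block are never split across devices.
device_map = infer_auto_device_map(
    model, no_split_module_classes=model._no_split_modules  # ["OPTDecoderLayer"]
)
```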
I understand this means that the module should not be split across devices during the forward pass. However, I don't see the connection with `transformer_layer_cls_to_wrap`, which seems to be a way to indicate which class should be wrapped by FSDP (based on my limited understanding of FSDP). Is there a connection between those two variables, or is this simply a quick way to find the name of the transformer layers (since they usually follow the `{model_name}DecoderLayer` naming convention, but not always consistently)?
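For reference, here is roughly what I understand `transformer_layer_cls_to_wrap` translates to on the raw PyTorch side (the class name is just for illustration, and my understanding may be off):

```python
# Sketch: the named class becomes the unit that FSDP's auto-wrap policy shards.
import functools

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.opt.modeling_opt import OPTDecoderLayer

auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={OPTDecoderLayer},  # each decoder block -> one FSDP unit
)
# Requires an initialized process group:
# fsdp_model = FSDP(model, auto_wrap_policy=auto_wrap_policy)
```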