
Accelerate 0.30.0 Breaks FSDP QLoRA

Open mallorbc opened this issue 9 months ago • 7 comments

System Info

Below is the pip list output for an environment that does not work:

Package                  Version
------------------------ ---------------
accelerate               0.30.0
aiohttp                  3.9.5
aiosignal                1.3.1
annotated-types          0.6.0
async-timeout            4.0.3
attrs                    23.2.0
bitsandbytes             0.43.1
certifi                  2024.2.2
charset-normalizer       3.3.2
click                    8.1.7
datasets                 2.19.1
deepspeed                0.14.2+5f631abc
dill                     0.3.8
docker-pycreds           0.4.0
docstring_parser         0.16
einops                   0.8.0
eval_type_backport       0.2.0
exceptiongroup           1.2.1
filelock                 3.14.0
flash-attn               2.5.8
frozenlist               1.4.1
fsspec                   2024.3.1
gitdb                    4.0.11
GitPython                3.1.43
hf_transfer              0.1.6
hjson                    3.1.0
huggingface-hub          0.23.0
idna                     3.7
iniconfig                2.0.0
Jinja2                   3.1.4
markdown-it-py           3.0.0
MarkupSafe               2.1.5
mdurl                    0.1.2
mpmath                   1.3.0
multidict                6.0.5
multiprocess             0.70.16
networkx                 3.1
ninja                    1.11.1.1
numpy                    1.24.4
nvidia-cublas-cu12       12.1.3.1
nvidia-cuda-cupti-cu12   12.1.105
nvidia-cuda-nvrtc-cu12   12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12        8.9.2.26
nvidia-cufft-cu12        11.0.2.54
nvidia-curand-cu12       10.3.2.106
nvidia-cusolver-cu12     11.4.5.107
nvidia-cusparse-cu12     12.1.0.106
nvidia-nccl-cu12         2.20.5
nvidia-nvjitlink-cu12    12.4.127
nvidia-nvtx-cu12         12.1.105
packaging                24.0
pandas                   2.0.3
peft                     0.10.0
pillow                   10.3.0
pip                      24.0
platformdirs             4.2.1
pluggy                   1.5.0
protobuf                 3.20.1
psutil                   5.9.8
py-cpuinfo               9.0.0
pyarrow                  16.0.0
pyarrow-hotfix           0.6
pydantic                 2.7.1
pydantic_core            2.18.2
Pygments                 2.18.0
pynvml                   11.5.0
pytest                   8.2.0
python-dateutil          2.9.0.post0
pytz                     2024.1
PyYAML                   6.0.1
regex                    2024.5.10
requests                 2.31.0
rich                     13.7.1
safetensors              0.4.3
scipy                    1.10.1
sentencepiece            0.2.0
sentry-sdk               2.1.1
setproctitle             1.3.3
setuptools               69.5.1
shtab                    1.7.1
six                      1.16.0
smmap                    5.0.1
sympy                    1.12
text-generation          0.7.0
tokenizers               0.19.1
tomli                    2.0.1
torch                    2.3.0
torchaudio               2.3.0
torchvision              0.18.0
tqdm                     4.66.4
transformers             4.40.2
triton                   2.3.0
trl                      0.8.6
typing_extensions        4.11.0
tyro                     0.8.4
tzdata                   2024.1
urllib3                  2.2.1
wandb                    0.17.0
wheel                    0.43.0
xxhash                   3.4.1
yarl                     1.9.4

Changing accelerate to accelerate<=0.29.3 gives a working environment:

Package                  Version
------------------------ ---------------
accelerate               0.29.3

All other packages are identical to the list above.

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [X] My own task or dataset (give details below)

Reproduction

I am using code based on the code here: https://github.com/mallorbc/Finetune_LLMs

Otherwise, the basic steps are the following:

  1. Install the pip packages listed above, namely: pip install "accelerate<=0.29.3" followed by pip install transformers accelerate peft bitsandbytes trl
  2. Run a QLoRA FSDP training program
  3. Notice how errors occur with 0.30.0 but not 0.29.3
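The incompatible pairing can be caught before training even starts. Below is a minimal, hedged sketch of such a guard: the parse_version and qlora_fsdp_pairing_ok helpers are hypothetical (not part of accelerate, peft, or any library), and the threshold simply encodes the two environments from this report (accelerate 0.30.0 with peft 0.10.0 fails, accelerate 0.29.3 works).

```python
# Hypothetical guard: fail fast on the accelerate/peft pairing from this report.
# Neither helper below exists in any library; they are for illustration only.

def parse_version(version):
    """Parse 'X.Y.Z' into a tuple of ints for comparison (extra segments ignored)."""
    return tuple(int(part) for part in version.split(".")[:3])

def qlora_fsdp_pairing_ok(accelerate_version, peft_version):
    """Return False for the pairing this issue reports as broken:
    accelerate >= 0.30.0 together with peft <= 0.10.0."""
    broken = (parse_version(accelerate_version) >= (0, 30, 0)
              and parse_version(peft_version) <= (0, 10, 0))
    return not broken

# The two environments from this report:
print(qlora_fsdp_pairing_ok("0.30.0", "0.10.0"))  # False: the failing setup
print(qlora_fsdp_pairing_ok("0.29.3", "0.10.0"))  # True: the working setup
```

Dropping a check like this at the top of a training script turns the mid-run AttributeError below into an immediate, readable failure.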

With 0.30.0, I see an error like the following:

[rank0]: Traceback (most recent call last):
[rank0]:   File "trl_finetune.py", line 387, in <module>
[rank0]:     trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/trl/trainer/sft_trainer.py", line 361, in train
[rank0]:     output = super().train(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1859, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2001, in _inner_training_loop
[rank0]:     self._fsdp_qlora_plugin_updates()
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 4425, in _fsdp_qlora_plugin_updates
[rank0]:     fsdp_plugin.auto_wrap_policy = fsdp_auto_wrap_policy(self.model)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/peft/utils/other.py", line 396, in fsdp_auto_wrap_policy
[rank0]:     transformer_cls = FullyShardedDataParallelPlugin.get_module_class_from_name(model, layer_class)
[rank0]: AttributeError: type object 'FullyShardedDataParallelPlugin' has no attribute 'get_module_class_from_name'
[rank1]: Traceback (most recent call last):
[rank1]:   File "trl_finetune.py", line 387, in <module>
[rank1]:     trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/trl/trainer/sft_trainer.py", line 361, in train
[rank1]:     output = super().train(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1859, in train
[rank1]:     return inner_training_loop(
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2001, in _inner_training_loop
[rank1]:     self._fsdp_qlora_plugin_updates()
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 4425, in _fsdp_qlora_plugin_updates
[rank1]:     fsdp_plugin.auto_wrap_policy = fsdp_auto_wrap_policy(self.model)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/peft/utils/other.py", line 396, in fsdp_auto_wrap_policy
[rank1]:     transformer_cls = FullyShardedDataParallelPlugin.get_module_class_from_name(model, layer_class)
[rank1]: AttributeError: type object 'FullyShardedDataParallelPlugin' has no attribute 'get_module_class_from_name'
E0510 12:16:25.853937 140644343273280 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 140) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 1069, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
trl_finetune.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-05-10_12:16:25
  host      : f61090d2a6fd
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 141)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-10_12:16:25
  host      : f61090d2a6fd
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 140)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
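The AttributeError in the traceback comes from peft 0.10.0 looking up get_module_class_from_name as a staticmethod on FullyShardedDataParallelPlugin, a location that attribute apparently no longer occupies in accelerate 0.30.0. The helper itself only walks the model's submodule tree and returns the class whose name matches. Below is a torch-free sketch of that logic under stated assumptions: Node is a hypothetical stand-in for torch.nn.Module (only its children() method is mimicked), and TransformerLayer is an invented layer class to search for.

```python
# Torch-free sketch of the get_module_class_from_name logic: recursively
# search a module tree for a submodule whose class name matches `name`,
# and return that submodule's class (or None if no match is found).
# "Node" is a hypothetical stand-in for torch.nn.Module.

class Node:
    def __init__(self, children=()):
        self._children = list(children)

    def children(self):
        return self._children

class TransformerLayer(Node):
    """Hypothetical layer class that an FSDP auto-wrap policy would target."""

def get_module_class_from_name(module, name):
    """Return the class of the first (sub)module whose class name is `name`."""
    if module.__class__.__name__ == name:
        return module.__class__
    for child in module.children():
        found = get_module_class_from_name(child, name)
        if found is not None:
            return found
    return None

model = Node([Node([TransformerLayer()]), Node()])
print(get_module_class_from_name(model, "TransformerLayer"))  # the TransformerLayer class
print(get_module_class_from_name(model, "MissingLayer"))      # None
```

This is why the crash is a version-pairing problem rather than a bug in either library alone: the logic still exists, but peft 0.10.0 looks for it at the old attribute path.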

Expected behavior

I expect training to proceed without issues, as it does when I use accelerate 0.29.3.

mallorbc · May 10 '24 12:05