Accelerate 0.30.0 Breaks FSDP QLoRA
System Info
Below is the pip list output of an environment that does not work:
Package Version
------------------------ ---------------
accelerate 0.30.0
aiohttp 3.9.5
aiosignal 1.3.1
annotated-types 0.6.0
async-timeout 4.0.3
attrs 23.2.0
bitsandbytes 0.43.1
certifi 2024.2.2
charset-normalizer 3.3.2
click 8.1.7
datasets 2.19.1
deepspeed 0.14.2+5f631abc
dill 0.3.8
docker-pycreds 0.4.0
docstring_parser 0.16
einops 0.8.0
eval_type_backport 0.2.0
exceptiongroup 1.2.1
filelock 3.14.0
flash-attn 2.5.8
frozenlist 1.4.1
fsspec 2024.3.1
gitdb 4.0.11
GitPython 3.1.43
hf_transfer 0.1.6
hjson 3.1.0
huggingface-hub 0.23.0
idna 3.7
iniconfig 2.0.0
Jinja2 3.1.4
markdown-it-py 3.0.0
MarkupSafe 2.1.5
mdurl 0.1.2
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.16
networkx 3.1
ninja 1.11.1.1
numpy 1.24.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.1.105
packaging 24.0
pandas 2.0.3
peft 0.10.0
pillow 10.3.0
pip 24.0
platformdirs 4.2.1
pluggy 1.5.0
protobuf 3.20.1
psutil 5.9.8
py-cpuinfo 9.0.0
pyarrow 16.0.0
pyarrow-hotfix 0.6
pydantic 2.7.1
pydantic_core 2.18.2
Pygments 2.18.0
pynvml 11.5.0
pytest 8.2.0
python-dateutil 2.9.0.post0
pytz 2024.1
PyYAML 6.0.1
regex 2024.5.10
requests 2.31.0
rich 13.7.1
safetensors 0.4.3
scipy 1.10.1
sentencepiece 0.2.0
sentry-sdk 2.1.1
setproctitle 1.3.3
setuptools 69.5.1
shtab 1.7.1
six 1.16.0
smmap 5.0.1
sympy 1.12
text-generation 0.7.0
tokenizers 0.19.1
tomli 2.0.1
torch 2.3.0
torchaudio 2.3.0
torchvision 0.18.0
tqdm 4.66.4
transformers 4.40.2
triton 2.3.0
trl 0.8.6
typing_extensions 4.11.0
tyro 0.8.4
tzdata 2024.1
urllib3 2.2.1
wandb 0.17.0
wheel 0.43.0
xxhash 3.4.1
yarl 1.9.4
Changing accelerate to accelerate<=0.29.3 gives a working environment. The resulting pip list is identical to the one above except for:
Package Version
------------------------ ---------------
accelerate 0.29.3
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
- [X] My own task or dataset (give details below)
Reproduction
I am using code based on https://github.com/mallorbc/Finetune_LLMs. Otherwise, the basic steps are the following:
- Install the pip packages listed above, namely: pip install transformers accelerate peft bitsandbytes trl (for the working setup, pin accelerate with pip install "accelerate<=0.29.3")
- Run an FSDP QLoRA training program (a minimal sketch follows this list)
- Notice how errors occur with 0.30.0 but not with 0.29.3
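For context, here is a minimal sketch of the kind of FSDP QLoRA script involved. It is illustrative only, not the exact trl_finetune.py: the model ID, dataset, and hyperparameters are placeholders, and it assumes the script is started via accelerate launch with an FSDP config.

# Minimal FSDP QLoRA sketch (illustrative placeholders, not the exact trl_finetune.py).
# Run with: accelerate launch --config_file fsdp_config.yaml train.py
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTTrainer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,  # needed for FSDP QLoRA
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

dataset = load_dataset("imdb", split="train[:1%]")  # placeholder dataset

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=512,
)
# Under accelerate 0.30.0 this raises the AttributeError shown below inside
# Trainer._fsdp_qlora_plugin_updates(); under 0.29.3 it trains normally.
trainer.train()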
With 0.30.0, I see an error like the following:
[rank0]: Traceback (most recent call last):
[rank0]: File "trl_finetune.py", line 387, in <module>
[rank0]: trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
[rank0]: File "/usr/local/lib/python3.8/dist-packages/trl/trainer/sft_trainer.py", line 361, in train
[rank0]: output = super().train(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1859, in train
[rank0]: return inner_training_loop(
[rank0]: File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2001, in _inner_training_loop
[rank0]: self._fsdp_qlora_plugin_updates()
[rank0]: File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 4425, in _fsdp_qlora_plugin_updates
[rank0]: fsdp_plugin.auto_wrap_policy = fsdp_auto_wrap_policy(self.model)
[rank0]: File "/usr/local/lib/python3.8/dist-packages/peft/utils/other.py", line 396, in fsdp_auto_wrap_policy
[rank0]: transformer_cls = FullyShardedDataParallelPlugin.get_module_class_from_name(model, layer_class)
[rank0]: AttributeError: type object 'FullyShardedDataParallelPlugin' has no attribute 'get_module_class_from_name'
[rank1]: Traceback (most recent call last): (identical to rank 0, ending in the same AttributeError)
E0510 12:16:25.853937 140644343273280 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 140) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 1069, in launch_command
multi_gpu_launcher(args)
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
distrib_run.run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
trl_finetune.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-05-10_12:16:25
host : f61090d2a6fd
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 141)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-05-10_12:16:25
host : f61090d2a6fd
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 140)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Expected behavior
I expect training to run without issues, as it does with accelerate 0.29.3.
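The traceback suggests that peft 0.10.0 calls FullyShardedDataParallelPlugin.get_module_class_from_name, a staticmethod that accelerate 0.30.0 no longer exposes on that class. Until the packages are reconciled, a compatibility shim at the top of the training script may work around it; this is a sketch under the assumption that accelerate 0.30.0 still ships the equivalent module-level helper (the import path is an implementation detail and may change between releases):

# Compatibility shim sketch: re-expose the staticmethod that peft 0.10.0
# expects on FullyShardedDataParallelPlugin. Assumes accelerate still provides
# get_module_class_from_name as a utility function; the import path is an
# accelerate implementation detail, not a stable public API.
from accelerate import FullyShardedDataParallelPlugin

try:
    from accelerate.utils import get_module_class_from_name
except ImportError:
    from accelerate.utils.dataclasses import get_module_class_from_name

if not hasattr(FullyShardedDataParallelPlugin, "get_module_class_from_name"):
    FullyShardedDataParallelPlugin.get_module_class_from_name = staticmethod(
        get_module_class_from_name
    )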