
Latest code, deepspeed (0.14.2), 4*A40 with deepspeed zero3_offload, fine-tuning Qwen110B; the system reports: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu

Open camposs1979 opened this issue 9 months ago • 3 comments

Reminder

  • [X] I have read the README and searched the existing issues.

Reproduction

ds_zero3.sh:

#!/bin/bash
# ZeRO-3 enables weight sharding on multiple GPUs

python3.10 -m pip install "bitsandbytes>=0.43.0"

deepspeed --num_gpus 4 ../../src/train.py \
    --deepspeed ../deepspeed/ds_z3_offload_config.json \
    --stage sft \
    --do_train \
    --model_name_or_path ../../model/qwen/Qwen1.5-110B-Chat \
    --dataset Kee_Instruction_NewEstabalish \
    --dataset_dir ../../data \
    --template qwen \
    --lora_rank 128 \
    --lora_alpha 256 \
    --lora_target all \
    --output_dir ../../saves/Qwen/lora/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 6000 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 200 \
    --eval_steps 200 \
    --evaluation_strategy steps \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --plot_loss \
    --fp16

The configuration of ds_z3_offload_config.json is as follows:

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1000000000,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1000000000,
    "stage3_max_reuse_distance": 1000000000,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
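As a quick sanity check before launching, the offload settings can be validated programmatically. This is a minimal sketch: the embedded JSON fragment mirrors the key names of the config above, not the full file.

```python
import json

# Sketch: verify the ZeRO-3 offload settings that zero3_offload relies on.
# Key names follow ds_z3_offload_config.json; only the relevant subset is shown.
CONFIG_TEXT = """
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {"device": "cpu", "pin_memory": true},
    "offload_param": {"device": "cpu", "pin_memory": true}
  }
}
"""

cfg = json.loads(CONFIG_TEXT)
zero = cfg["zero_optimization"]
assert zero["stage"] == 3, "zero3_offload requires ZeRO stage 3"
assert zero["offload_optimizer"]["device"] == "cpu"
assert zero["offload_param"]["device"] == "cpu"
print("zero3_offload config OK")
```

In a real launch script the same check can be run against the file passed via --deepspeed before the job starts, which fails fast on a malformed config instead of after model loading.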

Expected behavior

Expect to be able to fine-tune Qwen1.5-110B with 4 * A40 (192 GB) + deepspeed + zero3.

System Info

(base) root@I19c2837ff800901ccf:/# python3.10 -m pip list
Package Version


accelerate 0.28.0 addict 2.4.0 aiofiles 23.2.1 aiohttp 3.9.3 aiosignal 1.3.1 aliyun-python-sdk-core 2.15.0 aliyun-python-sdk-kms 2.16.2 altair 5.2.0 annotated-types 0.6.0 anyio 4.3.0 async-timeout 4.0.3 attrs 23.2.0 auto_gptq 0.7.1 bitsandbytes 0.43.0 certifi 2019.11.28 cffi 1.16.0 chardet 3.0.4 charset-normalizer 3.3.2 click 8.1.7 cloudpickle 3.0.0 contourpy 1.2.0 crcmod 1.7 cryptography 42.0.5 cupy-cuda12x 12.1.0 cycler 0.12.1 datasets 2.18.0 dbus-python 1.2.16 deepspeed 0.14.2 dill 0.3.8 diskcache 5.6.3 distro 1.4.0 distro-info 0.23ubuntu1 docstring_parser 0.16 einops 0.7.0 exceptiongroup 1.2.0 fastapi 0.110.0 fastrlock 0.8.2 ffmpy 0.3.2 filelock 3.13.3 fire 0.6.0 fonttools 4.50.0 frozenlist 1.4.1 fsspec 2024.2.0 galore-torch 1.0 gast 0.5.4 gekko 1.0.7 gradio 3.50.2 gradio_client 0.6.1 h11 0.14.0 hjson 3.1.0 httpcore 1.0.4 httptools 0.6.1 httpx 0.27.0 huggingface-hub 0.22.0 idna 2.8 importlib_metadata 7.1.0 importlib_resources 6.4.0 interegular 0.3.3 Jinja2 3.1.3 jmespath 0.10.0 joblib 1.3.2 jsonschema 4.21.1 jsonschema-specifications 2023.12.1 kiwisolver 1.4.5 lark 1.1.9 llvmlite 0.42.0 markdown-it-py 3.0.0 MarkupSafe 2.1.5 matplotlib 3.8.3 mdurl 0.1.2 modelscope 1.13.3 mpmath 1.3.0 msgpack 1.0.8 multidict 6.0.5 multiprocess 0.70.16 nest-asyncio 1.6.0 networkx 3.2.1 ninja 1.11.1.1 numba 0.59.1 numpy 1.26.4 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 8.9.2.26 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu12 12.1.0.106 nvidia-nccl-cu12 2.18.1 nvidia-nvjitlink-cu12 12.4.99 nvidia-nvtx-cu12 12.1.105 orjson 3.9.15 oss2 2.18.4 outlines 0.0.37 packaging 24.0 pandas 2.2.1 peft 0.10.0 pillow 10.2.0 pip 24.0 platformdirs 4.2.0 prometheus_client 0.20.0 protobuf 5.26.0 psutil 5.9.8 py-cpuinfo 9.0.0 pyarrow 15.0.2 pyarrow-hotfix 0.6 pycparser 2.21 pycryptodome 3.20.0 pydantic 2.6.4 pydantic_core 2.16.3 pydub 0.25.1 
Pygments 2.17.2 PyGObject 3.36.0 pynvml 11.5.0 pyparsing 3.1.2 python-apt 2.0.1+ubuntu0.20.4.1 python-dateutil 2.9.0.post0 python-dotenv 1.0.1 python-multipart 0.0.9 pytz 2024.1 PyYAML 6.0.1 ray 2.10.0 referencing 0.34.0 regex 2023.12.25 requests 2.31.0 requests-unixsocket 0.2.0 rich 13.7.1 rouge 1.0.1 rpds-py 0.18.0 safetensors 0.4.2 scipy 1.12.0 semantic-version 2.10.0 sentencepiece 0.2.0 setuptools 69.2.0 shtab 1.7.1 simplejson 3.19.2 six 1.14.0 sniffio 1.3.1 sortedcontainers 2.4.0 sse-starlette 2.0.0 ssh-import-id 5.10 starlette 0.36.3 sympy 1.12 termcolor 2.4.0 tokenizers 0.15.2 tomli 2.0.1 toolz 0.12.1 torch 2.1.2 tqdm 4.66.2 transformers 4.39.1 triton 2.1.0 trl 0.8.1 typing_extensions 4.10.0 tyro 0.7.3 tzdata 2024.1 unattended-upgrades 0.1 urllib3 2.2.1 uvicorn 0.29.0 uvloop 0.19.0 vllm 0.3.3 watchfiles 0.21.0 websockets 11.0.3 wheel 0.34.2 xformers 0.0.23.post1 xxhash 3.4.1 yapf 0.40.2 yarl 1.9.4 zipp 3.18.1

Others

The error output is as follows:

....
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1995, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/deepspeed.py", line 175, in backward
    self.engine.step()
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 2169, in step
    self._take_model_step(lr_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 2075, in _take_model_step
    self.optimizer.step()
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage3.py", line 2047, in step
    self.unscale_and_clip_grads(sub_group_id, scaled_global_grad_norm)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage3.py", line 2117, in unscale_and_clip_grads
    self.fp32_partitioned_groups_flat[sub_group_id].grad.mul_(1. / combined_scale)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
  0%| | 0/1011 [02:39<?, ?it/s]
[2024-05-06 04:36:08,442] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2728
[2024-05-06 04:36:29,982] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2729
[2024-05-06 04:36:29,984] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2730
[2024-05-06 04:36:29,985] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2731
[2024-05-06 04:36:29,986] [ERROR] [launch.py:325:sigkill_handler] ['/usr/bin/python3.10', ' ...
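The failing line multiplies the flat fp32 partition's gradient by the inverse loss scale, which requires the partition and its gradient to be on the same device; with zero3_offload the gradients live on CPU, so a CUDA-resident operand triggers the mismatch. A CPU-only sketch of that invariant (variable names are illustrative, not DeepSpeed's internal API):

```python
import torch

# CPU-only sketch of the invariant behind the error above: before
# unscale_and_clip_grads calls grad.mul_(1/scale), the flat fp32
# partition and its gradient must sit on the same device.
flat_partition = torch.zeros(4)      # stands in for fp32_partitioned_groups_flat[i]
flat_partition.grad = torch.ones(4)  # gradient, here on the same (CPU) device

assert flat_partition.device == flat_partition.grad.device  # same device -> mul_ is safe
combined_scale = 2.0                 # a plain Python float never causes a device mismatch
flat_partition.grad.mul_(1.0 / combined_scale)
print(flat_partition.grad.tolist())  # [0.5, 0.5, 0.5, 0.5]
```

When the two tensors are on different devices (e.g. the grad on cpu and the other operand on cuda:0), the same in-place multiply raises exactly the RuntimeError shown in the traceback.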

camposs1979 avatar May 06 '24 04:05 camposs1979

Hi, I ran into the same problem while tuning llama-2. After downgrading deepspeed to 0.14.0 it ran fine!

hank0316 avatar May 06 '24 10:05 hank0316
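The fix described in the previous comment amounts to a one-line version pin. A minimal sketch (this script only echoes the command so nothing is installed when it runs; execute the printed command in the actual training environment):

```shell
# Workaround reported in this thread: pin deepspeed to 0.14.0 instead of 0.14.2.
PIN="deepspeed==0.14.0"
CMD="python3.10 -m pip install ${PIN}"
echo "${CMD}"
```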

> Hi, I ran into the same problem while tuning llama-2. After downgrading deepspeed to 0.14.0 it ran fine!

Thank you very much for the reply; I will roll back the version and try it right away.

camposs1979 avatar May 06 '24 10:05 camposs1979

> Hi, I ran into the same problem while tuning llama-2. After downgrading deepspeed to 0.14.0 it ran fine!

Verified, it does work. Many thanks!

camposs1979 avatar May 06 '24 10:05 camposs1979

> Hi, I ran into the same problem while tuning llama-2. After downgrading deepspeed to 0.14.0 it ran fine!

Thanks, it works.

CSZHK avatar May 22 '24 07:05 CSZHK