DeepSpeed [BUG] [Regression] Adam Offload Runtime Error with DeepSpeed v0.14.2

Open hijkzzz opened this issue 9 months ago • 3 comments

It works well with DeepSpeed v0.14.0.

Reproduction config: requires only 1 A100 + LLaMA2 7B + Adam offload + ZeRO3 + the NVIDIA NGC 23.12 container. You can use the scripts from https://github.com/OpenLLMAI/OpenRLHF or any other training script, such as the SFT script: https://github.com/OpenLLMAI/OpenRLHF/blob/main/examples/scripts/train_sft_llama.sh
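
For anyone reproducing this without the OpenRLHF scripts, the ZeRO3 + Adam offload part of the setup corresponds roughly to a DeepSpeed config like the sketch below. This is an assumption about what a minimal config looks like, not the exact config used by the scripts above: the batch size, `bf16`, and `pin_memory` values are illustrative, and `uses_cpu_adam_offload` is a hypothetical helper added only to check the dict.

```python
import json

# Hypothetical minimal config: values are illustrative and NOT copied
# from the OpenRLHF scripts referenced in this issue.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # ZeRO3: partition params, gradients, optimizer state
        "offload_optimizer": {
            "device": "cpu",     # keep Adam optimizer state on the CPU
            "pin_memory": True,  # assumption: pinned host memory
        },
    },
}


def uses_cpu_adam_offload(config: dict) -> bool:
    """Hypothetical helper: True if the config enables ZeRO3 with the
    optimizer state offloaded to the CPU."""
    zero = config.get("zero_optimization", {})
    offload = zero.get("offload_optimizer", {})
    return zero.get("stage") == 3 and offload.get("device") == "cpu"


# Dump as JSON so it can be saved to a file and passed to a training script.
print(json.dumps(ds_config, indent=2))
```

A dict like this can be passed as the `config` argument of `deepspeed.initialize`, or saved as JSON and given to a training script. Whether this minimal form alone triggers the error on v0.14.2 has not been verified here.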

logs:

 RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and CPU!

Apr 26 '24 03:04 hijkzzz

Same problem when I train Qwen1.5-32B. I want to know why this happens.

Apr 27 '24 14:04 muzhi1991

Same problem. I'm training Mixtral8x7B using LLaMA-Factory on dual RTX 4090s with ZeRO3 + optim offload + param offload.

Apr 29 '24 03:04 QingyuanWang

Hi @hijkzzz / @muzhi1991 / @QingyuanWang - can you test with the latest code in master to see if this is resolved? We believe this was resolved in this commit, but it would be good to confirm for your use case.

May 01 '24 16:05 loadams

I hit this issue while running this example from Ray: https://github.com/ray-project/ray/blob/master/doc/source/templates/04_finetuning_llms_with_deepspeed/deepspeed_configs/zero_3_llama_2_13b.json. I can confirm it works with the reverted commit bc48371c5e1fb8fd70fc79285e66201dbb65679b. Thanks!

Jun 13 '24 15:06 astefanutti
