
[BUG] [Regression] Adam Offload Runtime Error with DeepSpeed v0.14.2

Open hijkzzz opened this issue 9 months ago • 3 comments

It works well with DeepSpeed v0.14.0

Reproduction setup: it only requires 1 A100 + LLaMA2 7B + Adam offload + ZeRO3 + the NVIDIA NGC 23.12 container. You can use the scripts from https://github.com/OpenLLMAI/OpenRLHF or any other training script, such as the SFT script: https://github.com/OpenLLMAI/OpenRLHF/blob/main/examples/scripts/train_sft_llama.sh
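
For readers without the OpenRLHF scripts at hand, the ZeRO3 + Adam offload combination corresponds roughly to a DeepSpeed config like the sketch below. The batch size, learning rate, and bf16 setting are illustrative assumptions, not values taken from the scripts above, and `uses_cpu_adam_offload` is a hypothetical helper added only to make the relevant keys explicit.

```python
import json

# Illustrative DeepSpeed config (assumed values, not the reporter's actual settings).
# The keys relevant to this issue are zero_optimization.stage and offload_optimizer.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # assumption: small batch for a single GPU
    "bf16": {"enabled": True},            # assumption: bf16 training on an A100
    "optimizer": {"type": "Adam", "params": {"lr": 1e-5}},  # assumed learning rate
    "zero_optimization": {
        "stage": 3,  # ZeRO3: partition parameters, gradients, and optimizer state
        "offload_optimizer": {"device": "cpu", "pin_memory": True},  # Adam offload
    },
}


def uses_cpu_adam_offload(config: dict) -> bool:
    """Hypothetical helper: True if the config combines ZeRO3 with CPU optimizer offload."""
    zero = config.get("zero_optimization", {})
    offload = zero.get("offload_optimizer", {})
    return zero.get("stage") == 3 and offload.get("device") == "cpu"


if __name__ == "__main__":
    # Print the config as JSON so it can be saved and passed to a training script.
    print(json.dumps(ds_config, indent=2))
    print("ZeRO3 + CPU optimizer offload:", uses_cpu_adam_offload(ds_config))
```

If a config of this shape trains on v0.14.0 but fails on v0.14.2 with the error below, that matches the regression described here.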

logs:

 RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and CPU!


hijkzzz avatar Apr 26 '24 03:04 hijkzzz

Same problem when I train Qwen1.5-32B. I'd like to know why this happens.

muzhi1991 avatar Apr 27 '24 14:04 muzhi1991

Same problem. I'm training Mixtral8x7B using LLaMA-Factory on dual RTX 4090 with ZeRO3 + optim offload + param offload.

QingyuanWang avatar Apr 29 '24 03:04 QingyuanWang

Hi @hijkzzz / @muzhi1991 / @QingyuanWang - can you test with the latest code in master to see if this is resolved? We believe this was resolved in this commit, but it would be good to confirm for your use case.

loadams avatar May 01 '24 16:05 loadams

I've faced the issue while running this example from Ray https://github.com/ray-project/ray/blob/master/doc/source/templates/04_finetuning_llms_with_deepspeed/deepspeed_configs/zero_3_llama_2_13b.json and I confirm it works with the reverted commit bc48371c5e1fb8fd70fc79285e66201dbb65679b. Thanks!

astefanutti avatar Jun 13 '24 15:06 astefanutti