[BUG] [Regression] Adam Offload Runtime Error with DeepSpeed v0.14.2
It works well with DeepSpeed v0.14.0
Reproduce config: it only requires 1 A100 + LLaMA2 7B + Adam offload + ZeRO3 + the NVIDIA NGC 23.12 container. You can use the scripts from https://github.com/OpenLLMAI/OpenRLHF, or any other training script such as the SFT script: https://github.com/OpenLLMAI/OpenRLHF/blob/main/examples/scripts/train_sft_llama.sh
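For reference, the Adam-offload part of the setup boils down to a DeepSpeed config along these lines (a minimal sketch using DeepSpeed's documented config keys, written as the Python dict that OpenRLHF-style scripts pass to the engine; the precision and batch-size values are placeholders, not taken from the linked scripts):

```python
# Sketch of a ZeRO-3 + CPU Adam offload config that exercises the failing path.
ds_config = {
    "zero_optimization": {
        "stage": 3,  # ZeRO stage 3: partition params, grads, and optimizer states
        "offload_optimizer": {"device": "cpu", "pin_memory": True},  # Adam offload
        "offload_param": {"device": "cpu", "pin_memory": True},      # param offload (used in some reports below)
    },
    "bf16": {"enabled": True},             # placeholder precision setting
    "train_micro_batch_size_per_gpu": 1,   # placeholder batch size
    "gradient_accumulation_steps": 1,
}
```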
logs:

```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
```
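For context, this is PyTorch's generic error for any op that mixes CPU and CUDA tensors, so it suggests an optimizer state or gradient ended up on the wrong device. A trivial standalone illustration of the error class (not the actual DeepSpeed code path):

```python
import torch

# Mixing a CUDA tensor and a CPU tensor in one op raises the same
# class of error as in the log above (illustration only).
gpu_grad = torch.zeros(4, device="cuda:0")
cpu_state = torch.zeros(4)  # e.g. an Adam state left on the CPU by the offload path
gpu_grad.add_(cpu_state)
# RuntimeError: Expected all tensors to be on the same device,
# but found at least two devices, cuda:0 and cpu!
```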
Same problem when I train Qwen1.5-32B. I'd like to know why this happens.
Same problem. I'm training Mixtral 8x7B with LLaMA-Factory on dual RTX 4090s, with ZeRO3 + optimizer offload + parameter offload.
Hi @hijkzzz / @muzhi1991 / @QingyuanWang - can you test with the latest code in master to see if this is resolved? We believe this was fixed in this commit, but it would be good to confirm for your use case.
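(If it helps with testing: a master build can typically be installed straight from GitHub with `pip install git+https://github.com/microsoft/DeepSpeed.git`.)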
I've faced the issue while running this example from Ray: https://github.com/ray-project/ray/blob/master/doc/source/templates/04_finetuning_llms_with_deepspeed/deepspeed_configs/zero_3_llama_2_13b.json, and I can confirm it works with the reverted commit bc48371c5e1fb8fd70fc79285e66201dbb65679b. Thanks!