[BUG] Tensors are on different devices when model.step()
Describe the bug The behavior is the same as what is reported in #4565. When calling model.step() with ZeRO-3, tensors end up on different devices. I modified stage3.py#L2117 to self.fp32_partitioned_groups_flat[sub_group_id].grad.mul_(1. / combined_scale.item()), and it seems to work. I am not sure whether this is a bug or a mistake in my training script.
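For clarity, the modification I tried is just this one line in unscale_and_clip_grads (shown next to the original for comparison; I am not proposing it as an official fix):

# deepspeed/runtime/zero/stage3.py, unscale_and_clip_grads (around L2117 in 0.14.1)
# original line, an in-place op that mixes devices when combined_scale is a tensor
# living on a different device than the gradient partition:
#   self.fp32_partitioned_groups_flat[sub_group_id].grad.mul_(1. / combined_scale)
# my change: convert the scale to a Python float first, so the op stays on one device
self.fp32_partitioned_groups_flat[sub_group_id].grad.mul_(1. / combined_scale.item())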
My code
My training code is taken from here: https://github.com/liucongg/ChatGLM-Finetuning/blob/master/train.py, and I run it with:
CUDA_VISIBLE_DEVICES=0 deepspeed train.py --train_path data/spo_0.json --model_name_or_path ./opensource --per_device_train_batch_size 8 --max_len 4096 --max_src_len 2048 --learning_rate 1e-4 --weight_decay 0.1 --num_train_epochs 2 --gradient_accumulation_step 4 --warmup_ratio 0.1 --mode glm3 --train_type all --seed 1234 --ds_file ds_zero3_offload.json --gradient_checkpointing --show_loss_step 10 --output_dir ./output-glm3
Thanks very much for your precious time!
I have the same issue here when running the following script. https://github.com/allenai/open-instruct/blob/main/scripts/dpo_train_with_accelerate.sh. I notice that version 0.14.0 does not have this issue.
I have the same issue with 0.14.1 when running a similar training script; version 0.14.0 works. I cross-checked with torch 2.2.1 and 2.2.2 and transformers 4.39.0 and 4.39.3, and the issue occurs with 0.14.1 across all combinations.
Here is a snippet of the backtrace (sorry that I cannot provide the Python code; I hope it helps):
File "/home/x/work/.venv/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2169, in step
self._take_model_step(lr_kwargs)
File "/home/x/work/.venv/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2075, in _take_model_step
self.optimizer.step()
File "/home/x/work/.venv/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/x/work/.venv/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 2047, in step
self.unscale_and_clip_grads(sub_group_id, scaled_global_grad_norm)
File "/home/x/work/.venv/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/x/work/.venv/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 2117, in unscale_and_clip_grads
self.fp32_partitioned_groups_flat[sub_group_id].grad.mul_(1. / combined_scale)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
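For anyone who wants a minimal, standalone illustration of this failure mode (a sketch only: the shapes and devices below are assumptions chosen to mirror the error message, not DeepSpeed's exact internals, and a CUDA device is required):

import torch

# A flat fp32 gradient partition on the GPU and a combined scale that ends up as a
# (non-scalar) CPU tensor; which side sits on which device depends on the offload setup.
grad = torch.ones(8, device="cuda:0")
combined_scale = torch.tensor([2.0])

try:
    grad.mul_(1.0 / combined_scale)  # in-place op mixing cuda:0 and cpu
except RuntimeError as err:
    print(err)  # Expected all tensors to be on the same device, ... cuda:0 and cpu!

# Pulling the scale out as a Python float sidesteps the device mismatch entirely.
grad.mul_(1.0 / combined_scale.item())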
@tjruwase @mrwyattii unscale_and_clip_grads's last update was two years ago. It might be the recent change at L2030 due to the use of norm(): https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage3.py#L2030
The commit: https://github.com/microsoft/DeepSpeed/commit/54c06872647ca60699f752e60ac1643bd05aa63c
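My reading of why the norm() change matters (a hedged sketch, not a quote of the upstream code): the old reduction yielded a plain Python float, while norm() returns a 0-dim tensor that carries a device, so everything derived from it (clip, combined_scale) becomes a device tensor that can then meet a gradient stored elsewhere.

import torch

# Per-group gradient norms; the values and the CUDA device are illustrative assumptions.
norm_groups = [torch.tensor(3.0, device="cuda:0"), torch.tensor(4.0, device="cuda:0")]

# Float-style reduction: later arithmetic (clip, combined_scale) stays device-free.
global_norm_as_float = sum(n.item() ** 2 for n in norm_groups) ** 0.5

# norm()-style reduction: the result is a 0-dim CUDA tensor, and any scale derived
# from it is a tensor too, ready to collide with a tensor stored on another device.
global_norm_as_tensor = torch.linalg.norm(torch.stack(norm_groups))

print(type(global_norm_as_float))    # <class 'float'>
print(global_norm_as_tensor.device)  # cuda:0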
@wuxb45, @Heathcliff-Zhao, and @cloudwaysX thanks for reporting and triaging this issue.
Yes, I have the same issue when I use DeepSpeed version 0.14.1, so I did the following:
pip uninstall deepspeed
pip install deepspeed==0.14.0
After switching to DeepSpeed 0.14.0, it worked!
@Heathcliff-Zhao I am struggling to repro your code.
@jomayeri Can you show all the files in the thudm/chatglm-6b folder? Are there tokenizer-related files in it?
@Heathcliff-Zhao there are no tokenizer files in it.
@jomayeri The folder should contain these files below. You can download the missing file from https://huggingface.co/THUDM/chatglm3-6b/tree/main
@Heathcliff-Zhao What is the command to repro with a fresh checkout of your repo? --model_name_or_path ./opensource is incorrect because that directory does not exist, and specifying --model_name_or_path THUDM/chatglm3-6b to download the model from Hugging Face also does not work.
@jomayeri I use the command from the issue description above. ./opensource in my command is a soft link to the model path; you should modify it to point to wherever you saved the model weights. I suggest you double-check whether there is a tokenizer.model file in THUDM/chatglm3-6b.
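If it helps, here is a quick way to check for the file and fetch it when it is missing (a sketch using huggingface_hub; the ./opensource path and network access to the Hub are assumptions):

from pathlib import Path
from huggingface_hub import hf_hub_download

model_dir = Path("./opensource")  # local directory (or symlink) holding the model weights

# Download tokenizer.model from the Hub only if it is not already present locally.
if not (model_dir / "tokenizer.model").exists():
    path = hf_hub_download(repo_id="THUDM/chatglm3-6b",
                           filename="tokenizer.model",
                           local_dir=model_dir)
    print(f"downloaded tokenizer.model to {path}")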
Version 0.14.2 still has the same issue.
I can reproduce this in 0.14.2. The change seems to have been reverted in #5461 the day after 0.14.2, so it is likely fixed in the next version, 0.14.3.
Hi @kno10 - can you confirm that things work if you build from master?
I did not try. I downgraded to 0.14.0 to get things back running as quickly as possible.
@kno10 - makes sense, thanks. Just wanted to confirm this fixed the issue you were hitting.
Closing with the same comment as https://github.com/microsoft/DeepSpeed/issues/5538.