
[BUG] Tensors are on different devices when model.step()

yuezhao238 opened this issue 10 months ago • 16 comments

Describe the bug The behavior is the same as what is reported in #4565. When calling model.step() with ZeRO stage 3, tensors end up on different devices. I modified stage3.py#L2117 to self.fp32_partitioned_groups_flat[sub_group_id].grad.mul_(1. / combined_scale.item()), and it seems to fix the problem. I am not sure whether this is a bug or a mistake in my training script.
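
For reference, a minimal sketch of why the .item() change helps (not DeepSpeed code; the tensor placement is assumed to mimic ZeRO-3 with CPU offload): an in-place multiply of a CPU tensor by a CUDA-resident scale tensor raises the device-mismatch error, while a Python float does not.

    import torch

    grad = torch.ones(4)                # stand-in for an fp32 partition kept on CPU under offload (assumed)
    combined_scale = torch.tensor(2.0)  # stand-in for a loss scale computed as a tensor
    if torch.cuda.is_available():
        combined_scale = combined_scale.cuda()
        try:
            grad.mul_(1. / combined_scale)        # mixes cpu and cuda:0 -> RuntimeError
        except RuntimeError as e:
            print(e)                              # "Expected all tensors to be on the same device ..."
    grad.mul_(1. / combined_scale.item())         # .item() returns a Python float, so the device no longer matters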

My code My training code is taken from here: https://github.com/liucongg/ChatGLM-Finetuning/blob/master/train.py, and I run this with CUDA_VISIBLE_DEVICES=0 deepspeed train.py --train_path data/spo_0.json --model_name_or_path ./opensource --per_device_train_batch_size 8 --max_len 4096 --max_src_len 2048 --learning_rate 1e-4 --weight_decay 0.1 --num_train_epochs 2 --gradient_accumulation_step 4 --warmup_ratio 0.1 --mode glm3 --train_type all --seed 1234 --ds_file ds_zero3_offload.json --gradient_checkpointing --show_loss_step 10 --output_dir ./output-glm3

Thanks very much for your precious time!

yuezhao238 avatar Apr 16 '24 18:04 yuezhao238

I have the same issue here when running the following script. https://github.com/allenai/open-instruct/blob/main/scripts/dpo_train_with_accelerate.sh. I notice that version 0.14.0 does not have this issue.

cloudwaysX avatar Apr 17 '24 02:04 cloudwaysX

I have the same issue with 0.14.1 when running a similar training script. Version 0.14.0 works. I cross-checked with torch 2.2.1 and 2.2.2 and transformers 4.39.0 and 4.39.3; the issue appears with 0.14.1 across all combinations.

Here is a snippet of the backtrace (sorry, I cannot share the Python code; I hope it helps):

  File "/home/x/work/.venv/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2169, in step
    self._take_model_step(lr_kwargs)
  File "/home/x/work/.venv/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2075, in _take_model_step
    self.optimizer.step()
  File "/home/x/work/.venv/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/work/.venv/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 2047, in step
    self.unscale_and_clip_grads(sub_group_id, scaled_global_grad_norm)
  File "/home/x/work/.venv/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/work/.venv/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 2117, in unscale_and_clip_grads
    self.fp32_partitioned_groups_flat[sub_group_id].grad.mul_(1. / combined_scale)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

wuxb45 avatar Apr 17 '24 08:04 wuxb45

@tjruwase @mrwyattii unscale_and_clip_grads was last updated two years ago, so it might be the recent change at L2030 that uses norm(): https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage3.py#L2030

The commit: https://github.com/microsoft/DeepSpeed/commit/54c06872647ca60699f752e60ac1643bd05aa63c
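
If that diagnosis is right, a defensive fix along the lines the reporters describe would be to coerce the combined scale to a Python float before the in-place multiply, so it no longer matters on which device the norm computation left it. A minimal sketch with made-up names (this is not the actual DeepSpeed patch):

    import torch

    def unscale_grad(grad: torch.Tensor, combined_scale) -> None:
        # Accept either a plain float or a tensor on any device.
        if torch.is_tensor(combined_scale):
            combined_scale = combined_scale.item()   # device-agnostic Python float
        grad.mul_(1. / combined_scale)               # in-place unscale on whatever device grad lives on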

wuxb45 avatar Apr 17 '24 08:04 wuxb45

@wuxb45, @Heathcliff-Zhao, and @cloudwaysX thanks for reporting and triaging this issue.

tjruwase avatar Apr 17 '24 15:04 tjruwase

Yes, I have the same issue with DeepSpeed 0.14.1, so I did this:

pip uninstall deepspeed 
pip install deepspeed==0.14.0

After switching to DeepSpeed 0.14.0, it worked!

Kwen-Chen avatar Apr 18 '24 06:04 Kwen-Chen

@Heathcliff-Zhao I am struggling to repro your code. [screenshot of the failed repro attempt omitted]

jomayeri avatar Apr 18 '24 22:04 jomayeri

@jomayeri Can you show all files in the thudm/chatglm-6b folder? Are there tokenizer-related files in it?

yuezhao238 avatar Apr 19 '24 04:04 yuezhao238

@Heathcliff-Zhao there are no tokenizer files in it.

jomayeri avatar Apr 19 '24 18:04 jomayeri

@jomayeri The folder should contain the files shown below. You can download any missing files from https://huggingface.co/THUDM/chatglm3-6b/tree/main. [screenshot of the expected file listing omitted]

yuezhao238 avatar Apr 20 '24 04:04 yuezhao238

@Heathcliff-Zhao What is the command to repro with a fresh checkout of your repo? --model_name_or_path ./opensource does not work because that directory does not exist, and specifying --model_name_or_path THUDM/chatglm3-6b to download the model from Hugging Face also fails.

jomayeri avatar Apr 22 '24 20:04 jomayeri

@jomayeri I use the command from my original post above. ./opensource in that command is a soft link to the model path; you should change it to wherever you saved the model weights. I suggest you double-check whether there is a tokenizer.model file in THUDM/chatglm3-6b.

yuezhao238 avatar Apr 22 '24 21:04 yuezhao238

Version 0.14.2 still has the same issue.

wuxb45 avatar Apr 25 '24 14:04 wuxb45

I can reproduce this in 0.14.2. The change appears to have been reverted in #5461 the day after 0.14.2 was released, so it will likely be fixed in the next version, 0.14.3.

kno10 avatar May 02 '24 10:05 kno10

Hi @kno10 - can you confirm that things work if you build from master?

loadams avatar May 06 '24 16:05 loadams

I did not try. I downgraded to 0.14.0 to get things back running as quickly as possible.

kno10 avatar May 06 '24 18:05 kno10

@kno10 - makes sense, thanks. Just wanted to confirm this fixed the issue you were hitting.

loadams avatar May 07 '24 17:05 loadams

Closing with the same comment as https://github.com/microsoft/DeepSpeed/issues/5538.

jomayeri avatar May 22 '24 18:05 jomayeri