ChatGLM-Tuning 当我使用DeepSpeed试图多卡训练时，出现错误Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

Cupy Buffers Initialized Successfully. Pop out errors Finished the initialization step at rank 0 Pop out errors Finished the initialization step at rank 1 Using /root/.cache/torch_extensions/py310_cu116 as PyTorch extensions root... Using /root/.cache/torch_extensions/py310_cu116 as PyTorch extensions root... Emitting ninja build file /root/.cache/torch_extensions/py310_cu116/utils/build.ninja... Building extension module utils... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module utils... Time to load utils op: 0.07267284393310547 seconds Loading extension module utils... Time to load utils op: 0.1021263599395752 seconds ╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮ │ /apdcephfs_cq3/share_1567347/share_info/mingxiaoli/chatglm_finetune_test/finetune.py:180 in │ │ │ │ │ │ 177 │ │ 178 │ │ 179 if name == "main": │ │ ❱ 180 │ main() │ │ 181 │ │ │ │ /apdcephfs_cq3/share_1567347/share_info/mingxiaoli/chatglm_finetune_test/finetune.py:173 in main │ │ │ │ 170 │ │ # callbacks=[TensorBoardCallback(writer)], │ │ 171 │ │ data_collator=data_collator, │ │ 172 │ ) # 初始化训练器，用于训练模型 │ │ ❱ 173 │ trainer.train() # 训练模型 │ │ 174 │ # writer.close() # 关闭TensorBoard写入器 │ │ 175 │ # save model │ │ 176 │ model.save_pretrained(training_args.output_dir) # 保存微调后的模型到指定目录 │ │ │ │ /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:1662 in train │ │ │ │ 1659 │ │ inner_training_loop = find_executable_batch_size( │ │ 1660 │ │ │ self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size │ │ 1661 │ │ ) │ │ ❱ 1662 │ │ return inner_training_loop( │ │ 1663 │ │ │ args=args, │ │ 1664 │ │ │ resume_from_checkpoint=resume_from_checkpoint, │ │ 1665 │ │ │ trial=trial, │ │ │ │ /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:1929 in _inner_training_loop │ │ │ │ 1926 │ │ │ │ │ with model.no_sync(): │ │ 1927 │ │ │ │ │ │ tr_loss_step = self.training_step(model, inputs) │ │ 1928 │ │ │ │ else: │ │ ❱ 1929 │ │ │ │ │ tr_loss_step = self.training_step(model, inputs) │ │ 1930 │ │ │ │ │ │ 1931 │ │ │ │ if ( │ │ 1932 │ │ │ │ │ args.logging_nan_inf_filter │ │ │ │ /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:2699 in training_step │ │ │ │ 2696 │ │ │ return loss_mb.reduce_mean().detach().to(self.args.device) │ │ 2697 │ │ │ │ 2698 │ │ with self.compute_loss_context_manager(): │ │ ❱ 2699 │ │ │ loss = self.compute_loss(model, inputs) │ │ 2700 │ │ │ │ 2701 │ │ if self.args.n_gpu > 1: │ │ 2702 │ │ │ loss = loss.mean() # mean() to average on multi-gpu parallel training │ │ │ │ /apdcephfs_cq3/share_1567347/share_info/mingxiaoli/chatglm_finetune_test/finetune.py:86 in │ │ compute_loss │ │ │ │ 83 │ │ 84 class ModifiedTrainer(Trainer): │ │ 85 │ def compute_loss(self, model, inputs, return_outputs=False): │ │ ❱ 86 │ │ return model( │ │ 87 │ │ │ input_ids=inputs["input_ids"], │ │ 88 │ │ │ labels=inputs["labels"], │ │ 89 │ │ ).loss │ │ │ │ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1194 in _call_impl │ │ │ │ 1191 │ │ # this function, and just call forward. │ │ 1192 │ │ if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o │ │ 1193 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │ │ ❱ 1194 │ │ │ return forward_call(*input, **kwargs) │ │ 1195 │ │ # Do not call functions when jit is used │ │ 1196 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │ │ 1197 │ │ if self._backward_hooks or _global_backward_hooks: │ │ │ │ /opt/conda/lib/python3.10/site-packages/deepspeed/utils/nvtx.py:11 in wrapped_fn │ │ │ │ 8 │ function call.""" │ │ 9 │ def wrapped_fn(*args, **kwargs): │ │ 10 │ │ get_accelerator().range_push(func.qualname) │ │ ❱ 11 │ │ ret_val = func(*args, **kwargs) │ │ 12 │ │ get_accelerator().range_pop() │ │ 13 │ │ return ret_val │ │ 14 │ │ │ │ /opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py:1846 in forward │ │ │ │ 1843 │ │ if self.fp16_auto_cast(): │ │ 1844 │ │ │ inputs = self._cast_inputs_half(inputs) │ │ 1845 │ │ │ │ ❱ 1846 │ │ loss = self.module(*inputs, **kwargs) │ │ 1847 │ │ │ │ 1848 │ │ if self.zero_optimization_partition_weights(): │ │ 1849 │ │ │ # Disable automated discovery of external parameters │ │ │ │ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1194 in _call_impl │ │ │ │ 1191 │ │ # this function, and just call forward. │ │ 1192 │ │ if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o │ │ 1193 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │ │ ❱ 1194 │ │ │ return forward_call(*input, **kwargs) │ │ 1195 │ │ # Do not call functions when jit is used │ │ 1196 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │ │ 1197 │ │ if self._backward_hooks or _global_backward_hooks: │ │ │ │ /workspace/LLM-Adapters/peft/src/peft/peft_model.py:532 in forward │ │ │ │ 529 │ │ **kwargs, │ │ 530 │ ): │ │ 531 │ │ if not isinstance(self.peft_config, PromptLearningConfig): │ │ ❱ 532 │ │ │ return self.base_model( │ │ 533 │ │ │ │ input_ids=input_ids, │ │ 534 │ │ │ │ attention_mask=attention_mask, │ │ 535 │ │ │ │ inputs_embeds=inputs_embeds, │ │ │ │ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1194 in _call_impl │ │ │ │ 1191 │ │ # this function, and just call forward. │ │ 1192 │ │ if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o │ │ 1193 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │ │ ❱ 1194 │ │ │ return forward_call(*input, **kwargs) │ │ 1195 │ │ # Do not call functions when jit is used │ │ 1196 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │ │ 1197 │ │ if self._backward_hooks or _global_backward_hooks: │ │ │ │ /opt/conda/lib/python3.10/site-packages/accelerate/hooks.py:165 in new_forward │ │ │ │ 162 │ │ │ with torch.no_grad(): │ │ 163 │ │ │ │ output = old_forward(*args, **kwargs) │ │ 164 │ │ else: │ │ ❱ 165 │ │ │ output = old_forward(*args, **kwargs) │ │ 166 │ │ return module._hf_hook.post_forward(module, output) │ │ 167 │ │ │ 168 │ module.forward = new_forward │ │ │ │ /root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py:1160 in │ │ forward │ │ │ │ 1157 │ │ use_cache = use_cache if use_cache is not None else self.config.use_cache │ │ 1158 │ │ return_dict = return_dict if return_dict is not None else self.config.use_return │ │ 1159 │ │ │ │ ❱ 1160 │ │ transformer_outputs = self.transformer( │ │ 1161 │ │ │ input_ids=input_ids, │ │ 1162 │ │ │ position_ids=position_ids, │ │ 1163 │ │ │ attention_mask=attention_mask, │ │ │ │ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1194 in _call_impl │ │ │ │ 1191 │ │ # this function, and just call forward. │ │ 1192 │ │ if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o │ │ 1193 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │ │ ❱ 1194 │ │ │ return forward_call(*input, **kwargs) │ │ 1195 │ │ # Do not call functions when jit is used │ │ 1196 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │ │ 1197 │ │ if self._backward_hooks or _global_backward_hooks: │ │ │ │ /root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py:907 in │ │ forward │ │ │ │ 904 │ │ │ raise ValueError("You have to specify either input_ids or inputs_embeds") │ │ 905 │ │ │ │ 906 │ │ if inputs_embeds is None: │ │ ❱ 907 │ │ │ inputs_embeds = self.word_embeddings(input_ids) │ │ 908 │ │ │ │ 909 │ │ if past_key_values is None: │ │ 910 │ │ │ if self.pre_seq_len is not None: │ │ │ │ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1212 in _call_impl │ │ │ │ 1209 │ │ │ bw_hook = hooks.BackwardHook(self, full_backward_hooks) │ │ 1210 │ │ │ input = bw_hook.setup_input_hook(input) │ │ 1211 │ │ │ │ ❱ 1212 │ │ result = forward_call(input, **kwargs) │ │ 1213 │ │ if _global_forward_hooks or self._forward_hooks: │ │ 1214 │ │ │ for hook in (_global_forward_hooks.values(), *self.forward_hooks.values()) │ │ 1215 │ │ │ │ hook_result = hook(self, input, result) │ │ │ │ /opt/conda/lib/python3.10/site-packages/accelerate/hooks.py:165 in new_forward │ │ │ │ 162 │ │ │ with torch.no_grad(): │ │ 163 │ │ │ │ output = old_forward(*args, **kwargs) │ │ 164 │ │ else: │ │ ❱ 165 │ │ │ output = old_forward(*args, **kwargs) │ │ 166 │ │ return module.hf_hook.post_forward(module, output) │ │ 167 │ │ │ 168 │ module.forward = new_forward │ │ │ │ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/sparse.py:160 in forward │ │ │ │ 157 │ │ │ │ self.weight[self.padding_idx].fill(0) │ │ 158 │ │ │ 159 │ def forward(self, input: Tensor) -> Tensor: │ │ ❱ 160 │ │ return F.embedding( │ │ 161 │ │ │ input, self.weight, self.padding_idx, self.max_norm, │ │ 162 │ │ │ self.norm_type, self.scale_grad_by_freq, self.sparse) │ │ 163 │ │ │ │ /opt/conda/lib/python3.10/site-packages/torch/nn/functional.py:2210 in embedding │ │ │ │ 2207 │ │ # torch.embedding_renorm │ │ 2208 │ │ # remove once script supports set_grad_enabled │ │ 2209 │ │ no_grad_embedding_renorm(weight, input, max_norm, norm_type) │ │ ❱ 2210 │ return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse) │ │ 2211 │ │ 2212 │ │ 2213 def embedding_bag( │ ╰──────────────────────────────────────────────────────────────────────────────────────────────────╯ RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper__index_select) [2023-04-08 16:00:51,018] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 9721 [2023-04-08 16:00:51,398] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 9722 [2023-04-08 16:00:51,399] [ERROR] [launch.py:324:sigkill_handler] ['/opt/conda/bin/python', '-u', 'finetune.py', '--local_rank=1'] exits with return code = 1

Apr 08 '23 08:04 lmx760581375

我也遇到了这个问题，请问你解决了吗

Apr 09 '23 09:04 Mr-lonely0

并没有，我尝试了别人的多卡训练的代码也是同样的错误，我开始是怀疑可能是环境问题，于是我重新配置了docker环境是python3.8,cuda11.0+nvidia的cuda11.6的兼容包,然后cupy是11.0的，其他的环境都重新配置了一遍，但是仍然是有这个错误，但是只用accelerate是能跑起来的，比单纯不用要快很多，可是显存主要是一张用的比较多。按道理Trainer内置了deepspeed跟accelerate，但是正常配置在TrainerAugments中并不起作用也报的这个错误。

Apr 09 '23 11:04 lmx760581375

@Mr-lonely0 我解决了这个问题，但是解决的方式非常玄学。。。

我的model是用autoModel下载到本地的，于是当我在一个新的容器上没用使用automodel下载过的模型去加载我共享盘下好的chatglm进行ddp的多卡训练时候，就会报这个错误，但是当我把模型modeling_chatglm跟tokenizer_chatglm源代码拉出来，然后改成ChatGLM的tokenizer跟model的类型再跑一遍会报一个维度错误的error，但是当我进行了以上两步之后再改回去AutoModel跟AutoTokenizer，神奇的是居然不报错了，顺利的ddp跑起来了。。

这个流程我试了两遍，简直玄学。。。

Apr 10 '23 05:04 lmx760581375

@Mr-lonely0 我解决了这个问题，但是解决的方式非常玄学。。。

我的model是用autoModel下载到本地的，于是当我在一个新的容器上没用使用automodel下载过的模型去加载我共享盘下好的chatglm进行ddp的多卡训练时候，就会报这个错误，但是当我把模型modeling_chatglm跟tokenizer_chatglm源代码拉出来，然后改成ChatGLM的tokenizer跟model的类型再跑一遍会报一个维度错误的error，但是当我进行了以上两步之后再改回去AutoModel跟AutoTokenizer，神奇的是居然不报错了，顺利的ddp跑起来了。。

这个流程我试了两遍，简直玄学。。。

chatglm好像是不支持层并行的