Mingxiaoli comments

Results: 4 comments by Mingxiaoli

When I try to run multi-GPU training with DeepSpeed, I get the error: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

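> No, it didn't. I tried other people's multi-GPU training code and hit the same error. At first I suspected an environment problem, so I rebuilt my Docker environment: Python 3.8, CUDA 11.0 plus NVIDIA's CUDA 11.6 compatibility package, and cupy 11.0. I reconfigured everything else as well, but the error is still there. With accelerate alone, training does run, and it is much faster than running without it, although most of the GPU memory usage lands on a single card. In principle, Trainer has deepspeed and accelerate built in, but configuring them the normal way in TrainingArguments has no effect and raises the same error.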

When I try to run multi-GPU training with DeepSpeed, I get the error: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

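@Mr-lonely0 I solved this problem, but the way I solved it is pure black magic... My model was originally downloaded to local disk with AutoModel. In a new container where the model had never been downloaded through AutoModel, I loaded the ChatGLM copy that I had already saved on a shared disk and ran multi-GPU DDP training, and this error was raised. I then pulled out the modeling_chatglm and tokenizer_chatglm source code, switched to the ChatGLM tokenizer and model classes, and ran again, which raised a dimension-mismatch error instead. But after those two steps, when I switched back to AutoModel and AutoTokenizer, the error was gone and DDP ran smoothly. I went through this whole procedure twice. It really is black magic...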
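For anyone hitting the same message, a small diagnostic may help narrow it down before trying the steps above. This is only a sketch, not the fix used in this thread: `devices_in_use` is a hypothetical helper name, and it assumes the model exposes `named_parameters()` and `named_buffers()` the way a `torch.nn.Module` does, with each tensor carrying a `.device` attribute.

```python
from collections import defaultdict


def devices_in_use(model):
    """Group parameter and buffer names by the device they live on.

    Hypothetical helper: relies only on named_parameters(),
    named_buffers() and a .device attribute on each tensor.
    """
    by_device = defaultdict(list)
    tensors = list(model.named_parameters()) + list(model.named_buffers())
    for name, tensor in tensors:
        by_device[str(tensor.device)].append(name)
    return dict(by_device)
```

Calling `devices_in_use(model)` in each process right after loading shows whether one process already holds weights on more than one device. In a DDP run one would normally expect a single key per process, so more than one key points at the loading step rather than at the training loop.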

How to reproduce the result of 59.8 on humaneval with your wizardcoder? The result I reproduced is only 47.

In fact, the prompts of humaneval-x and humaneval should be the same. I checked and found that my generation parameters still differ from yours, because I did not...

CentOS 7 distutils.errors.CompileError: command '/usr/bin/gcc' failed with exit code 1

I fixed it using the method from issue 627. I hope people who come later don't fall into the same pit.