
Single-machine multi-GPU training of chat_glm is broken

Open cxj01 opened this issue 1 year ago • 6 comments

Using the repository code, the model loads onto only one GPU even though the machine has two. If I manually assign layers to different GPUs, I get an error saying the tensors are not all on the same device.

cxj01 · Apr 20 '23 06:04

Pay attention to the model code — use the code I provided.

yuanzhoulvpi2017 · Apr 20 '23 06:04

@yuanzhoulvpi2017 I am using exactly the code from this repository; the only change is that I moved the last two layers onto the other GPU. [screenshots]

cxj01 · Apr 20 '23 06:04

layers.27, final_layernorm, and lm_head must all be on the same GPU. Change that.

yuanzhoulvpi2017 · Apr 20 '23 07:04
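A minimal sketch of the constraint described above, assuming the module names of the ChatGLM-6B checkpoint layout (`transformer.layers.*`, `transformer.final_layernorm`, `lm_head` — adjust to your model). The helper builds a `device_map` that splits the 28 transformer layers across two GPUs while keeping the last layer, the final layernorm, and the LM head together on GPU 1:

```python
# Hypothetical sketch: build a two-GPU device_map for a 28-layer
# ChatGLM-style model. The final_layernorm and lm_head consume the
# output of layers.27 directly, so all three must share one device.

def make_device_map(n_layers=28, split=14):
    """Place layers [0, split) on GPU 0 and the rest, plus the head, on GPU 1."""
    device_map = {"transformer.word_embeddings": 0}
    for i in range(n_layers):
        device_map[f"transformer.layers.{i}"] = 0 if i < split else 1
    # Must match the device of the last transformer layer (here: GPU 1):
    device_map["transformer.final_layernorm"] = 1
    device_map["lm_head"] = 1
    return device_map

# Usage (requires transformers and two GPUs):
# model = AutoModel.from_pretrained("THUDM/chatglm-6b",
#                                   trust_remote_code=True,
#                                   device_map=make_device_map())
```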

> layers.27, final_layernorm, and lm_head must all be on the same GPU. Change that.

I followed this repository's code exactly, but I get the same error as above: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument weight in method wrapper_CUDA__native_layer_norm)

YSLLYW · Apr 30 '23 07:04

> layers.27, final_layernorm, and lm_head must all be on the same GPU. Change that.

I followed this repository's code exactly, but I get the same error as above: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument weight in method wrapper_CUDA__native_layer_norm)

The sample doesn't run.

YSLLYW · Apr 30 '23 07:04

> layers.27, final_layernorm, and lm_head must all be on the same GPU. Change that.
>
> I followed this repository's code exactly, but I get the same error as above: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument weight in method wrapper_CUDA__native_layer_norm)

Locate the layer where it fails and call input.to() with the same device as that layer's weights; then it runs.

Ardang666 · Jul 01 '23 05:07
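The workaround above can be sketched with a PyTorch forward pre-hook (a hypothetical illustration, not code from this repository): the hook moves each incoming tensor onto the device of the module's own weights, so a layer placed on cuda:1 can accept activations arriving from cuda:0.

```python
# Hypothetical sketch: a forward pre-hook that moves inputs onto the
# device of the module's parameters before the forward pass runs.
import torch
import torch.nn as nn

def move_input_to_weight_device(module, inputs):
    target = next(module.parameters()).device  # device of this module's weights
    return tuple(
        t.to(target) if isinstance(t, torch.Tensor) else t for t in inputs
    )

# Toy LayerNorm standing in for the layer where the error occurs
# (e.g. transformer.final_layernorm):
norm = nn.LayerNorm(8)
norm.register_forward_pre_hook(move_input_to_weight_device)
out = norm(torch.randn(2, 8))  # on CPU this move is a no-op
```

On a real multi-GPU setup one would register this hook on the failing layer (or on every layer at a device boundary); the cleaner fix remains assigning the tail modules to a single device as yuanzhoulvpi2017 suggested.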