[BUG]: layer norm error
🐛 Describe the bug
I am loading the model through the Hugging Face model interface and running glm-chinese-10b with the GPT-2 Gemini demo. After modifying the tensor_parallelize function and GLM's modeling_glm.py to adapt them to Colossal-AI, layer norm raises RuntimeError: Expected weight to be of same shape as normalized_shape, but got weight of shape [2048] and normalized_shape = [4096]. The launch script is as follows:
set -x
# distplan in ["CAI_ZeRO1", "CAI_ZeRO2", "CAI_Gemini", "Pytorch_DDP", "Pytorch_ZeRO"]
export DISTPLAN=${DISTPLAN:-"CAI_Gemini"}
# The following options are only valid when DISTPLAN="colossalai"
export GPUNUM=${GPUNUM:-4}
export TPDEGREE=${TPDEGREE:-2}
export PLACEMENT=${PLACEMENT:-"cpu"}
export USE_SHARD_INIT=${USE_SHARD_INIT:-False}
export BATCH_SIZE=${BATCH_SIZE:-8}
export MODEL_TYPE=${MODEL_TYPE:-"gpt2_medium"}
export TRAIN_STEP=${TRAIN_STEP:-10}
# export PYTHONPATH=$PWD:$PYTHONPATH
if [ ${USE_SHARD_INIT} = "True" ]; then
USE_SHARD_INIT="--shardinit"
else
USE_SHARD_INIT=""
fi
mkdir -p gemini_logs
torchrun --standalone --nproc_per_node=${GPUNUM} ./train_glm_demo.py \
--tp_degree=${TPDEGREE} \
--model_type=${MODEL_TYPE} \
--batch_size=${BATCH_SIZE} \
--placement=${PLACEMENT} \
${USE_SHARD_INIT} \
--distplan=${DISTPLAN} \
--train_step=${TRAIN_STEP} \
2>&1 | tee ./gemini_logs/${MODEL_TYPE}_${DISTPLAN}_gpu_${GPUNUM}_bs_${BATCH_SIZE}_tp_${TPDEGREE}_${PLACEMENT}.log
The traceback is as follows:
+ export DISTPLAN=CAI_Gemini
+ DISTPLAN=CAI_Gemini
+ export GPUNUM=4
+ GPUNUM=4
+ export TPDEGREE=2
+ TPDEGREE=2
+ export PLACEMENT=cpu
+ PLACEMENT=cpu
+ export USE_SHARD_INIT=False
+ USE_SHARD_INIT=False
+ export BATCH_SIZE=8
+ BATCH_SIZE=8
+ export MODEL_TYPE=gpt2_medium
+ MODEL_TYPE=gpt2_medium
+ export TRAIN_STEP=10
+ TRAIN_STEP=10
+ '[' False = True ']'
+ USE_SHARD_INIT=
+ mkdir -p gemini_logs
+ tee ./gemini_logs/gpt2_medium_CAI_Gemini_gpu_4_bs_8_tp_2_cpu.log
+ torchrun --standalone --nproc_per_node=4 ./train_glm_demo.py --tp_degree=2 --model_type=gpt2_medium --batch_size=8 --placement=cpu --distplan=CAI_Gemini --train_step=10
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
environmental variable OMP_NUM_THREADS is set to 20.
environmental variable OMP_NUM_THREADS is set to 20.
environmental variable OMP_NUM_THREADS is set to 20.
environmental variable OMP_NUM_THREADS is set to 20.
[03/01/23 10:39:28] INFO colossalai - colossalai - INFO: /home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device
[03/01/23 10:39:28] INFO colossalai - colossalai - INFO: /home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 3 is bound to device 3
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[03/01/23 10:39:28] INFO colossalai - colossalai - INFO: /home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 2 is bound to device 2
[03/01/23 10:39:28] INFO colossalai - colossalai - INFO: /home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 1 is bound to device 1
[03/01/23 10:39:31] INFO colossalai - colossalai - INFO: /home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 2, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024, the default parallel seed is ParallelMode.DATA.
[03/01/23 10:39:31] INFO colossalai - colossalai - INFO: /home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/context/parallel_context.py:557 set_seed
[03/01/23 10:39:31] INFO colossalai - colossalai - INFO: /home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024, the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: initialized seed on rank 3, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024, the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: /home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/initialize.py:116 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 4, pipeline parallel size: 1, tensor parallel size: 1
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
INFO colossalai - colossalai - INFO: /home/colossal-example/glm/./train_glm_demo.py:221 main
INFO colossalai - colossalai - INFO: gpt2_medium, CAI_Gemini, batch size 8
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
[03/01/23 10:39:31] INFO colossalai - colossalai - INFO: /home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 1, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024, the default parallel seed is ParallelMode.DATA.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
************************************************** load model finish **************************************************
************************************************** start ProcessGroup **************************************************
************************************************** load model finish **************************************************
************************************************** start ProcessGroup **************************************************
************************************************** load model finish **************************************************
************************************************** start ProcessGroup **************************************************
************************************************** load model finish **************************************************
************************************************** start ProcessGroup **************************************************
************************************************** start tensor parallel **************************************************
************************************************** start tensor parallel **************************************************
************************************************** start tensor parallel **************************************************
************************************************** start tensor parallel **************************************************
************************************************** build adam **************************************************
************************************************** build adam **************************************************
************************************************** build adam **************************************************
************************************************** build adam **************************************************
=========================================================================================
No pre-built kernel is found, build and load the cpu_adam kernel during runtime now
=========================================================================================
Emitting ninja build file /home/.cache/colossalai/torch_extensions/torch1.12_cu11.3/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/.cache/colossalai/torch_extensions/torch1.12_cu11.3/build.ninja...
Building extension module fused_optim...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_optim...
************************************************** zero model process **************************************************
Loading extension module cpu_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/.cache/colossalai/torch_extensions/torch1.12_cu11.3/build.ninja...
Building extension module fused_optim...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_optim...
************************************************** zero model process **************************************************
Loading extension module cpu_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/.cache/colossalai/torch_extensions/torch1.12_cu11.3/build.ninja...
Building extension module fused_optim...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_optim...
************************************************** zero model process **************************************************
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.284288167953491 seconds
=========================================================================================
No pre-built kernel is found, build and load the fused_optim kernel during runtime now
=========================================================================================
Detected CUDA files, patching ldflags
Emitting ninja build file /home/.cache/colossalai/torch_extensions/torch1.12_cu11.3/build.ninja...
Building extension module fused_optim...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_optim...
Time to load fused_optim op: 0.39287424087524414 seconds
************************************************** zero model process **************************************************
searching chunk configuration is completed in 3.11 s.
used number: 4711.35 MB, wasted number: 28.66 MB
total wasted percentage is 0.60%
************************************************** zero optim process **************************************************
************************************************** load pack parallel **************************************************
[03/01/23 10:41:16] INFO colossalai - colossalai - INFO: /home/colossal-example/glm/./train_glm_demo.py:307 main
INFO colossalai - colossalai - INFO: the size of testing model size is 4.9B.
load tokenizer
************************************************** zero optim process **************************************************
************************************************** load pack parallel **************************************************
[03/01/23 10:41:17] INFO colossalai - colossalai - INFO: /home/colossal-example/glm/./train_glm_demo.py:291 main
INFO colossalai - colossalai - INFO: After init optim, GPU memory usage: 0.00 MB, CPU memory usage: 29621.83 MB
************************************************** zero optim process **************************************************
INFO colossalai - colossalai - INFO: /home/colossal-example/glm/./train_glm_demo.py:307 main
INFO colossalai - colossalai - INFO: the size of testing model size is 4.9B.
INFO colossalai - colossalai - INFO: /home/colossal-example/glm/./train_glm_demo.py:308 main
INFO colossalai - colossalai - INFO: After init model, GPU memory usage: 0.00 MB, CPU memory usage: 29621.83 MB
load tokenizer
************************************************** zero optim process **************************************************
************************************************** load pack parallel **************************************************
[03/01/23 10:41:17] INFO colossalai - colossalai - INFO: /home/colossal-example/glm/./train_glm_demo.py:307 main
INFO colossalai - colossalai - INFO: the size of testing model size is 4.9B.
load tokenizer
************************************************** load pack parallel **************************************************
[03/01/23 10:41:17] INFO colossalai - colossalai - INFO: /home/colossal-example/glm/./train_glm_demo.py:307 main
INFO colossalai - colossalai - INFO: the size of testing model size is 4.9B.
load tokenizer
(The same traceback is raised on all four ranks; one copy is shown.)
Traceback (most recent call last):
  File "/home/colossal-example/glm/./train_glm_demo.py", line 427, in <module>
    main()
  File "/home/colossal-example/glm/./train_glm_demo.py", line 379, in main
    train_step(batch)
  File "/home/colossal-example/glm/./train_glm_demo.py", line 331, in train_step
    outputs = model(**inputs)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/nn/parallel/data_parallel.py", line 282, in forward
    outputs = self.module(*args, **kwargs)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/colossal-example/glm/modeling_glm.py", line 904, in forward
    model_output = self.glm.forward(input_ids, position_ids, attention_mask, mems=mems, **kwargs)
  File "/home/colossal-example/glm/modeling_glm.py", line 793, in forward
    transformer_output = self.transformer(embeddings, position_ids, attention_mask, mems)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/colossal-example/glm/modeling_glm.py", line 605, in forward
    hidden_states = layer(*args, mem=mem_i)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/colossal-example/glm/modeling_glm.py", line 425, in forward
    layernorm_output = self.input_layernorm(hidden_states)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/modules/normalization.py", line 189, in forward
    return F.layer_norm(
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/functional.py", line 2500, in layer_norm
    return handle_torch_function(
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/overrides.py", line 1498, in handle_torch_function
    result = torch_func_method(public_api, types, args, kwargs)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/tensor/colo_parameter.py", line 87, in __torch_function__
    ret = super().__torch_function__(func, types, args, kwargs)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/tensor/colo_tensor.py", line 184, in __torch_function__
    ret = func(*args, **kwargs)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/nn/_ops/layernorm.py", line 21, in colo_layernorm
    output = F.layer_norm(input_tensor, normalized_shape, weight=weight, bias=bias, eps=eps)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/functional.py", line 2503, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: Expected weight to be of same shape as normalized_shape, but got weight of shape [2048] and normalized_shape = [4096]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 137265 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 137267 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 137268 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 137266) of binary: /home/.conda/envs/torch-cuda113/bin/python
Traceback (most recent call last):
File "/home/.conda/envs/torch-cuda113/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.12.0', 'console_scripts', 'torchrun')())
File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./train_glm_demo.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-01_10:41:24
host : gzailab-liuzixi01-colossalai2-0
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 137266)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
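For reference, the failing check itself can be reproduced outside Colossal-AI. The sketch below is only an illustration of the error message (plain PyTorch, hypothetical shapes): it assumes GLM's hidden size of 4096 and a LayerNorm weight that has been cut in half by a 1D tensor-parallel split, which is exactly the [2048] vs [4096] mismatch reported above.

import torch
import torch.nn.functional as F

hidden = 4096                    # GLM-10B hidden size
x = torch.randn(2, 16, hidden)   # dummy hidden states

# Hypothetical LayerNorm affine parameters after a 1D shard across tp_degree=2:
# only half of the 4096 elements live on this rank.
sharded_weight = torch.ones(hidden // 2)
sharded_bias = torch.zeros(hidden // 2)

# Raises:
# RuntimeError: Expected weight to be of same shape as normalized_shape,
# but got weight of shape [2048] and normalized_shape = [4096]
F.layer_norm(x, (hidden,), weight=sharded_weight, bias=sharded_bias, eps=1e-5)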
Environment
Conda virtual environment, Python 3.9.13, PyTorch 1.12 + CUDA 11.3, 4x NVIDIA A30 GPUs.
The modified tensor_parallelize function is as follows:
def tensor_parallelize(model: torch.nn.Module, pg: ProcessGroup):
    """tensor_parallelize
    Sharding the Model Parameters.
    Args:
        model (torch.nn.Module): a torch module to be sharded
    """
    for mn, module in model.named_modules():
        if mn == '':
            continue
        for pn, param in module.named_parameters(recurse=False):
            # NOTE() a param may be shared by two modules
            if hasattr(param, 'visited'):
                continue
            # print('*' * 50, mn + '--' + pn, '*' * 50)
            # if shard init, then convert param to replica and use the dp-only ProcessGroup
            param: ColoParameter = param
            param.set_dist_spec(ReplicaSpec())
            param.set_process_group(pg)
            # shard it w.r.t. the tp pattern
            if 'mlp.dense_4h_to_h' in mn:
                if 'weight' in pn or 'bias' in pn:
                    split_param_col_tp1d(param, pg)    # column slice
                    # keep the shape of the output from c_fc
                    param.compute_spec.set_output_replicate(False)
                else:
                    param.set_dist_spec(ReplicaSpec())
            elif 'mlp.dense_h_to_4h' in mn:
                if 'weight' in pn:
                    split_param_row_tp1d(param, pg)    # row slice
                else:
                    param.set_dist_spec(ReplicaSpec())
            elif 'word_embeddings' in mn:
                if 'weight' in pn:
                    split_param_row_tp1d(param, pg)    # row slice
            elif 'position_embeddings' in mn:
                if 'weight' in pn:
                    split_param_col_tp1d(param, pg)    # column slice
            elif 'query_key_value':
                split_param_row_tp1d(param, pg)    # row slice
            else:
                param.set_dist_spec(ReplicaSpec())
            param.visited = True
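For completeness, the split helpers called above (split_param_row_tp1d, split_param_col_tp1d) are not shown in this snippet; in the GPT-2 Gemini demo they are thin wrappers around a 1D shard spec. A sketch, assuming they were taken over unchanged from that demo:

from colossalai.tensor import ComputePattern, ComputeSpec, ProcessGroup, ShardSpec

def split_param_single_dim_tp1d(dim, param, pg):
    # shard the parameter along `dim` across the tensor-parallel ranks of `pg`
    spec = (ShardSpec([dim], [pg.tp_world_size()]), ComputeSpec(ComputePattern.TP1D))
    param.set_tensor_spec(*spec)

def split_param_row_tp1d(param, pg):
    split_param_single_dim_tp1d(0, param, pg)

def split_param_col_tp1d(param, pg):
    split_param_single_dim_tp1d(-1, param, pg)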
@1SAA Could you take a look?
I load my model in this manner:
with ColoInitContext(device=get_current_device(),
                     dtype=torch.half,
                     default_dist_spec=default_dist_spec,
                     default_pg=shard_pg):
    # model = model_builder(args.model_type)(checkpoint=True)
    model = GLMForConditionalGeneration.from_pretrained('THUDM/glm-10b-chinese', trust_remote_code=True)
tp_pg = ProcessGroup(tp_degree=2)
tensor_parallelize(model, tp_pg)
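default_dist_spec and shard_pg are not defined in the snippet above; in the GPT-2 demo they are derived from the --shardinit flag. A sketch of that part, assuming it was kept unchanged from the demo (use_shard_init stands in for the parsed CLI flag):

import torch
from colossalai.tensor import ProcessGroup, ShardSpec

use_shard_init = True  # corresponds to --shardinit
world_size = torch.distributed.get_world_size()
# with --shardinit, parameters are sharded along the last dim at init time
shard_pg = ProcessGroup(tp_degree=world_size) if use_shard_init else None
default_dist_spec = ShardSpec([-1], [world_size]) if use_shard_init else None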
Hi @zixiliuUSC
Could you try `USE_SHARD_INIT: True, TP_DEGREE=1` first and provide a whole file to run your example code?
I have uploaded a repo modified from the gpt2 example: https://github.com/zixiliuUSC/colosssal-glm-demo
With `USE_SHARD_INIT: False, TP_DEGREE=1`, the code runs normally on two A100 80GB GPUs.
Hi @zixiliuUSC
Could you try `USE_SHARD_INIT: True, TP_DEGREE=1` first and provide a whole file to run your example code?

With `USE_SHARD_INIT: True, TP_DEGREE=1`, the program breaks while loading the model. Part of the traceback is below:
Traceback (most recent call last):
File "/home/liuzixi01/colossal-example/glm/./train_glm_demo.py", line 434, in <module>
main()
File "/home/liuzixi01/colossal-example/glm/./train_glm_demo.py", line 245, in main
model = GLMForConditionalGeneration.from_pretrained('THUDM/glm-10b-chinese', trust_remote_code=True)#('/home/liuzixi01/.cache/huggingface/hub/models--BAAI--glm-10b-chinese/snapshots/4e46a55c50884f3df62cb3b550d1b10d4723228e/' , trust_remote_code=True)
File "/home/liuzixi01/.conda/envs/torch-cuda113/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2478, in from_pretrained
) = cls._load_pretrained_model(
File "/home/liuzixi01/.conda/envs/torch-cuda113/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2844, in _load_pretrained_model
raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for GLMForConditionalGeneration:
size mismatch for word_embeddings.weight: copying a param with shape torch.Size([50048, 4096]) from checkpoint, the shape in current model is torch.Size([50048, 1024]).
size mismatch for transformer.position_embeddings.weight: copying a param with shape torch.Size([1025, 4096]) from checkpoint, the shape in current model is torch.Size([1025, 1024]).
size mismatch for transformer.block_position_embeddings.weight: copying a param with shape torch.Size([1025, 4096]) from checkpoint, the shape in current model is torch.Size([1025, 1024]).
size mismatch for transformer.layers.0.input_layernorm.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for transformer.layers.0.input_layernorm.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([1024]).
We have updated a lot. Please check the latest code.

In addition, you changed the code to another model, so your issue is out of the scope of the general support of the open source community. If you need customized in-depth cooperation or support, please send the details to [email protected].

This issue was closed due to inactivity. Thanks.