ColossalAI [BUG]: layer norm error

🐛 Describe the bug

使用huggingface的模型接口，用gpt2的gemini demo跑glm-chinese-10b，修改了tensor_parallel函数和glm的modeling_glm.py来适配colossal，发现layer norm会报RuntimeError: Expected weight to be of same shape as normalized_shape, but got weight of shape [2048] and normalized_shape = [4096] 运行脚本如下：

set -x
# distplan in ["CAI_ZeRO1", "CAI_ZeRO2", "CAI_Gemini", "Pytorch_DDP", "Pytorch_ZeRO"]
export DISTPLAN=${DISTPLAN:-"CAI_Gemini"}

# The following options only valid when DISTPLAN="colossalai"
export GPUNUM=${GPUNUM:-4}
export TPDEGREE=${TPDEGREE:-2}
export PLACEMENT=${PLACEMENT:-"cpu"}
export USE_SHARD_INIT=${USE_SHARD_INIT:-False}
export BATCH_SIZE=${BATCH_SIZE:-8}
export MODEL_TYPE=${MODEL_TYPE:-"gpt2_medium"}
export TRAIN_STEP=${TRAIN_STEP:-10}
# export PYTHONPATH=$PWD:$PYTHONPATH

if [ ${USE_SHARD_INIT} = "True" ]; then
  USE_SHARD_INIT="--shardinit"
else
  USE_SHARD_INIT=""
fi

mkdir -p gemini_logs

torchrun --standalone --nproc_per_node=${GPUNUM} ./train_glm_demo.py \
--tp_degree=${TPDEGREE} \
--model_type=${MODEL_TYPE} \
--batch_size=${BATCH_SIZE} \
--placement=${PLACEMENT} \
${USE_SHARD_INIT} \
--distplan=${DISTPLAN} \
--train_step=${TRAIN_STEP} \
2>&1 | tee ./gemini_logs/${MODEL_TYPE}_${DISTPLAN}_gpu_${GPUNUM}_bs_${BATCH_SIZE}_tp_${TPDEGREE}_${PLACEMENT}.log

traceback如下：


+ export DISTPLAN=CAI_Gemini
+ DISTPLAN=CAI_Gemini
+ export GPUNUM=4
+ GPUNUM=4
+ export TPDEGREE=2
+ TPDEGREE=2
+ export PLACEMENT=cpu
+ PLACEMENT=cpu
+ export USE_SHARD_INIT=False
+ USE_SHARD_INIT=False
+ export BATCH_SIZE=8
+ BATCH_SIZE=8
+ export MODEL_TYPE=gpt2_medium
+ MODEL_TYPE=gpt2_medium
+ export TRAIN_STEP=10
+ TRAIN_STEP=10
+ '[' False = True ']'
+ USE_SHARD_INIT=
+ mkdir -p gemini_logs
+ tee ./gemini_logs/gpt2_medium_CAI_Gemini_gpu_4_bs_8_tp_2_cpu.log
+ torchrun --standalone --nproc_per_node=4 ./train_glm_demo.py --tp_degree=2 --model_type=gpt2_medium --batch_size=8 --placement=cpu --distplan=CAI_Gemini --train_step=10
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
environmental variable OMP_NUM_THREADS is set to 20.
environmental variable OMP_NUM_THREADS is set to 20.
environmental variable OMP_NUM_THREADS is set to 20.
environmental variable OMP_NUM_THREADS is set to 20.
[03/01/23 10:39:28] INFO     colossalai - colossalai - INFO:                                                                               
                             /home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/context/parallel_context.py:5
                             21 set_device                                                                                                 
[03/01/23 10:39:28] INFO     colossalai - colossalai - INFO:                                                                               
                             /home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/context/parallel_context.py:5
                             21 set_device                                                                                                 
                    INFO     colossalai - colossalai - INFO: process rank 3 is bound to device 3                                           
                    INFO     colossalai - colossalai - INFO: process rank 0 is bound to device 0                                           
[03/01/23 10:39:28] INFO     colossalai - colossalai - INFO:                                                                               
                             /home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/context/parallel_context.py:5
                             21 set_device                                                                                                 
                    INFO     colossalai - colossalai - INFO: process rank 2 is bound to device 2                                           
[03/01/23 10:39:28] INFO     colossalai - colossalai - INFO:                                                                               
                             /home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/context/parallel_context.py:5
                             21 set_device                                                                                                 
                    INFO     colossalai - colossalai - INFO: process rank 1 is bound to device 1                                           
[03/01/23 10:39:31] INFO     colossalai - colossalai - INFO:                                                                               
                             /home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/context/parallel_context.py:5
                             57 set_seed                                                                                                   
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 2, numpy: 1024, python random: 1024,                 
                             ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is ParallelMode.DATA.            
[03/01/23 10:39:31] INFO     colossalai - colossalai - INFO:                                                                               
                             /home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/context/parallel_context.py:5
                             57 set_seed                                                                                                   
[03/01/23 10:39:31] INFO     colossalai - colossalai - INFO:                                                                               
                             /home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/context/parallel_context.py:5
                             57 set_seed                                                                                                   
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024,                 
                             ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is ParallelMode.DATA.            
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 3, numpy: 1024, python random: 1024,                 
                             ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is ParallelMode.DATA.            
                    INFO     colossalai - colossalai - INFO:                                                                               
                             /home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/initialize.py:116 launch     
                    INFO     colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 4, pipeline       
                             parallel size: 1, tensor parallel size: 1                                                                     
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
                    INFO     colossalai - colossalai - INFO: /home/colossal-example/glm/./train_glm_demo.py:221 main             
                    INFO     colossalai - colossalai - INFO: gpt2_medium, CAI_Gemini, batch size 8                                         
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
[03/01/23 10:39:31] INFO     colossalai - colossalai - INFO:                                                                               
                             /home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/context/parallel_context.py:5
                             57 set_seed                                                                                                   
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 1, numpy: 1024, python random: 1024,                 
                             ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is ParallelMode.DATA.            
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
************************************************** load model finish **************************************************
************************************************** start ProcessGroup **************************************************
************************************************** load model finish **************************************************
************************************************** start ProcessGroup **************************************************
************************************************** load model finish **************************************************
************************************************** start ProcessGroup **************************************************
************************************************** load model finish **************************************************
************************************************** start ProcessGroup **************************************************
************************************************** start tensor parallel **************************************************
************************************************** start tensor parallel **************************************************
************************************************** start tensor parallel **************************************************
************************************************** start tensor parallel **************************************************
************************************************** build adam **************************************************
************************************************** build adam **************************************************
************************************************** build adam **************************************************
************************************************** build adam **************************************************
=========================================================================================
No pre-built kernel is found, build and load the cpu_adam kernel during runtime now
=========================================================================================
Emitting ninja build file /home/.cache/colossalai/torch_extensions/torch1.12_cu11.3/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/.cache/colossalai/torch_extensions/torch1.12_cu11.3/build.ninja...
Building extension module fused_optim...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_optim...
************************************************** zero model process **************************************************
Loading extension module cpu_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/.cache/colossalai/torch_extensions/torch1.12_cu11.3/build.ninja...
Building extension module fused_optim...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_optim...
************************************************** zero model process **************************************************
Loading extension module cpu_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/.cache/colossalai/torch_extensions/torch1.12_cu11.3/build.ninja...
Building extension module fused_optim...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_optim...
************************************************** zero model process **************************************************
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.284288167953491 seconds
=========================================================================================
No pre-built kernel is found, build and load the fused_optim kernel during runtime now
=========================================================================================
Detected CUDA files, patching ldflags
Emitting ninja build file /home/.cache/colossalai/torch_extensions/torch1.12_cu11.3/build.ninja...
Building extension module fused_optim...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_optim...
Time to load fused_optim op: 0.39287424087524414 seconds
************************************************** zero model process **************************************************
searching chunk configuration is completed in 3.11 s.
used number: 4711.35 MB, wasted number: 28.66 MB
total wasted percentage is 0.60%
************************************************** zero optim process **************************************************
************************************************** load pack parallel **************************************************
[03/01/23 10:41:16] INFO     colossalai - colossalai - INFO: /home/colossal-example/glm/./train_glm_demo.py:307 main             
                    INFO     colossalai - colossalai - INFO: the size of testing model size is 4.9B.                                       
load tokenizer
************************************************** zero optim process **************************************************
************************************************** load pack parallel **************************************************
[03/01/23 10:41:17] INFO     colossalai - colossalai - INFO: /home/colossal-example/glm/./train_glm_demo.py:291 main             
                    INFO     colossalai - colossalai - INFO: After init optim, GPU memory usage: 0.00 MB, CPU memory usage: 29621.83 MB    
************************************************** zero optim process **************************************************
                    INFO     colossalai - colossalai - INFO: /home/colossal-example/glm/./train_glm_demo.py:307 main             
                    INFO     colossalai - colossalai - INFO: the size of testing model size is 4.9B.                                       
                    INFO     colossalai - colossalai - INFO: /home/colossal-example/glm/./train_glm_demo.py:308 main             
                    INFO     colossalai - colossalai - INFO: After init model, GPU memory usage: 0.00 MB, CPU memory usage: 29621.83 MB    
load tokenizer
************************************************** zero optim process **************************************************
************************************************** load pack parallel **************************************************
[03/01/23 10:41:17] INFO     colossalai - colossalai - INFO: /home/colossal-example/glm/./train_glm_demo.py:307 main             
                    INFO     colossalai - colossalai - INFO: the size of testing model size is 4.9B.                                       
load tokenizer
************************************************** load pack parallel **************************************************
[03/01/23 10:41:17] INFO     colossalai - colossalai - INFO: /home/colossal-example/glm/./train_glm_demo.py:307 main             
                    INFO     colossalai - colossalai - INFO: the size of testing model size is 4.9B.                                       
load tokenizer
Traceback (most recent call last):
  File "/home/colossal-example/glm/./train_glm_demo.py", line 427, in <module>
    main()
  File "/home/colossal-example/glm/./train_glm_demo.py", line 379, in main
    train_step(batch)
  File "/home/colossal-example/glm/./train_glm_demo.py", line 331, in train_step
    outputs = model(**inputs)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
Traceback (most recent call last):
  File "/home/colossal-example/glm/./train_glm_demo.py", line 427, in <module>
Traceback (most recent call last):
  File "/home/colossal-example/glm/./train_glm_demo.py", line 427, in <module>
Traceback (most recent call last):
  File "/home/colossal-example/glm/./train_glm_demo.py", line 427, in <module>
    main()
  File "/home/colossal-example/glm/./train_glm_demo.py", line 379, in main
    main()
      File "/home/colossal-example/glm/./train_glm_demo.py", line 379, in main
main()
  File "/home/colossal-example/glm/./train_glm_demo.py", line 379, in main
    train_step(batch)
  File "/home/colossal-example/glm/./train_glm_demo.py", line 331, in train_step
    train_step(batch)
  File "/home/colossal-example/glm/./train_glm_demo.py", line 331, in train_step
    outputs = model(**inputs)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    outputs = model(**inputs)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    train_step(batch)
  File "/home/colossal-example/glm/./train_glm_demo.py", line 331, in train_step
    outputs = model(**inputs)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/nn/parallel/data_parallel.py", line 282, in forward
    return forward_call(*input, **kwargs)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/nn/parallel/data_parallel.py", line 282, in forward
    return forward_call(*input, **kwargs)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/nn/parallel/data_parallel.py", line 282, in forward
    return forward_call(*input, **kwargs)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/nn/parallel/data_parallel.py", line 282, in forward
    outputs = self.module(*args, **kwargs)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    outputs = self.module(*args, **kwargs)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/colossal-example/glm/modeling_glm.py", line 904, in forward
    model_output = self.glm.forward(input_ids, position_ids, attention_mask, mems=mems, **kwargs)
  File "/home/colossal-example/glm/modeling_glm.py", line 793, in forward
    return forward_call(*input, **kwargs)
  File "/home/colossal-example/glm/modeling_glm.py", line 904, in forward
    transformer_output = self.transformer(embeddings, position_ids, attention_mask, mems)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/colossal-example/glm/modeling_glm.py", line 605, in forward
    model_output = self.glm.forward(input_ids, position_ids, attention_mask, mems=mems, **kwargs)
    hidden_states = layer(*args, mem=mem_i)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
  File "/home/colossal-example/glm/modeling_glm.py", line 793, in forward
    return forward_call(*input, **kwargs)
  File "/home/colossal-example/glm/modeling_glm.py", line 425, in forward
    layernorm_output = self.input_layernorm(hidden_states)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    transformer_output = self.transformer(embeddings, position_ids, attention_mask, mems)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/modules/normalization.py", line 189, in forward
    return forward_call(*input, **kwargs)
  File "/home/colossal-example/glm/modeling_glm.py", line 605, in forward
    hidden_states = layer(*args, mem=mem_i)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/colossal-example/glm/modeling_glm.py", line 425, in forward
    layernorm_output = self.input_layernorm(hidden_states)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/modules/normalization.py", line 189, in forward
    outputs = self.module(*args, **kwargs)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/colossal-example/glm/modeling_glm.py", line 904, in forward
    model_output = self.glm.forward(input_ids, position_ids, attention_mask, mems=mems, **kwargs)
  File "/home/colossal-example/glm/modeling_glm.py", line 793, in forward
    transformer_output = self.transformer(embeddings, position_ids, attention_mask, mems)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/colossal-example/glm/modeling_glm.py", line 605, in forward
    hidden_states = layer(*args, mem=mem_i)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    outputs = self.module(*args, **kwargs)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/colossal-example/glm/modeling_glm.py", line 425, in forward
    layernorm_output = self.input_layernorm(hidden_states)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/colossal-example/glm/modeling_glm.py", line 904, in forward
    return forward_call(*input, **kwargs)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/modules/normalization.py", line 189, in forward
    model_output = self.glm.forward(input_ids, position_ids, attention_mask, mems=mems, **kwargs)
  File "/home/colossal-example/glm/modeling_glm.py", line 793, in forward
    transformer_output = self.transformer(embeddings, position_ids, attention_mask, mems)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/colossal-example/glm/modeling_glm.py", line 605, in forward
    hidden_states = layer(*args, mem=mem_i)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/colossal-example/glm/modeling_glm.py", line 425, in forward
    layernorm_output = self.input_layernorm(hidden_states)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/modules/normalization.py", line 189, in forward
    return F.layer_norm(    
return F.layer_norm(
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/functional.py", line 2500, in layer_norm
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/functional.py", line 2500, in layer_norm
    return F.layer_norm(
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/functional.py", line 2500, in layer_norm
    return F.layer_norm(
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/functional.py", line 2500, in layer_norm
        return handle_torch_function(
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/overrides.py", line 1498, in handle_torch_function
return handle_torch_function(
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/overrides.py", line 1498, in handle_torch_function
    return handle_torch_function(
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/overrides.py", line 1498, in handle_torch_function
    return handle_torch_function(
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/overrides.py", line 1498, in handle_torch_function
    result = torch_func_method(public_api, types, args, kwargs)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/tensor/colo_parameter.py", line 87, in __torch_function__
    result = torch_func_method(public_api, types, args, kwargs)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/tensor/colo_parameter.py", line 87, in __torch_function__
    result = torch_func_method(public_api, types, args, kwargs)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/tensor/colo_parameter.py", line 87, in __torch_function__
    ret = super().__torch_function__(func, types, args, kwargs)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/tensor/colo_tensor.py", line 184, in __torch_function__
    ret = func(*args, **kwargs)
          File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/nn/_ops/layernorm.py", line 21, in colo_layernorm
ret = super().__torch_function__(func, types, args, kwargs)ret = super().__torch_function__(func, types, args, kwargs)

  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/tensor/colo_tensor.py", line 184, in __torch_function__
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/tensor/colo_tensor.py", line 184, in __torch_function__
        ret = func(*args, **kwargs)
ret = func(*args, **kwargs)  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/nn/_ops/layernorm.py", line 21, in colo_layernorm

  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/nn/_ops/layernorm.py", line 21, in colo_layernorm
    result = torch_func_method(public_api, types, args, kwargs)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/tensor/colo_parameter.py", line 87, in __torch_function__
    ret = super().__torch_function__(func, types, args, kwargs)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/tensor/colo_tensor.py", line 184, in __torch_function__
    ret = func(*args, **kwargs)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/colossalai/nn/_ops/layernorm.py", line 21, in colo_layernorm
    output = F.layer_norm(input_tensor, normalized_shape, weight=weight, bias=bias, eps=eps)
      File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/functional.py", line 2503, in layer_norm
output = F.layer_norm(input_tensor, normalized_shape, weight=weight, bias=bias, eps=eps)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/functional.py", line 2503, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: Expected weight to be of same shape as normalized_shape, but got weight of shape [2048] and normalized_shape = [4096]
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: Expected weight to be of same shape as normalized_shape, but got weight of shape [2048] and normalized_shape = [4096]
    output = F.layer_norm(input_tensor, normalized_shape, weight=weight, bias=bias, eps=eps)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/functional.py", line 2503, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: Expected weight to be of same shape as normalized_shape, but got weight of shape [2048] and normalized_shape = [4096]
    output = F.layer_norm(input_tensor, normalized_shape, weight=weight, bias=bias, eps=eps)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/nn/functional.py", line 2503, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: Expected weight to be of same shape as normalized_shape, but got weight of shape [2048] and normalized_shape = [4096]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 137265 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 137267 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 137268 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 137266) of binary: /home/.conda/envs/torch-cuda113/bin/python
Traceback (most recent call last):
  File "/home/.conda/envs/torch-cuda113/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.0', 'console_scripts', 'torchrun')())
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/.conda/envs/torch-cuda113/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
./train_glm_demo.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-01_10:41:24
  host      : gzailab-liuzixi01-colossalai2-0
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 137266)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Environment

conda虚拟环境，python=3.9.13，pytorch=1.12+cuda11.3，显卡Nvidia A30*4

Mar 01 '23 03:03 zixiliuUSC

tensor_parallelize修改如下：

def tensor_parallelize(model: torch.nn.Module, pg: ProcessGroup):
    """tensor_parallelize
    Sharding the Model Parameters.
    Args:
        model (torch.nn.Module): a torch module to be sharded
    """
    for mn, module in model.named_modules():
        if mn=='':
            continue
        for pn, param in module.named_parameters(recurse=False):
            # NOTE() a param maybe shared by two modules
            if hasattr(param, 'visited'):
                continue
            # print('*'*50, mn+'--'+pn, '*'*50)
            # if shard init, then convert param to replica and use the dp-only ProcessGroup
            param: ColoParameter = param
            param.set_dist_spec(ReplicaSpec())
            param.set_process_group(pg)

            # shard it w.r.t tp pattern
            if 'mlp.dense_4h_to_h' in mn: 
                if 'weight' in pn or 'bias' in pn:
                    split_param_col_tp1d(param, pg)    # colmn slice
                    # keep the shape of the output from c_fc
                    param.compute_spec.set_output_replicate(False)
                else:
                    param.set_dist_spec(ReplicaSpec())
            elif 'mlp.dense_h_to_4h' in mn:
                if 'weight' in pn:
                    split_param_row_tp1d(param, pg)    # row slice
                else:
                    param.set_dist_spec(ReplicaSpec())
            elif 'word_embeddings' in mn: 
                if 'weight' in pn:
                    split_param_row_tp1d(param, pg)    # colmn slice
            elif 'position_embeddings' in mn:
                if 'weight' in pn:
                    split_param_col_tp1d(param, pg)    # colmn slice
            elif 'query_key_value':
                split_param_row_tp1d(param, pg)    # colmn slice
            else:
                param.set_dist_spec(ReplicaSpec())
            param.visited = True

Mar 01 '23 03:03 zixiliuUSC

@1SAA Could you take a look?

Mar 01 '23 03:03 kurisusnowdeng

I load my model in this manner

with ColoInitContext(device=get_current_device(),
                             dtype=torch.half,
                             default_dist_spec=default_dist_spec,
                             default_pg=shard_pg):
    # model = model_builder(args.model_type)(checkpoint=True)
    model = GLMForConditionalGeneration.from_pretrained('THUDM/glm-10b-chinese', trust_remote_code=True)
    tp_pg = ProcessGroup(tp_degree=2)
    tensor_parallelize(model, tp_pg)

Mar 01 '23 06:03 zixiliuUSC

Hi @zixiliuUSC

Could you try `USE_SHARD_INIT: True, TP_DEGREE=1' first and provide a whole file to run your example code?

Mar 03 '23 09:03 1SAA

I upload a repo which is modified from gpt2 example. https://github.com/zixiliuUSC/colosssal-glm-demo

Mar 07 '23 08:03 zixiliuUSC

USE_SHARD_INIT with USE_SHARD_INIT: False, TP_DEGREE=1, the code can be run normally in two A100 80GB.

Mar 07 '23 08:03 zixiliuUSC

Hi @zixiliuUSC

Could you try `USE_SHARD_INIT: True, TP_DEGREE=1' first and provide a whole file to run your example code? the program got broken when loading model. Part of traceback are below

Traceback (most recent call last):
  File "/home/liuzixi01/colossal-example/glm/./train_glm_demo.py", line 434, in <module>
    main()
  File "/home/liuzixi01/colossal-example/glm/./train_glm_demo.py", line 245, in main
    model = GLMForConditionalGeneration.from_pretrained('THUDM/glm-10b-chinese', trust_remote_code=True)#('/home/liuzixi01/.cache/huggingface/hub/models--BAAI--glm-10b-chinese/snapshots/4e46a55c50884f3df62cb3b550d1b10d4723228e/' , trust_remote_code=True)
  File "/home/liuzixi01/.conda/envs/torch-cuda113/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2478, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/liuzixi01/.conda/envs/torch-cuda113/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2844, in _load_pretrained_model
    raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for GLMForConditionalGeneration:
	size mismatch for word_embeddings.weight: copying a param with shape torch.Size([50048, 4096]) from checkpoint, the shape in current model is torch.Size([50048, 1024]).
	size mismatch for transformer.position_embeddings.weight: copying a param with shape torch.Size([1025, 4096]) from checkpoint, the shape in current model is torch.Size([1025, 1024]).
	size mismatch for transformer.block_position_embeddings.weight: copying a param with shape torch.Size([1025, 4096]) from checkpoint, the shape in current model is torch.Size([1025, 1024]).
	size mismatch for transformer.layers.0.input_layernorm.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for transformer.layers.0.input_layernorm.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([1024]).

Mar 07 '23 10:03 zixiliuUSC

We have updated a lot. Please check the latest code. In addition, you changed the code to another model, so your issue is out of the scope of the general support of the open source community. If you need customized in-depth cooperation or support, please send the details to [email protected] This issue was closed due to inactivity. Thanks.

Apr 27 '23 07:04 binmakeswell