[BUG]: RuntimeError when training GPT with PP=8, TP=1, and ZeRO enabled
🐛 Describe the bug
I can train with PP=8, TP=1 without the ZeRO strategy, but enabling ZeRO triggers the RuntimeError below. My config is:
```python
# from model import GPT2_small_pipeline_hybrid
from model import GPT_13b_pp1d
import torch
from colossalai.nn.optimizer import HybridAdam
from colossalai.zero.shard_utils import TensorShardStrategy
from colossalai.amp import AMP_TYPE

BATCH_SIZE = 2
NUM_EPOCHS = 4
SEQ_LEN = 4096
NUM_MICRO_BATCHES = 1
HIDDEN_SIZE = 5120
TENSOR_SHAPE = (BATCH_SIZE // NUM_MICRO_BATCHES, SEQ_LEN, HIDDEN_SIZE)

# cudnn_benchmark = True
# cudnn_benchmark = False

# if you do not want ZeRO, just comment out this dictionary
zero = dict(
    model_config=dict(tensor_placement_policy='cuda', shard_strategy=TensorShardStrategy()),
    optimizer_config=dict(initial_scale=2**5),
)

optimizer = dict(
    type=HybridAdam,
    lr=0.000015,
    weight_decay=1e-2,
)

# fp16 = dict(mode=AMP_TYPE.NAIVE)

model = dict(
    type=GPT_13b_pp1d,
    checkpoint=True,  # num_chunks=1,
    dtype=torch.half,  # fused=True,
)

# pipeline parallel: modify the integer value for the number of pipeline stages
# tensor parallel: modify size to set the tensor parallel size, usually the number of GPUs per node
# for the current model implementation, mode can only be '1d' or None
parallel = dict(
    pipeline=8,
    tensor=dict(size=1, mode='1d'),
)
```
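For context, a legacy-style config like this is read at launch time. Below is a minimal sketch of the wiring; the config path, placeholder model, and data are illustrative assumptions, while `colossalai.launch_from_torch` and `colossalai.initialize` are the legacy ColossalAI entry points:

```python
import colossalai
import torch
from torch.utils.data import DataLoader, TensorDataset

# Parse the config file above and set up the distributed environment
# (assumes the script is started with torchrun / torch.distributed.launch).
colossalai.launch_from_torch(config='./config.py')

# Placeholder model, optimizer, and data just to show the wiring; train_gpt.py
# instead builds these from the `model` and `optimizer` dicts in the config.
model = torch.nn.Linear(5120, 5120)
optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-5)
criterion = torch.nn.MSELoss()
train_dataloader = DataLoader(TensorDataset(torch.randn(8, 5120), torch.randn(8, 5120)), batch_size=2)

# With a `zero` dict in the config, the returned engine wraps the model in
# ShardedModelV2 (the sharded_model_v2.py frame visible in the traceback).
engine, train_dataloader, *_ = colossalai.initialize(model, optimizer, criterion, train_dataloader)
```

With the `zero` dict enabled, training then fails as follows: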
```
Traceback (most recent call last):
  File "train_gpt.py", line 130, in <module>
    main()
  File "train_gpt.py", line 126, in main
    return_output_label=False)
  File "/root/gpt/titans/mytrainer.py", line 325, in fit
    return_output_label=return_output_label,
  File "/root/gpt/titans/mytrainer.py", line 185, in _train_epoch
    return_output_label=return_output_label,
  File "/root/pkgs/py37/lib/python3.7/site-packages/colossalai/engine/_base_engine.py", line 201, in execute_schedule
    output, label, loss = self._schedule.forward_backward_step(self, data_iter, **kwargs)
  File "/root/pkgs/py37/lib/python3.7/site-packages/colossalai/engine/schedule/_pipeline_schedule.py", line 395, in forward_backward_step
    accum_loss=accum_loss)
  File "/root/pkgs/py37/lib/python3.7/site-packages/colossalai/engine/schedule/_pipeline_schedule.py", line 249, in _forward_step
    output_obj = self._call_engine(engine.model, data)
  File "/root/pkgs/py37/lib/python3.7/site-packages/colossalai/engine/schedule/_pipeline_schedule.py", line 186, in _call_engine
    return model(stage_output, **data)
  File "/root/pkgs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/pkgs/py37/lib/python3.7/site-packages/colossalai/zero/sharded_model/sharded_model_v2.py", line 235, in forward
    outputs = self.module(*args, **kwargs)
  File "/root/pkgs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/gpt/titans/model/pipeline_gpt1d.py", line 56, in forward
    hidden_states = self.head(self.norm(hidden_states))
  File "/root/pkgs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/gpt/titans/model/embed.py", line 358, in forward
    x = F.linear(x, self.head.weight)
RuntimeError: size mismatch, got 8192, 8192x5120,0
```
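The trailing `,0` suggests that `self.head.weight` held zero elements at call time, which would be consistent with ZeRO having sharded (and freed) the weight's payload without re-gathering it for this direct `F.linear` access. A plain-PyTorch sketch of the same shape failure, assuming the flattened activation shape `(BATCH_SIZE * SEQ_LEN, HIDDEN_SIZE) = (8192, 5120)` and an emptied weight as a stand-in:

```python
import torch
import torch.nn.functional as F

x = torch.randn(8192, 5120)  # flattened (BATCH_SIZE * SEQ_LEN, HIDDEN_SIZE)
w = torch.empty(0)           # stand-in for a sharded weight whose payload was freed

try:
    F.linear(x, w)           # matrix product against a zero-element weight
except RuntimeError as e:
    print(e)                 # a size-mismatch error analogous to the one above
```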
Environment
No response
This version of Pipeline Parallel requires users to adapt their model into a distributed model themselves, which may not be easy if you are not familiar with ColossalAI's source code. If you want to try Pipeline Parallel, I recommend following the example at https://github.com/hpcaitech/ColossalAI/blob/main/examples/language/gpt/titans/model/pipeline_gpt1d.py to modify your own model, roughly along the lines of the sketch below. The other option is to try the feature we are developing to reduce this burden: https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt/experiments/pipeline_parallel
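For reference, the partitioning in the linked example looks roughly like this: the first stage owns the embedding, the last stage owns the final norm and LM head, and every stage holds a contiguous slice of transformer blocks. All class and argument names here are hypothetical simplifications, not the actual titans code:

```python
import torch.nn as nn

class GPTStage(nn.Module):
    """One pipeline stage of a manually partitioned GPT (simplified sketch)."""

    def __init__(self, hidden_size, num_layers, first=False, last=False, vocab_size=50257):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size) if first else None
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(hidden_size, nhead=8, batch_first=True)
            for _ in range(num_layers)
        )
        self.norm = nn.LayerNorm(hidden_size) if last else None
        self.head = nn.Linear(hidden_size, vocab_size, bias=False) if last else None

    def forward(self, hidden_states=None, input_ids=None):
        # The first stage consumes token ids; later stages consume the previous
        # stage's hidden states (matching `model(stage_output, **data)` in the
        # pipeline schedule shown in the traceback).
        if self.embed is not None:
            hidden_states = self.embed(input_ids)
        for block in self.blocks:
            hidden_states = block(hidden_states)
        if self.norm is not None:
            hidden_states = self.head(self.norm(hidden_states))
        return hidden_states
```

With `pipeline=8`, each stage would then hold roughly one eighth of the transformer blocks, and only the last stage computes logits.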
Thanks for your comment. Making PP easy to use is what we are working on now. If you have any questions, please don't hesitate to contact us.
We have made many updates since then. This issue was closed due to inactivity. Thanks.