
[BUG]: shardformer: pipeline forward error with customized layer distribution

Open insujang opened this issue 1 year ago • 3 comments

πŸ› Describe the bug

Hi, I am trying to implement a custom shard policy with a different layer distribution, but it seems all built-in policies share the following inconsistent implementation:

In get_held_layers(), a policy uses self.distribute_layers() and self.get_stage_index(), which are customizable: https://github.com/hpcaitech/ColossalAI/blob/79718fae04fc4461a35ae80ab87f52b64260f394/colossalai/shardformer/policies/gpt2.py#L170-L175

But in set_pipeline_forward(), the policy uses Policy.distribute_layers() and Policy.get_stage_index(): https://github.com/hpcaitech/ColossalAI/blob/79718fae04fc4461a35ae80ab87f52b64260f394/colossalai/shardformer/policies/gpt2.py#L192-L193

Because the second call site bypasses any override, the two code paths compute different layer assignments when these functions are overridden, which raises an error during the pipeline forward pass.
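
For illustration only (this is not ColossalAI code): calling a @staticmethod through the base class ignores subclass overrides, while calling it through self or the subclass dispatches to the override. A minimal, self-contained sketch of the mechanism, with a simplified stand-in for Policy.distribute_layers (the real implementation may split layers differently):

    # Toy reproduction of the dispatch mismatch (not ColossalAI code).
    from typing import List


    class BasePolicy:
        @staticmethod
        def distribute_layers(num_layers: int, num_stages: int) -> List[int]:
            # Simplified stand-in: even split, remainder given to the last stages.
            quotient, remainder = divmod(num_layers, num_stages)
            return [quotient + (1 if stage >= num_stages - remainder else 0)
                    for stage in range(num_stages)]


    class CustomPolicy(BasePolicy):
        @staticmethod
        def distribute_layers(num_layers: int, num_stages: int) -> List[int]:
            # Same shape of override as in this report: first stage gets 4 extra layers.
            layers_per_stage = BasePolicy.distribute_layers(num_layers - 4, num_stages)
            layers_per_stage[0] += 4
            return layers_per_stage


    policy = CustomPolicy()
    print(policy.distribute_layers(12, 4))      # like get_held_layers(): override used -> [6, 2, 2, 2]
    print(BasePolicy.distribute_layers(12, 4))  # like set_pipeline_forward(): override ignored -> [3, 3, 3, 3]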

How to reproduce

I tested with examples/language/gpt/hybridparallelism/finetune.py. For the hybrid_parallel plugin, add a custom policy:

    elif args.plugin == "hybrid_parallel":
        BATCH_SIZE = 128
        from typing import List

        from colossalai.shardformer.policies.base_policy import Policy
        from colossalai.shardformer.policies.gpt2 import GPT2ForSequenceClassificationPolicy

        class CustomGPT2Policy(GPT2ForSequenceClassificationPolicy):
            @staticmethod
            def distribute_layers(num_layers: int, num_stages: int) -> List[int]:
                # Reserve 4 layers, split the rest with the default policy,
                # then give the reserved layers to the first stage.
                layers_per_stage = Policy.distribute_layers(num_layers - 4, num_stages)
                layers_per_stage[0] += 4
                return layers_per_stage

        plugin = HybridParallelPlugin(
            tp_size=1,
            pp_size=4,
            num_microbatches=None,
            microbatch_size=8,
            zero_stage=0,
            precision="fp16",
            initial_scale=1,
            custom_policy=CustomGPT2Policy(),
        )

which distributes the layers slightly differently: the first stage gets 4 more layers than the others.
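
To see the mismatch concretely, the two code paths can be compared directly. This check is hypothetical and reuses Policy and CustomGPT2Policy from the snippet above; it assumes a 12-layer GPT-2 and pp_size=4, and the exact splits depend on the base implementation of Policy.distribute_layers:

    num_layers, num_stages = 12, 4
    # Path taken by get_held_layers(): the override is honored.
    print(CustomGPT2Policy.distribute_layers(num_layers, num_stages))  # e.g. [6, 2, 2, 2]
    # Path taken by set_pipeline_forward(): the override is bypassed.
    print(Policy.distribute_layers(num_layers, num_stages))            # e.g. [3, 3, 3, 3]

So the stage indices used when replacing the forward no longer match the layers actually held on each stage.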

This leads to the following error:

...
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/gpt2/modeling_gpt2.py", line 312, in forward
    query, key, value = self.c_attn(hidden_states).split(self.split_size, dim=2)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/pytorch_utils.py", line 107, in forward
    x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
TypeError: addmm(): argument 'input' (position 1) must be Tensor, not NoneType

Environment

torch 2.1.0 + cu118

insujang commented Dec 15 '23 03:12

Thanks for reporting. Would you like to submit a PR to solve this issue? :)

CWHer commented Dec 15 '23 04:12

Submitted!

insujang commented Dec 15 '23 04:12

Sorry for the delayed update; I was assigned to another task for the last several months. This issue is finally resolved.

CWHer commented Mar 27 '24 08:03