
[RFC]: Integrate Pipeline and Non-Pipeline Model Implementations

Open FrankLeeeee opened this issue 3 years ago • 6 comments

Proposal

In the current model zoo and examples, one model often has two different implementations, e.g. GPT and PipelineGPT. This is because some large models require manual partitioning of the layers; for example, PipelineGPT partitions the layers inside its __init__. This is inelegant and adds maintenance complexity.

One way to integrate these implementations is to introduce an additional abstraction. For example, we can have a Pipelinable interface: the model inherits this class and implements the to_layer_list method. Afterwards, colossalai.initialize reads the pipeline policy from the configuration and partitions the model into different stages.

In this way, we can keep only one model implementation and let the policy handle the layer partitioning.

from abc import ABC, abstractmethod
from typing import List

import torch.nn as nn

# PartitionPolicy is the proposed (not yet implemented) policy interface.

class Pipelinable(ABC):

    def __init__(self):
        self.policy = None

    @abstractmethod
    def to_layer_list(self) -> List[nn.Module]:
        pass

    def load_policy(self, policy: PartitionPolicy):
        self.policy = policy

    def partition(self):
        layer_list = self.to_layer_list()
        return self.policy.partition(layer_list)
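To make the proposal concrete, here is a minimal, dependency-free sketch of how the interface might be used. UniformPartitionPolicy and ToyModel are hypothetical names invented for illustration, and plain callables stand in for nn.Module layers:

```python
from abc import ABC, abstractmethod
from typing import Callable, List


class Pipelinable(ABC):
    """Single-implementation model that can be split into pipeline stages."""

    def __init__(self):
        self.policy = None

    @abstractmethod
    def to_layer_list(self) -> List[Callable]:
        pass

    def load_policy(self, policy):
        self.policy = policy

    def partition(self):
        return self.policy.partition(self.to_layer_list())


class UniformPartitionPolicy:
    """Hypothetical policy: split the layer list into num_stages contiguous chunks."""

    def __init__(self, num_stages: int):
        self.num_stages = num_stages

    def partition(self, layers):
        chunk = (len(layers) + self.num_stages - 1) // self.num_stages
        return [layers[i:i + chunk] for i in range(0, len(layers), chunk)]


class ToyModel(Pipelinable):
    def __init__(self):
        super().__init__()
        # Stand-ins for nn.Module layers; a real model would build
        # embedding / transformer blocks / head here, exactly once.
        self.layers = [lambda x, k=k: x + k for k in range(4)]

    def to_layer_list(self):
        return self.layers


model = ToyModel()
model.load_policy(UniformPartitionPolicy(num_stages=2))
stages = model.partition()
print(len(stages), len(stages[0]))  # 2 2
```

The key point is that ToyModel itself contains no pipeline logic; the same class works unpartitioned, and colossalai.initialize (or any driver) could call load_policy and partition only when pipeline parallelism is configured.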

Self-service

  • [x] #794

FrankLeeeee avatar Apr 18 '22 07:04 FrankLeeeee

In my opinion, PipelineGPT is different from its non-pipeline counterpart because users can only implement layers which will be pipelined with the same input args and output results. But in a non-pipeline model, users can do it with any format they'd like to use. I'm not sure whether this RFC is designed for them.

Wesley-Jzy avatar Apr 18 '22 07:04 Wesley-Jzy

In my opinion, PipelineGPT is different from its non-pipeline counterpart because users can only implement layers which will be pipelined with the same input args and output results. But in a non-pipeline model, users can do it with any format they'd like to use. I'm not sure whether this RFC is designed for them.

May I know what "users can only implement layers which will be pipelined with the same input args and output results." means?

FrankLeeeee avatar Apr 18 '22 07:04 FrankLeeeee

Just like the specific split for the attention mask. I wonder how it will be handled in the integration. https://github.com/hpcaitech/ColossalAI/blob/8711c706f4f01119dcc3923b942db304f39cc26b/model_zoo/gpt/gpt.py#L374 https://github.com/hpcaitech/ColossalAI/blob/8711c706f4f01119dcc3923b942db304f39cc26b/model_zoo/gpt/gpt.py#L297

Wesley-Jzy avatar Apr 18 '22 08:04 Wesley-Jzy

Will shifting this logic to GPTBlock solve this problem? @Wesley-Jzy
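One possible reading of this suggestion, sketched without torch: each block expands the raw mask internally, so every pipelined layer exposes the same (x, mask) signature. GPTBlock here is a skeleton invented for illustration, with the attention and feed-forward computation elided:

```python
class GPTBlock:
    """Illustrative block that preprocesses the raw attention mask itself,
    so its call signature matches every other pipelined layer."""

    def __call__(self, x, raw_mask):
        mask = self._expand(raw_mask)  # done per block, not once in GPT.forward
        # ... attention(x, mask), feed-forward, residuals would go here ...
        return x, raw_mask             # pass the raw mask through unchanged

    @staticmethod
    def _expand(raw_mask):
        # Additive mask: 0 where attended, a large negative value where masked.
        return [0.0 if m else -10000.0 for m in raw_mask]


# Uniform signature lets blocks be chained or partitioned freely.
blocks = [GPTBlock(), GPTBlock()]
x, mask = [1.0], [1, 0]
for blk in blocks:
    x, mask = blk(x, mask)
```

The cost of this uniformity is that the expansion is recomputed in every block instead of once per forward pass, which is the kind of trade-off a partitioning scheme has to weigh.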

FrankLeeeee avatar Apr 18 '22 08:04 FrankLeeeee

@Wesley-Jzy Theoretically, the mask should be fetched from the dataloader, partitioned along with the input ids in the embedding layer, and passed to the subsequent transformer blocks. In the pipeline parallelism case, since all the blocks share the same mask, it would be wasteful to pass the mask across the pipeline; therefore, we partition the mask before the blocks on each stage separately. However, in the more common case, the first approach, which passes all activations across the pipeline, should make sense.
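The trade-off described above can be sketched abstractly by counting the tensors that cross each stage boundary. All names here (partition_mask, make_stage) are stand-ins for illustration, not ColossalAI APIs:

```python
def partition_mask(batch_mask):
    # Stand-in for deriving the per-stage mask from the (replicated) batch.
    return batch_mask


def make_stage():
    # Stand-in for the group of transformer blocks on one pipeline stage.
    return lambda x, mask: [xi + mi for xi, mi in zip(x, mask)]


def run_passing_mask(stages, x, mask):
    """(a) General case: the mask travels with the activations."""
    sent = 0
    for stage in stages:
        x = stage(x, mask)
        sent += 2  # hidden states + mask cross each stage boundary
    return x, sent


def run_local_mask(stages, x, batch_mask):
    """(b) Optimization: each stage derives the mask locally, so only
    hidden states cross the stage boundaries."""
    sent = 0
    for stage in stages:
        x = stage(x, partition_mask(batch_mask))
        sent += 1
    return x, sent


stages = [make_stage() for _ in range(3)]
xa, cost_a = run_passing_mask(stages, [0, 0], [1, 0])
xb, cost_b = run_local_mask(stages, [0, 0], [1, 0])
print(cost_a, cost_b)  # 6 3
```

Both strategies produce the same output; (b) simply trades a small amount of redundant local work for less cross-stage communication, which is why it only pays off when all blocks share the same mask.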

kurisusnowdeng avatar Apr 18 '22 08:04 kurisusnowdeng

thx, I got it.

Wesley-Jzy avatar Apr 18 '22 08:04 Wesley-Jzy

We have updated a lot since this issue was opened. Closing due to inactivity. Thanks.

binmakeswell avatar Apr 13 '23 04:04 binmakeswell