[RFC]: Integrate Pipeline and Non-Pipeline Model Implementations
Proposal
In the current model zoo and examples, one model often has two different implementations, e.g. GPT and PipelineGPT. This is because some large models require manual partitioning of their layers; for example, PipelineGPT partitions the layers inside its __init__. This is inelegant and adds maintenance complexity.
One way to integrate these implementations is to introduce an additional abstraction. For example, we can define a Pipelinable interface; a model inherits from this class and implements the to_layer_list method. colossalai.initialize should then read the pipeline policy from the configuration and partition the model into stages.
In this way, we keep only one model implementation and let the policy handle the layer partitioning.
```python
from __future__ import annotations  # PartitionPolicy is only referenced as a type hint here

from abc import ABC, abstractmethod
from typing import List

import torch.nn as nn


class Pipelinable(ABC):
    def __init__(self):
        self.policy = None

    @abstractmethod
    def to_layer_list(self) -> List[nn.Module]:
        pass

    def load_policy(self, policy: PartitionPolicy):
        self.policy = policy

    def partition(self):
        layer_list = self.to_layer_list()
        return self.policy.partition(layer_list)
```
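To make the intended workflow concrete, here is a minimal sketch of what a simple PartitionPolicy and a model using the interface might look like. The uniform split, ToyGPT and its toy layers are assumptions for illustration only, not part of the proposal or the existing ColossalAI API.

```python
# Sketch only: a simple contiguous-split policy and a toy model that
# implements Pipelinable. Names and layer choices are hypothetical.
from typing import List

import torch.nn as nn


class PartitionPolicy:
    """Example policy: split a flat layer list into `num_stages` contiguous chunks."""

    def __init__(self, num_stages: int):
        self.num_stages = num_stages

    def partition(self, layer_list: List[nn.Module]) -> List[nn.Sequential]:
        chunk = (len(layer_list) + self.num_stages - 1) // self.num_stages
        return [nn.Sequential(*layer_list[i:i + chunk])
                for i in range(0, len(layer_list), chunk)]


class ToyGPT(nn.Module, Pipelinable):
    """Single implementation that works with or without pipeline parallelism."""

    def __init__(self, vocab_size: int = 50257, hidden: int = 768, depth: int = 4):
        nn.Module.__init__(self)
        Pipelinable.__init__(self)
        self.embedding = nn.Embedding(vocab_size, hidden)
        self.blocks = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(depth))
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, input_ids):
        x = self.embedding(input_ids)
        for block in self.blocks:
            x = block(x)
        return self.head(x)

    def to_layer_list(self) -> List[nn.Module]:
        # A flat, ordered list of layers; the policy decides how to group
        # them into pipeline stages.
        return [self.embedding, *self.blocks, self.head]


# What colossalai.initialize could do internally when the config enables
# pipeline parallelism (illustrative only):
model = ToyGPT()
model.load_policy(PartitionPolicy(num_stages=2))
stages = model.partition()  # one nn.Sequential per pipeline stage
```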
Self-service
- [x] #794
In my opinion, PipelineGPT is different from its non-pipeline counterpart because users can only implement layers which will be pipelined with the same input args and output results, whereas in a non-pipeline model they can use whatever format they like. I'm not sure whether this RFC is designed to cover that.
May I know what "users can only implement layers which will be pipelined with the same input args and output results." means?
For example, the specific split of the attention mask. I wonder how it will be handled in the integration. https://github.com/hpcaitech/ColossalAI/blob/8711c706f4f01119dcc3923b942db304f39cc26b/model_zoo/gpt/gpt.py#L374 https://github.com/hpcaitech/ColossalAI/blob/8711c706f4f01119dcc3923b942db304f39cc26b/model_zoo/gpt/gpt.py#L297
Will shifting this logic to GPTBlock solve this problem? @Wesley-Jzy
@Wesley-Jzy Theoretically, the mask should be fetched from the dataloader, partitioned along with the input ids in the embedding layer, and passed on to the follow-up transformer blocks. In the pipeline parallelism case, since all the blocks share the same mask, doing so wastes bandwidth passing the mask across the pipeline; therefore, we instead partition the mask before the blocks on each stage. However, for the more common case, the first approach, which passes all activations across the pipeline, should make sense.
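To make the two options concrete, here is a rough sketch of the data flow. The names (process_mask, ToyBlock, the stage wrappers) are hypothetical and only illustrate the idea, not the actual model_zoo code.

```python
# Rough sketch of the two mask-handling options discussed above.
import torch
import torch.nn as nn


def process_mask(attention_mask: torch.Tensor) -> torch.Tensor:
    # Expand a (batch, seq_len) padding mask to the additive form used by
    # attention: (batch, 1, 1, seq_len), with -10000.0 at padded positions.
    mask = attention_mask[:, None, None, :].to(torch.float32)
    return (1.0 - mask) * -10000.0


class ToyBlock(nn.Module):
    def __init__(self, hidden: int = 8):
        super().__init__()
        self.proj = nn.Linear(hidden, hidden)

    def forward(self, hidden_states, mask):
        # A real transformer block would apply `mask` inside attention;
        # it is ignored here to keep the sketch short.
        return self.proj(hidden_states)


class StageMaskAsActivation(nn.Module):
    # Option 1: the processed mask travels with the hidden states, so each
    # pipeline stage receives and forwards (hidden_states, mask).
    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block

    def forward(self, hidden_states, mask):
        return self.block(hidden_states, mask), mask


class StageLocalMask(nn.Module):
    # Option 2 (roughly what the current PipelineGPT does): each stage reads
    # the raw mask from its own micro-batch and processes it locally, so only
    # hidden states cross the pipeline boundary.
    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block

    def forward(self, hidden_states, attention_mask):
        mask = process_mask(attention_mask)
        return self.block(hidden_states, mask)
```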
thx, I got it.
We have updated a lot. This issue was closed due to inactivity. Thanks.