[RFC]: Integrate Pipeline and Non-Pipeline Model Implementations
Proposal
In the current model zoo and examples, one model often has two different implementations, e.g. GPT and PipelineGPT. This is because some large models require manual partitioning of their layers; for example, PipelineGPT partitions the layers inside its __init__. This is inelegant and adds maintenance complexity.
One way to integrate these implementations is to introduce an additional abstraction. For example, we can define a Pipelinable interface; a model inherits from this class and implements the to_layer_list method. colossalai.initialize should then read the pipeline policy from the configuration and partition the model into stages.
In this way, we keep only one model implementation and let the policy handle the layer partitioning.
```python
from __future__ import annotations  # PartitionPolicy is only referenced as a type hint here

from abc import ABC, abstractmethod
from typing import List

import torch.nn as nn


class Pipelinable(ABC):
    def __init__(self):
        self.policy = None

    @abstractmethod
    def to_layer_list(self) -> List[nn.Module]:
        pass

    def load_policy(self, policy: PartitionPolicy):
        self.policy = policy

    def partition(self):
        layer_list = self.to_layer_list()
        return self.policy.partition(layer_list)
```
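To make the intended workflow concrete, here is a minimal sketch of what a simple PartitionPolicy and a model using the interface might look like. The uniform split, ToyGPT and its toy layers are assumptions for illustration only, not part of the proposal or the existing ColossalAI API.

```python
# Sketch only: a simple contiguous-split policy and a toy model that
# implements Pipelinable. Names and layer choices are hypothetical.
from typing import List

import torch.nn as nn


class PartitionPolicy:
    """Example policy: split a flat layer list into `num_stages` contiguous chunks."""

    def __init__(self, num_stages: int):
        self.num_stages = num_stages

    def partition(self, layer_list: List[nn.Module]) -> List[nn.Sequential]:
        chunk = (len(layer_list) + self.num_stages - 1) // self.num_stages
        return [nn.Sequential(*layer_list[i:i + chunk])
                for i in range(0, len(layer_list), chunk)]


class ToyGPT(nn.Module, Pipelinable):
    """Single implementation that works with or without pipeline parallelism."""

    def __init__(self, vocab_size: int = 50257, hidden: int = 768, depth: int = 4):
        nn.Module.__init__(self)
        Pipelinable.__init__(self)
        self.embedding = nn.Embedding(vocab_size, hidden)
        self.blocks = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(depth))
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, input_ids):
        x = self.embedding(input_ids)
        for block in self.blocks:
            x = block(x)
        return self.head(x)

    def to_layer_list(self) -> List[nn.Module]:
        # A flat, ordered list of layers; the policy decides how to group
        # them into pipeline stages.
        return [self.embedding, *self.blocks, self.head]


# What colossalai.initialize could do internally when the config enables
# pipeline parallelism (illustrative only):
model = ToyGPT()
model.load_policy(PartitionPolicy(num_stages=2))
stages = model.partition()  # one nn.Sequential per pipeline stage
```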
Self-service
- [x] #794
In my opinion, PipelineGPT is different from its non-pipeline counterpart because users can only implement layers which will be pipelined with the same input args and output results, whereas in a non-pipeline model they can use whatever format they like. I'm not sure whether this RFC is designed to cover that.
May I know what "users can only implement layers which will be pipelined with the same input args and output results." means?
For example, the specific split of the attention mask. I wonder how it will be handled in the integration. https://github.com/hpcaitech/ColossalAI/blob/8711c706f4f01119dcc3923b942db304f39cc26b/model_zoo/gpt/gpt.py#L374 https://github.com/hpcaitech/ColossalAI/blob/8711c706f4f01119dcc3923b942db304f39cc26b/model_zoo/gpt/gpt.py#L297
Will shifting this logic to GPTBlock solve this problem? @Wesley-Jzy
@Wesley-Jzy Theoretically, the mask should be fetched from the dataloader, partitioned along with the input ids in the embedding layer, and passed on to the follow-up transformer blocks. In the pipeline parallelism case, since all the blocks share the same mask, doing so wastes bandwidth passing the mask across the pipeline; therefore, we instead partition the mask before the blocks on each stage. However, for the more common case, the first approach, which passes all activations across the pipeline, should make sense.
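To make the two options concrete, here is a rough sketch of the data flow. The names (process_mask, ToyBlock, the stage wrappers) are hypothetical and only illustrate the idea, not the actual model_zoo code.

```python
# Rough sketch of the two mask-handling options discussed above.
import torch
import torch.nn as nn


def process_mask(attention_mask: torch.Tensor) -> torch.Tensor:
    # Expand a (batch, seq_len) padding mask to the additive form used by
    # attention: (batch, 1, 1, seq_len), with -10000.0 at padded positions.
    mask = attention_mask[:, None, None, :].to(torch.float32)
    return (1.0 - mask) * -10000.0


class ToyBlock(nn.Module):
    def __init__(self, hidden: int = 8):
        super().__init__()
        self.proj = nn.Linear(hidden, hidden)

    def forward(self, hidden_states, mask):
        # A real transformer block would apply `mask` inside attention;
        # it is ignored here to keep the sketch short.
        return self.proj(hidden_states)


class StageMaskAsActivation(nn.Module):
    # Option 1: the processed mask travels with the hidden states, so each
    # pipeline stage receives and forwards (hidden_states, mask).
    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block

    def forward(self, hidden_states, mask):
        return self.block(hidden_states, mask), mask


class StageLocalMask(nn.Module):
    # Option 2 (roughly what the current PipelineGPT does): each stage reads
    # the raw mask from its own micro-batch and processes it locally, so only
    # hidden states cross the pipeline boundary.
    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block

    def forward(self, hidden_states, attention_mask):
        mask = process_mask(attention_mask)
        return self.block(hidden_states, mask)
```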
thx, I got it.
We have updated a lot. This issue was closed due to inactivity. Thanks.