ColossalAI
ColossalAI copied to clipboard
[FEATURE]: set ComputePattern for op rather than parameter
Describe the feature
We set spec on parameter now, which means each paramter has its own unchanged compute_pattern. However, some models, like GPT-2, share parameter among different layers. GPT-2 shares token embedding weight among embedding layer and classifier layer, and it calls torch.nn.functional.embedding
in embedding layer and calls torch.nn.functional.linear
in classifier layer. One parameter may have two different compute pattern during training. Therefore, we should set compute pattern for operations rather than parameters.