
[FEATURE]: 2-stage DataParallel Load Sharded CheckPoint Strategy

Open · superleo opened this issue 1 year ago · 1 comment

Describe the feature

Training in a large-scale environment with a multi-way data-parallel (DP) strategy requires huge data-load throughput. For example, training LLAMA2-70B (a checkpoint of roughly 700 GB) with TP, PP, and DP = 16 means loading 700 GB × 16 of data from shared storage, which costs too much time.

Describe the solution you'd like

  1. Exchange ShardedTensor metadata between all nodes.
  2. Align the needed tensors within DP groups.
  3. For each globally unique tensor (see the sketch after this list):
     a) on one of the ranks, load it from storage to CPU and move it to CUDA
     b) allocate a CUDA tensor on the other ranks
     c) broadcast within the DP group
     d) copy the tensor content to the model parameter location
     e) free the tensor buffers from a) and b)

Describe alternatives you've considered
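A minimal sketch of these steps using PyTorch distributed collectives is shown below. The helper names `exchange_shard_metadata`, `load_shard_from_storage`, `shard_meta`, and `dp_group` are assumptions for illustration only, not Colossal-AI or Megatron APIs, and for simplicity the metadata exchange is done within the DP group rather than across all nodes.

```python
# Hypothetical sketch of the 2-stage DP sharded-checkpoint load.
import torch
import torch.distributed as dist

def exchange_shard_metadata(local_meta, dp_group):
    """Steps 1-2: all-gather {key: (shape, dtype)} dicts so every rank in the
    DP group sees the full set of globally unique tensors."""
    gathered = [None] * dist.get_world_size(group=dp_group)
    dist.all_gather_object(gathered, local_meta, group=dp_group)
    merged = {}
    for meta in gathered:
        merged.update(meta)
    return merged

def load_sharded_checkpoint_dp(model_params, shard_meta, dp_group, load_shard_from_storage):
    """Step 3: load each globally unique shard from storage once per DP group
    and broadcast it to the other DP ranks.

    model_params: dict mapping tensor key -> model parameter (CUDA tensor)
    shard_meta:   dict mapping tensor key -> (shape, dtype), already exchanged
    dp_group:     torch.distributed process group for data parallelism
    load_shard_from_storage: callable(key) -> CPU tensor read from shared storage
    """
    dp_rank = dist.get_rank(group=dp_group)
    dp_world_size = dist.get_world_size(group=dp_group)

    # Deterministic order so all ranks agree on which rank loads which tensor.
    for i, (key, (shape, dtype)) in enumerate(sorted(shard_meta.items())):
        src_rank = i % dp_world_size  # round-robin the storage reads
        if dp_rank == src_rank:
            # 3.a) load from storage to CPU, then move to CUDA
            buf = load_shard_from_storage(key).to(device="cuda", dtype=dtype)
        else:
            # 3.b) allocate an empty CUDA buffer on the other ranks
            buf = torch.empty(shape, dtype=dtype, device="cuda")

        # 3.c) broadcast within the DP group (broadcast expects a global rank)
        src_global_rank = dist.get_global_rank(dp_group, src_rank)
        dist.broadcast(buf, src=src_global_rank, group=dp_group)

        # 3.d) copy the tensor content into the model parameter location
        model_params[key].data.copy_(buf)

        # 3.e) free the temporary buffer
        del buf
    torch.cuda.empty_cache()
```

With this scheme each checkpoint shard is read from shared storage once per DP group instead of once per DP rank, so the storage traffic in the 700 GB × 16 example above drops by roughly a factor of the DP size, at the cost of one intra-group broadcast per tensor.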

**Optional: Affiliation**
Megatron is implementing this feature now.

superleo · Feb 23 '24 08:02

Hi Xiao Lei, thank you for sharing and for using Colossal-AI! We will carefully consider your advice about this new feature. As Colossal-AI is an open-source project, we encourage people to create their own repositories and help us improve. It would be great if you could implement this feature and contribute it back to us. If you have any questions or need assistance getting started, please feel free to ask.

Yanjia0 · Feb 23 '24 09:02