[feature] new zero implementation

Open 1SAA opened this issue 1 year ago • 0 comments

A New ZeRO Implementation

Background

The current version of our ZeRO has a performance issue. Chunks are distributed asymmetrically across processes, so one process stalls the others while it reads the contents of a chunk located in CPU memory. This prolongs data transfer and undermines the efficiency of ZeRO.

Implementation

To solve this problem, I refactored the Chunk class. Each chunk is now distributed evenly across all processes, so all processes can move data from CPU memory to CUDA memory at the same time. I also added an option to enable pinned memory for chunks, so that every chunk can keep a copy in pinned CPU memory. Together, these optimizations significantly improve the efficiency of data movement between CPU and CUDA.
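To make the even distribution concrete, here is a minimal sketch of how a chunk could be split into equal shards, one per process. It is not the actual Chunk class: `Shard`, `shard_chunk`, and the padding strategy are hypothetical names and assumptions used only for illustration, and pinned memory is not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Shard:
    """Hypothetical description of the slice of a chunk owned by one rank."""
    rank: int
    start: int  # inclusive offset into the padded chunk
    end: int    # exclusive offset into the padded chunk


def shard_chunk(chunk_numel: int, world_size: int) -> List[Shard]:
    """Split a chunk into equally sized shards, one per process.

    Assumption: the chunk is padded up to a multiple of world_size,
    so every rank owns exactly the same number of elements.
    """
    if chunk_numel <= 0 or world_size <= 0:
        raise ValueError("chunk_numel and world_size must be positive")
    shard_numel = -(-chunk_numel // world_size)  # ceiling division
    return [
        Shard(rank, rank * shard_numel, (rank + 1) * shard_numel)
        for rank in range(world_size)
    ]


shards = shard_chunk(chunk_numel=10, world_size=4)
shard_sizes = [s.end - s.start for s in shards]
```

Because every shard has the same size, each process moves the same amount of data, and no single process has to read a whole chunk on behalf of the others.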

Another Advantage

The new ZeRO supports true hybrid parallelism. It creates separate chunk groups for parameters that belong to different DP (data parallel) communication groups. This gives a lot of flexibility to our upcoming automatic parallelism configuration.
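The grouping idea can be sketched as follows. This is an illustrative assumption rather than the real implementation: `build_chunk_groups` is a hypothetical helper, the parameter names are made up, and each DP communication group is represented simply by the tuple of ranks it contains instead of a real process-group object.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

RankGroup = Tuple[int, ...]


def build_chunk_groups(param_dp_ranks: Dict[str, RankGroup]) -> Dict[RankGroup, List[str]]:
    """Bucket parameters by the DP group they communicate in.

    Parameters that share a DP group land in the same chunk group;
    parameters with different DP groups never share a chunk group.
    """
    groups: Dict[RankGroup, List[str]] = defaultdict(list)
    for name, ranks in param_dp_ranks.items():
        # Normalize the key so (2, 0) and (0, 2) name the same group.
        groups[tuple(sorted(ranks))].append(name)
    return dict(groups)


# Hypothetical model: two parameters are replicated over all four ranks,
# two others only over ranks 0 and 2 (e.g. because of tensor parallelism).
params = {
    "embed.weight": (0, 1, 2, 3),
    "layer0.weight": (0, 2),
    "layer0.bias": (2, 0),
    "head.weight": (0, 1, 2, 3),
}
chunk_groups = build_chunk_groups(params)
```

Each resulting chunk group can then be sharded only across the ranks of its own DP group, which is what lets different parallelism configurations coexist in one model.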

Sep 21 '22 10:09 1SAA

ColossalAI