Paddle: Pipeline and Virtual-Pipeline Parallelism Using CUDA Graph and Integrate CUDAMallocAsyncAllocator

PR types

New features

PR changes

Others

Description

This PR enables Pipeline Parallelism (PP) and Virtual-Pipeline Parallelism (VP) training with CUDA Graph. The challenges encountered and the solutions implemented are described below:

PP/VP + CUDA Graph

Usage: Enable CUDA Graph in PipelineLayer using the use_cudagraph=true flag.

PP and CUDA Graph:
- Challenge: PP layers differ from a standard nn.Layer because they are distributed across multiple GPUs. They come in two types, Layer and SharedLayer, and SharedLayer's need for synchronous communication conflicts with CUDA Graph's capture requirements.
- Solution: PP layers are restructured into multiple capturable sublayers (runs of contiguous Layers), each of which is captured into a separate graph. This enables efficient multi-GPU execution and mitigates the communication conflicts inherent in SharedLayers.
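
The restructuring described above can be illustrated with a small sketch. This is not the PR's implementation: `LayerDesc` and `split_into_segments` are hypothetical names, and the sketch assumes that SharedLayers are simply left outside the captured graphs, which the description does not state explicitly.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LayerDesc:
    """Hypothetical stand-in for one entry in a pipeline stage."""
    name: str
    shared: bool = False  # True models a SharedLayer (assumed to run outside any graph)


def split_into_segments(layers: List[LayerDesc]) -> List[dict]:
    """Group runs of contiguous non-shared layers; each run could become one graph."""
    segments: List[dict] = []
    run: List[str] = []
    for layer in layers:
        if layer.shared:
            # A shared layer ends the current capturable run.
            if run:
                segments.append({"capturable": True, "layers": run})
                run = []
            segments.append({"capturable": False, "layers": [layer.name]})
        else:
            run.append(layer.name)
    if run:
        segments.append({"capturable": True, "layers": run})
    return segments


stage = [
    LayerDesc("embedding", shared=True),
    LayerDesc("block_0"),
    LayerDesc("block_1"),
    LayerDesc("lm_head", shared=True),
]
segments = split_into_segments(stage)
for seg in segments:
    print(seg)
```

In this toy stage, the two transformer blocks form a single capturable segment, while the shared layers on either side stay as separate, non-captured segments.
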
VP and CUDA Graph:
- Challenge: VP employs a non-traditional training pattern in which multiple forward (FW) and backward (BW) stages are in flight concurrently. Each layer therefore needs continued access to its previous inputs throughout the training cycle.
- Solution: A queue is integrated into the CUDAGraphedLayer to store previous inputs. Each virtual pipeline stage is then captured as a separate graph.
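
The input queue can be pictured with a minimal sketch. `QueuedGraphedLayer` is a hypothetical toy class, not Paddle's CUDAGraphedLayer; it assumes that backward passes consume inputs in the same order the forward passes produced them (FIFO), and it uses simple arithmetic where the real layer would replay a captured graph.

```python
from collections import deque


class QueuedGraphedLayer:
    """Toy model: forward calls enqueue inputs, backward calls consume them in FIFO order."""

    def __init__(self, name):
        self.name = name
        self._pending_inputs = deque()

    def forward(self, x):
        # Keep the input alive until the matching backward pass needs it.
        self._pending_inputs.append(x)
        return x * 2  # placeholder for replaying the forward graph

    def backward(self, grad_out):
        # Retrieve the oldest input that has not yet been used by a backward pass.
        x = self._pending_inputs.popleft()
        return x, grad_out * 2  # placeholder for replaying the backward graph

    def pending(self):
        return len(self._pending_inputs)


layer = QueuedGraphedLayer("virtual_stage_0")
layer.forward(1.0)              # micro-batch 0
layer.forward(2.0)              # micro-batch 1, before any backward has run
first = layer.backward(1.0)     # pairs with micro-batch 0
layer.forward(3.0)              # micro-batch 2
second = layer.backward(1.0)    # pairs with micro-batch 1
print(first, second, layer.pending())
```

Without the queue, the second forward call would overwrite the only saved input, and the first backward call would pair with the wrong micro-batch.
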

CUDAMallocAsyncAllocator

Usage: Activate the CUDAMallocAsyncAllocator by setting FLAGS_use_cuda_malloc_async_allocator=1.
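
One way to set the flag from a launch script is sketched below. It assumes the flag is read from the environment under Paddle's usual `FLAGS_` prefix and that it must be set before `paddle` is imported; `enable_cuda_malloc_async` is a hypothetical helper, not part of Paddle.

```python
import os

# Assumption: the flag uses Paddle's usual FLAGS_ environment-variable prefix.
FLAG_NAME = "FLAGS_use_cuda_malloc_async_allocator"


def enable_cuda_malloc_async(env=os.environ):
    """Set the flag in the given environment mapping and return its value."""
    env[FLAG_NAME] = "1"
    return env[FLAG_NAME]


enable_cuda_malloc_async()
# import paddle  # import only after the flag is set so it is seen at start-up
```

Exporting the variable in the shell before launching the training job has the same effect.
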

CUDAMallocAsyncAllocator: The CUDAMallocAsyncAllocator replaces the StreamSafeCUDAAllocator. By using cudaMallocAsync and cudaFreeAsync, it transfers responsibility for stream-ordered memory management from the framework to CUDA. This may lead to better memory utilization and potentially better application performance, since allocation and deallocation are handled within CUDA.
CUDAMallocAsyncAllocator + CUDAGraph:
- Challenge: Integrating PP/VP with CUDA Graph creates a large number of graphs, and in Paddle's existing implementation each graph has its own memory pool. This resulted in low memory reuse and a high memory footprint, because allocations were not released until the graph was deleted.
- Solution: The CUDAMallocAsyncAllocator captures memory allocation and deallocation semantics as part of the CUDAGraph's operations, which improves memory reuse across graphs. For instance, in GPT3_1.3B+PP4+BF16 CUDAGraph training on 4 H100-80GB GPUs, CUDAMallocAsyncAllocator has been observed to reduce memory usage from 95% to 25%.
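
The reasoning behind the reduced footprint can be shown with a toy accounting model. This is not Paddle's allocator: the function names are hypothetical, the sizes are arbitrary illustrative values rather than measurements, and the model assumes graphs replay one at a time and ignores tensors that must stay alive between graphs.

```python
def peak_private_pools(graph_peaks):
    """Each graph keeps its own pool until the graph is destroyed, so pools add up."""
    return sum(graph_peaks)


def peak_shared_stream_ordered(graph_peaks):
    """Frees are captured inside each graph, so memory returns to one shared
    pool before the next graph replays and the largest graph sets the peak."""
    return max(graph_peaks, default=0)


# Arbitrary per-graph transient memory needs in MB (illustrative only).
graph_peaks_mb = [512, 768, 256, 640]

private_mb = peak_private_pools(graph_peaks_mb)
shared_mb = peak_shared_stream_ordered(graph_peaks_mb)
print(private_mb, shared_mb)
```

Under these assumptions, the footprint with per-graph pools grows with the number of graphs, while the footprint with captured allocation and deallocation is bounded by the most demanding graph.
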

Jan 02 '24 15:01 eee4017

Your PR has been submitted successfully. Thanks for your contribution to the open-source project! Please watch for the results of the automated CI tests. See the Paddle CI Manual for details.

Jan 02 '24 15:01 paddle-bot[bot]

2024-01-09 18:03:11 0. You must have one RD (lanxianghit (Recommend), phlrain or luotao1 or Aurelius84) approval for changing the FLAGS, which manages the environment variables.

Jan 11 '24 03:01 onecatcn

LGTM. Could the print be fixed in the next PR?

OK

Jan 11 '24 04:01 eee4017

Sorry to inform you that the CIs for 5674da9 passed more than 7 days ago. To prevent PR conflicts, you need to re-run all CIs manually.

Jan 17 '24 03:01 paddle-ci-bot[bot]

2024-02-01 11:03:37 **************************************************************
2024-02-01 11:03:37 Please find RD for approval first, and then find TPM for approval.
2024-02-01 11:03:37 0. APIs without core.ops:
2024-02-01 11:03:37 paddle.device.cuda.graphs.construct_program_and_find_ins_outs
2024-02-01 11:03:37 You must have one RD (JiabinYang (Recommend) or wanghuancoder, phlrain) approval for the api change for the opreator-related api without '_C_ops'.
2024-02-01 11:03:37 For more details, please click [https://github.com/PaddlePaddle/Paddle/wiki/paddle_api_development_manual.md]
2024-02-01 11:03:37
2024-02-01 11:03:37 There are 1 approved errors.
2024-02-01 11:03:37 **************************************************************
2024-01-31 16:23:51 ****************
2024-01-31 16:23:51 0. You must have one RD (lanxianghit (Recommend), phlrain or luotao1 or Aurelius84) approval for changing the FLAGS, which manages the environment variables.
2024-01-31 16:23:51 1. You must have raindrops2sea or XiaoguangHu01 approval for change 20+ files or add than 1000+ lines of content.
2024-01-31 16:23:51 2. You must have one RD (risemeup1 or Galaxy1458) approval for the change of C++ template.
2024-01-31 16:23:51 There are 3 approved errors.
2024-01-31 16:23:51 ****************

Feb 05 '24 06:02 onecatcn

LGTM

Feb 20 '24 04:02 From00
