TransformerEngine
CPU reload double buffer
Description
Adds double buffering for reloading activations from CPU to GPU.
This helps reduce memory fragmentation when CPU offloading is used close to peak GPU memory.
Note that this feature works only when the modules between sync functions are symmetrical (LLM training is the main use case, not DiT or multimodal models).
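To illustrate the general idea, here is a minimal pure-Python sketch of double buffering. It is a toy model, not the code in this PR: `DoubleBufferReloader`, `reload`, and the `bytearray` buffers standing in for GPU memory are all hypothetical names chosen for the example. It assumes every reload has the same size, which is one way to read the symmetry requirement above.

```python
class DoubleBufferReloader:
    """Toy model of double-buffered reload (hypothetical, not TransformerEngine's API)."""

    def __init__(self, buffer_size: int) -> None:
        # Allocate two fixed-size buffers once, instead of a fresh one per reload.
        # The bytearrays stand in for preallocated GPU memory.
        self.buffers = [bytearray(buffer_size), bytearray(buffer_size)]
        self.next_slot = 0

    def reload(self, cpu_activation: bytes) -> int:
        """Copy one offloaded activation into the next buffer and return its slot."""
        if len(cpu_activation) != len(self.buffers[0]):
            # Reuse only works if every reload fits the same preallocated buffer.
            raise ValueError("all reloaded activations must match the buffer size")
        slot = self.next_slot
        # Slice assignment overwrites the buffer in place (no new allocation).
        self.buffers[slot][:] = cpu_activation
        # Alternate slots so the previously filled buffer stays valid
        # while the next one is being written.
        self.next_slot = 1 - slot
        return slot


# Four hypothetical layers with same-sized activations offloaded to "CPU".
offloaded = [bytes([layer]) * 8 for layer in range(4)]
reloader = DoubleBufferReloader(buffer_size=8)
buffer_ids = [id(b) for b in reloader.buffers]

# The backward pass needs activations in reverse layer order.
slots = []
for activation in reversed(offloaded):
    slots.append(reloader.reload(activation))
```

In this sketch the reloads alternate between the two slots and no buffer is ever reallocated, which is why a scheme like this can avoid the repeated allocate-and-free pattern that fragments memory. A layer whose activation has a different size is rejected, which shows why asymmetric models do not fit this approach.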