
Why use two streams for context parallel

Open Edenzzzz opened this issue 1 year ago • 2 comments

Hi, I see in https://github.com/NVIDIA/TransformerEngine/blob/29e8bfc99d803770ad82ae9351db63673bc34f69/transformer_engine/pytorch/attention.py#L624 that you use two CUDA streams to resolve "wave quantization" in flash attention. Could you clarify what "wave quantization" means? I think flash attention just uses fp16/bf16.

Edenzzzz avatar Jun 19 '24 03:06 Edenzzzz

Two streams will help overlap communication and computation. The second stream can start processing the next chunk of data as soon as it is received, while the first stream is still working on the previous one.
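A minimal sketch of that double-buffering idea, in plain Python with a background thread standing in for the second CUDA stream. `recv_chunk` and `compute_chunk` are hypothetical stand-ins for the P2P communication and attention kernels; the sleep durations are arbitrary:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def recv_chunk(i):
    # stand-in for receiving KV chunk i from the peer rank
    time.sleep(0.05)
    return i

def compute_chunk(chunk):
    # stand-in for running attention on the received chunk
    time.sleep(0.05)
    return chunk * 2

def serial(n):
    # one stream: receive, then compute, for every chunk
    return [compute_chunk(recv_chunk(i)) for i in range(n)]

def pipelined(n):
    # two "streams": compute chunk i while chunk i+1 is being received
    results = []
    with ThreadPoolExecutor(max_workers=1) as prefetch:
        nxt = prefetch.submit(recv_chunk, 0)
        for i in range(n):
            chunk = nxt.result()
            if i + 1 < n:
                nxt = prefetch.submit(recv_chunk, i + 1)
            results.append(compute_chunk(chunk))
    return results

t0 = time.perf_counter(); out_s = serial(4); t_serial = time.perf_counter() - t0
t0 = time.perf_counter(); out_p = pipelined(4); t_pipe = time.perf_counter() - t0
print(out_s == out_p, t_pipe < t_serial)
```

With 4 chunks the serial version pays for 8 stages end to end, while the pipelined version hides all but the first receive behind compute, roughly 0.4 s vs 0.25 s here.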

i4never avatar Jun 20 '24 08:06 i4never

In fact, I repeatedly see longer runtimes when using two streams. Wave quantization is defined here: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html
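For reference, wave quantization is about thread-block scheduling, not numerics: a kernel's blocks run in "waves" of however many blocks the GPU can hold concurrently, and a block count just past a multiple of that leaves the last wave nearly empty. A quick arithmetic sketch (the 108 figure is illustrative, e.g. one block per SM on an A100):

```python
import math

def wave_stats(num_blocks, concurrent_blocks):
    """Waves needed, and utilization of the last (tail) wave."""
    waves = math.ceil(num_blocks / concurrent_blocks)
    tail = num_blocks - (waves - 1) * concurrent_blocks
    return waves, tail / concurrent_blocks

# 108 blocks fill one wave exactly on a 108-block GPU...
print(wave_stats(108, 108))  # (1, 1.0)
# ...but 110 blocks need a second wave that is only ~2% occupied,
# roughly doubling runtime for ~2% more work.
print(wave_stats(110, 108))
```

Splitting a kernel across two streams halves the blocks per launch, so each launch can suffer its own partially full tail wave; whether the overlap wins depends on the problem size, which may explain the longer runtimes observed here.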

Edenzzzz avatar Jun 25 '24 07:06 Edenzzzz