
[MPIX Stream] CUDA event compatible completion semantics

Open R0n12 opened this issue 2 months ago • 1 comment

Summary

When using MPIX Stream enqueue APIs with MPICH ch4:ucx in a PyTorch ProcessGroup backend, we observe correctness failures for larger allreduce sizes unless we force a cudaStreamSynchronize after each MPIX enqueue. This strongly suggests the “completion” seen by the application (via a CUDA event recorded after enqueue) can precede the actual GPU work’s visibility/completion on the provided CUDA stream. We need an event-compatible completion mechanism so frameworks can remain non-blocking and preserve overlap.

Scenario

Use a CUDA stream S to

  1. enqueue an in-place allreduce via MPIX_Allreduce_enqueue(..., stream_comm) (created from S)
  2. record a CUDA event E on S immediately after enqueue
  3. later rely on E (or an event wait) for completion instead of cudaStreamSynchronize(S) (see the sketch below)
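
A rough sketch of this pattern, assuming the MPIX Stream extension in recent MPICH (MPIX_Stream_create, MPIX_Stream_comm_create, MPIX_Allreduce_enqueue); exact argument types may vary by MPICH version, and the stream/communicator setup would normally happen once per ProcessGroup rather than per call:

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Sketch only: in a real backend the stream communicator is created once,
 * not on every collective. */
void allreduce_on_stream(float *buf, int count, MPI_Comm comm, cudaStream_t S)
{
    /* Wrap the CUDA stream S in an MPIX stream and a stream communicator. */
    MPI_Info info;
    MPIX_Stream mpi_stream;
    MPI_Comm stream_comm;

    MPI_Info_create(&info);
    MPI_Info_set(info, "type", "cudaStream_t");
    MPIX_Info_set_hex(info, "value", &S, sizeof(S));
    MPIX_Stream_create(info, &mpi_stream);
    MPI_Info_free(&info);
    MPIX_Stream_comm_create(comm, mpi_stream, &stream_comm);

    /* 1. enqueue an in-place allreduce onto S */
    MPIX_Allreduce_enqueue(MPI_IN_PLACE, buf, count, MPI_FLOAT, MPI_SUM,
                           stream_comm);

    /* 2. record a CUDA event E on S immediately after the enqueue */
    cudaEvent_t E;
    cudaEventCreateWithFlags(&E, cudaEventDisableTiming);
    cudaEventRecord(E, S);

    /* 3. later, completion is tracked via E (cudaStreamWaitEvent /
     *    cudaEventQuery) instead of cudaStreamSynchronize(S).
     *    Today, correctness for larger messages requires forcing
     *    cudaStreamSynchronize(S) at this point. */
}
```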

Proposal

A CUDA event recorded after enqueue on the same stream should become “ready” only after the underlying GPU work for the collective is finished and data is visible to subsequent consumers on that stream. This allows non-blocking frameworks to:

  • return immediately
  • compose precise cross-stream dependencies via cudaStreamWaitEvent (sketched below)
  • avoid stream-wide synchronizations
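
Under the proposed semantics, a consumer stream can be ordered after the collective entirely on the device. A minimal sketch, where consumer_kernel and the buffer are illustrative placeholders:

```cuda
#include <cuda_runtime.h>

/* Illustrative kernel that reads the allreduced buffer. */
__global__ void consumer_kernel(const float *buf, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        float x = buf[i];  /* must observe the reduced value */
        (void)x;
    }
}

/* E was recorded on S right after MPIX_Allreduce_enqueue. If E tracks the
 * collective's GPU completion, this composes a cross-stream dependency with
 * no host-side synchronization. */
void consume_after_allreduce(cudaEvent_t E, cudaStream_t consumer_stream,
                             const float *buf, int count)
{
    cudaStreamWaitEvent(consumer_stream, E, 0);
    consumer_kernel<<<(count + 255) / 256, 256, 0, consumer_stream>>>(buf, count);
}
```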

Why this matters

PyTorch and other GPU-accelerated frameworks (and communication libraries such as NCCL) rely on CUDA events to coordinate completion without stalling streams or the CPU. For process-group backends, we need the communication library to either:

  • execute the GPU work on the provided CUDA stream and let us record a reliable end-event, or
  • provide an API that records/returns a completion event itself (a hypothetical shape is sketched below).
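
For the second option, one hypothetical shape (not an existing MPICH call, purely for illustration) would be a query that returns an event the library itself records once the enqueued GPU work has finished:

```c
/* Hypothetical, for illustration only -- not an existing MPICH API:
 * return a CUDA event that becomes ready once all operations previously
 * enqueued on stream_comm have completed on the GPU. */
int MPIX_Stream_get_completion_event(MPI_Comm stream_comm, cudaEvent_t *event);
```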

Is this doable? Please let me know your thoughts!

R0n12 avatar Oct 01 '25 18:10 R0n12

I agree the event should observe the enqueued operations. I need to examine the implementation. Do you have a reproducer?

hzhou avatar Oct 09 '25 02:10 hzhou