
Communication and compute on separate Streams do not overlap

Open · garrett361 opened this issue on May 28, 2024 · 0 comments

Cross-posting this issue from ipex, in case the torch-ccl team is not aware of it.

Key issues:

  • Compute and collective communications do not overlap on Intel GPU devices
  • Collectives block the host thread rather than launching a kernel and immediately returning, as they do on NVIDIA devices (see the sketch after this list)
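
For reference, here is a minimal sketch of the kind of test that exposes both behaviors. This is my own illustration, not code from the original issue; it assumes `intel_extension_for_pytorch` and `oneccl_bindings_for_pytorch` are installed so that the `"xpu"` device and `"ccl"` backend are available. On NVIDIA hardware, the same script with `"cuda"`/`"nccl"` launches the collective without blocking the host.

```python
# Sketch (not from the issue): enqueue an all_reduce on a side stream, run
# independent matmuls on the default stream, and time the host-side launch.
import time

import torch
import torch.distributed as dist
import intel_extension_for_pytorch  # noqa: F401  registers the "xpu" device
import oneccl_bindings_for_pytorch  # noqa: F401  registers the "ccl" backend


def main() -> None:
    dist.init_process_group(backend="ccl")
    rank = dist.get_rank()
    device = torch.device("xpu", rank)
    torch.xpu.set_device(device)

    comm_tensor = torch.randn(2**26, device=device)  # ~256 MB of fp32
    a = torch.randn(4096, 4096, device=device)

    side_stream = torch.xpu.Stream(device=device)
    torch.xpu.synchronize()

    # On NVIDIA this returns in microseconds; per the traces below, on xpu it
    # blocks the host for roughly the duration of the collective.
    start = time.perf_counter()
    with torch.xpu.stream(side_stream):
        dist.all_reduce(comm_tensor)
    launch_time = time.perf_counter() - start

    # Independent compute on the default stream, which should be able to
    # overlap with the collective running on side_stream.
    for _ in range(10):
        a = a @ a
    torch.xpu.synchronize()

    print(f"rank {rank}: host-side all_reduce launch took {launch_time:.4f}s")


if __name__ == "__main__":
    main()
```

Run with e.g. `torchrun --nproc_per_node=2 repro.py`. Comparing `launch_time` across backends, or capturing the run under the PyTorch profiler, shows both behaviors at once.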

The PyTorch profiler traces highlight the issues (copied from the other thread):

A100 Trace

[image: nvidia_a100_trace]

Non-blocking kernel launch and comms/compute overlap.

Intel Max 1550 Trace

[image: intel_1550_trace]

Blocking kernel launch and no comms/compute overlap.

See the other thread for more details.

garrett361 · May 28 '24 13:05