Communication and compute on separate Streams do not overlap
Cross-posting this issue from ipex, in case the torch-ccl team is not aware of it.
Key issues:
- Compute and collective communications do not overlap on Intel GPU devices
- Collectives block the host thread instead of launching a kernel and returning immediately (as they do on NVIDIA devices); a minimal reproduction sketch follows this list
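For concreteness, here is a minimal sketch of the pattern this refers to. The `oneccl_bindings_for_pytorch` import name, the `torch.xpu` stream API, and the tensor sizes are assumptions that depend on the installed torch-ccl/IPEX versions; the point is only that the async all_reduce should return to the host immediately so the matmul on the default stream can run concurrently with it:

```python
import torch
import torch.distributed as dist
import oneccl_bindings_for_pytorch  # noqa: F401  # registers the "ccl" backend

dist.init_process_group(backend="ccl")
rank = dist.get_rank()
device = torch.device(f"xpu:{rank}")

comm_tensor = torch.randn(64 * 1024 * 1024, device=device)  # payload for the collective
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)

# Separate stream for communication, analogous to what NCCL uses on NVIDIA GPUs.
comm_stream = torch.xpu.Stream(device=device)

with torch.xpu.stream(comm_stream):
    # Expected: enqueue the collective and return immediately (non-blocking launch).
    work = dist.all_reduce(comm_tensor, async_op=True)

# Independent compute on the default stream; this should run concurrently with the
# collective on comm_stream, but on the Max 1550 it only starts once the collective
# has finished because the launch blocks the host thread.
c = a @ b

work.wait()
torch.xpu.synchronize()
```

Run with the usual rendezvous environment set (e.g. via mpirun or torchrun). The equivalent CUDA/NCCL version of this snippet is what produces the overlapping A100 trace below.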
The PyTorch profiler traces highlight the issues (copied from the other thread; a sketch of how such a trace can be captured follows the summaries):
- A100 trace: non-blocking kernel launch and comms/compute overlap.
- Intel Max 1550 trace: blocking kernel launch and no comms/compute overlap.
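For reference, a rough sketch of capturing such a trace with torch.profiler. `ProfilerActivity.XPU` only exists in recent PyTorch builds (older IPEX stacks need the legacy profiler), and `one_step` is a hypothetical stand-in for one iteration of the overlap pattern above:

```python
import torch
from torch.profiler import profile, ProfilerActivity

def one_step():
    # stand-in for one iteration of the comm/compute pattern sketched above
    x = torch.randn(4096, 4096, device="xpu")
    (x @ x).sum().item()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.XPU]) as prof:
    one_step()

prof.export_chrome_trace("xpu_trace.json")  # view in Perfetto or chrome://tracing
```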
See the other thread for more details.