Communication and compute on separate Streams do not overlap
Cross-posting this issue from ipex, in case the torch-ccl team is not aware of it.
Key issues:
- Compute and collective communications do not overlap on Intel GPU devices
- Collectives block the host thread instead of launching a kernel and returning immediately (as they do on NVIDIA devices); a minimal reproduction sketch follows this list
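For concreteness, here is a minimal sketch of the pattern this refers to. The `oneccl_bindings_for_pytorch` import name, the `torch.xpu` stream API, and the tensor sizes are assumptions that depend on the installed torch-ccl/IPEX versions; the point is only that the async all_reduce should return to the host immediately so the matmul on the default stream can run concurrently with it:

```python
import torch
import torch.distributed as dist
import oneccl_bindings_for_pytorch  # noqa: F401  # registers the "ccl" backend

dist.init_process_group(backend="ccl")
rank = dist.get_rank()
device = torch.device(f"xpu:{rank}")

comm_tensor = torch.randn(64 * 1024 * 1024, device=device)  # payload for the collective
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)

# Separate stream for communication, analogous to what NCCL uses on NVIDIA GPUs.
comm_stream = torch.xpu.Stream(device=device)

with torch.xpu.stream(comm_stream):
    # Expected: enqueue the collective and return immediately (non-blocking launch).
    work = dist.all_reduce(comm_tensor, async_op=True)

# Independent compute on the default stream; this should run concurrently with the
# collective on comm_stream, but on the Max 1550 it only starts once the collective
# has finished because the launch blocks the host thread.
c = a @ b

work.wait()
torch.xpu.synchronize()
```

Run with the usual rendezvous environment set (e.g. via mpirun or torchrun). The equivalent CUDA/NCCL version of this snippet is what produces the overlapping A100 trace below.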
The PyTorch profiler traces highlight the issues (copied from the other thread; a sketch of how such a trace can be captured follows the summaries):
- A100 trace: non-blocking kernel launch and comms/compute overlap.
- Intel Max 1550 trace: blocking kernel launch and no comms/compute overlap.
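For reference, a rough sketch of capturing such a trace with torch.profiler. `ProfilerActivity.XPU` only exists in recent PyTorch builds (older IPEX stacks need the legacy profiler), and `one_step` is a hypothetical stand-in for one iteration of the overlap pattern above:

```python
import torch
from torch.profiler import profile, ProfilerActivity

def one_step():
    # stand-in for one iteration of the comm/compute pattern sketched above
    x = torch.randn(4096, 4096, device="xpu")
    (x @ x).sum().item()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.XPU]) as prof:
    one_step()

prof.export_chrome_trace("xpu_trace.json")  # view in Perfetto or chrome://tracing
```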
See the other thread for more details.