ucc
CUDA: support for lazy init
What
Lazily initialize TL NCCL and TL CUDA on first CUDA collective.
Why?
Both TL NCCL and TL CUDA require the CUDA device to be set before team creation. In MPI workloads this is not always possible: MPI_Init creates the UCC team, but setting the device requires the rank and local rank, which the application typically learns only after MPI_Init returns.
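The ordering problem and the lazy approach can be illustrated with a minimal, self-contained sketch in plain C. Everything in it is hypothetical and for illustration only: `lazy_team_t`, `select_device`, `team_create`, `team_lazy_init` and `collective_post` are stand-ins, not UCC, NCCL or CUDA APIs, and the integer `current_device` merely models the device an application would select. It does not reflect the actual code in this PR.

```c
#include <stddef.h>

/* Hypothetical model of the process-wide device selection.
 * -1 means the application has not selected a device yet. */
int current_device = -1;

/* Hypothetical team object: creation is cheap, real setup is deferred. */
typedef struct {
    int initialized; /* 0 until the first collective triggers setup */
    int device;      /* device captured when setup actually runs */
} lazy_team_t;

/* Stand-in for the application selecting its device once it knows
 * its local rank. */
void select_device(int dev)
{
    current_device = dev;
}

/* Team creation only records state; it does not need the device. */
void team_create(lazy_team_t *team)
{
    team->initialized = 0;
    team->device      = -1;
}

/* Deferred setup: runs at most once, on the first collective.
 * Returns 0 on success, -1 if no device has been selected yet. */
int team_lazy_init(lazy_team_t *team)
{
    if (team->initialized) {
        return 0; /* already set up, nothing to do */
    }
    if (current_device < 0) {
        return -1; /* device still unknown, cannot initialize */
    }
    team->device      = current_device;
    team->initialized = 1;
    return 0;
}

/* Every collective entry point goes through the lazy-init check. */
int collective_post(lazy_team_t *team)
{
    if (team_lazy_init(team) != 0) {
        return -1;
    }
    /* ... the collective itself would be posted here ... */
    return 0;
}
```

In this sketch, `team_create` can run before any device is selected (as it would inside MPI_Init), the application then calls `select_device` once it knows its local rank, and the first `collective_post` performs the deferred setup. Later device changes do not affect an already initialized team.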