
Poor performance with NVLink

Open · froody opened this issue · 2 comments

I ran some benchmarks with torch-ucc using xccl for collectives and noticed much worse performance than NCCL. See the numbers here: https://gist.github.com/froody/a86a5b2c5d9f46aedba7e930f4b4e08d

It's possible this is due to a misconfiguration: I built xccl with CUDA and UCX support, but without SHARP or VMC support. My question is: is xccl expected to properly utilize NVLink when it is available (in this case on a DGX-1 doing an all-reduce across all 8 GPUs)?

I also noticed while running the benchmarks that CPU utilization was very high for all workers, which seemed to be due to high-frequency polling.

Also, as you can see in the output, ucc fails when trying to reduce a 2 GB tensor, whereas nccl fails when trying to reduce an 8 GB tensor. This could be indicative of a leak somewhere.

Repro steps: run the benchmark here: https://gist.github.com/froody/01ed6ce8d6ab72bd868431d793591379 with BACKEND=ucc or BACKEND=nccl to select the backend.
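For reference, a minimal sketch of what such a benchmark looks like (this is not the gist's script; the tensor size, the env:// launcher variables, and the assumption that importing torch_ucc registers a "ucc" backend are all illustrative):

```python
# Minimal sketch, assuming an env:// launcher that sets
# MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE, and assuming that importing
# torch_ucc registers the "ucc" backend. Size and timing are illustrative.
import os
import time
import torch
import torch.distributed as dist

backend = os.environ.get("BACKEND", "nccl")
if backend == "ucc":
    import torch_ucc  # noqa: F401  (assumed to register the "ucc" backend)

dist.init_process_group(backend=backend)
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

tensor = torch.ones(256 * 1024 * 1024, device="cuda")  # 1 GiB of fp32
torch.cuda.synchronize()
start = time.time()
dist.all_reduce(tensor)  # defaults to SUM across all ranks
torch.cuda.synchronize()
if dist.get_rank() == 0:
    print(f"{backend}: all-reduce of 1 GiB took {time.time() - start:.4f}s")
```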

hardware: DGX-1
driver: 418.116.00
cuda: 10.1
pytorch: 1.6.0
ucx: 1.9.0
torch-ucc: a277d7da24ae6e8a40bda658d0f0d4e06fcadb8b
xccl: 2e97986fa14ee2538c6ffc577bb75d7434755935

froody · Sep 24 '20 21:09

Is affinitizing each MPI rank to a GPU expected to help?

srinivas212 · Sep 26 '20 01:09

Do you mean torch.cuda.set_device()? If so, then yes. I also changed torch_ucc to use cudaGetDevice in ProcessGroupUCC::progress_loop instead of hard-coding device 0.
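A rough sketch of the per-rank affinitization (the LOCAL_RANK variable name depends on the launcher and is assumed here; this illustrates the idea, not the exact torch_ucc fix):

```python
# Sketch: pin each rank to its own GPU *before* creating the process group,
# so the progress thread polls the rank's device rather than device 0.
# LOCAL_RANK is launcher-dependent and assumed here for illustration.
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)        # affinitize this rank to one GPU
dist.init_process_group(backend="ucc")   # assumes torch_ucc registered "ucc"
```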

froody · Sep 28 '20 19:09