
[Improvement] Allow multiple CCL inits from same process but from different threads

Open umamaheswararao opened this issue 4 years ago • 3 comments

Currently we can initialize multiple XGBoost Rabit instances from the same process but from different threads. In Spark, it is possible to have multiple tasks run on the same executor. An executor is a single JVM process, and the multiple tasks each run in a separate thread.

This is not a very critical requirement for us at this stage, but users can run that way, since Spark allows it. So it would be good to support initializing oneCCL from multiple threads.

umamaheswararao avatar Feb 21 '20 22:02 umamaheswararao
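
For illustration, here is a minimal sketch of the kind of per-thread initialization being asked for, using the public oneCCL C++ API. Whether `ccl::create_communicator` may be called concurrently from several threads of one process is exactly the behavior in question; the thread count and buffer sizes are placeholders.

```cpp
// Minimal sketch (not working guidance): one process, several task threads,
// each thread asking for its own CCL rank via a shared in-process KVS.
#include <thread>
#include <vector>
#include "oneapi/ccl.hpp"

int main() {
    const int size = 4;                    // e.g. 4 Spark tasks in one executor JVM
    ccl::init();                           // once per process
    auto kvs = ccl::create_main_kvs();     // shared in-process, no address exchange needed

    std::vector<std::thread> tasks;
    for (int rank = 0; rank < size; ++rank) {
        tasks.emplace_back([&, rank] {
            // Requested behavior: each task thread owns one rank/communicator.
            auto comm = ccl::create_communicator(size, rank, kvs);
            std::vector<float> grad(1024, 1.0f), sum(1024);
            ccl::allreduce(grad.data(), sum.data(), grad.size(),
                           ccl::reduction::sum, comm).wait();
        });
    }
    for (auto& t : tasks) t.join();
    return 0;
}
```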

This is also an issue for us. We have a neural network "one thread, one worker" model that we want to keep consistent with our GPU implementation (based on NCCL). Neural network toolkits usually perform data-level parallelism, and when running on the CPU we use each CPU thread as a single worker.

Normally, on each machine we start a single process that initializes all the available CPU threads (or GPUs) as workers, and we use a communication backend to exchange gradients. NCCL allows us to have multiple GPUs attached to the same process as workers, but we can't achieve that with oneCCL/MPI, as they work at the process level and are oblivious to any threading within the process.

As a workaround with oneCCL, we have to start N separate processes on each machine (where N is the number of CPU threads) and use oneCCL to handle communication that could otherwise be done in-process.

We would rather not rewrite our full communication model in order to make use of the more efficient in-process communication.

XapaJIaMnu avatar Mar 17 '21 18:03 XapaJIaMnu
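
For context, the NCCL pattern referred to above (one process owning several devices, each device acting as a rank) looks roughly like this; the device count and device list are illustrative only.

```cpp
// Rough illustration of the single-process, multi-device NCCL setup
// mentioned above; device list and count are illustrative only.
#include <nccl.h>

int main() {
    const int nDev = 4;
    int devs[4] = {0, 1, 2, 3};
    ncclComm_t comms[4];

    // One process owns all four GPUs; each GPU becomes one NCCL rank.
    ncclCommInitAll(comms, nDev, devs);

    // ... per-device ncclAllReduce calls, usually wrapped in
    // ncclGroupStart()/ncclGroupEnd(), would go here ...

    for (int i = 0; i < nDev; ++i) ncclCommDestroy(comms[i]);
    return 0;
}
```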

@XapaJIaMnu - CCL relies on multiple threads (CCL workers) to parallelize the communication of the current process. Mapping each CPU core to a CCL rank (regardless of whether all ranks live in a single process or each rank in a separate process) leaves no core budget for CCL workers. The typical scenario for training on CPU (e.g. with PyTorch/DDP or with Horovod) is one process per CPU socket, with the cores split between compute and communication threads. The whole CPU socket is thus treated as a single compute device and mapped to a single CCL rank.

mshiryaev avatar Mar 17 '21 21:03 mshiryaev
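
To make the described deployment concrete, a hedged sketch of the one-rank-per-socket model: the environment variables are the documented oneCCL/OpenMP knobs, but the specific values and the `train_rank` helper are illustrative, and the KVS address exchange between processes is omitted.

```cpp
// One process per CPU socket, one CCL rank per process; the socket's cores
// are split between compute (OpenMP) and communication (CCL workers), e.g.:
//   CCL_WORKER_COUNT=4        -> cores reserved for CCL worker threads
//   CCL_WORKER_AFFINITY=auto  -> or an explicit core list
//   OMP_NUM_THREADS=24        -> remaining cores of the socket do the compute
#include <vector>
#include "oneapi/ccl.hpp"

// Illustrative helper: `kvs` is assumed to be created on rank 0 and its
// address distributed to the other processes out of band (e.g. via MPI).
void train_rank(int size, int rank, ccl::shared_ptr_class<ccl::kvs> kvs) {
    // The whole socket acts as one compute device behind this single rank.
    auto comm = ccl::create_communicator(size, rank, kvs);
    std::vector<float> grads(1 << 20, 0.0f), reduced(1 << 20);
    // Compute threads fill `grads`; CCL worker threads then perform the
    // gradient allreduce across sockets/nodes.
    ccl::allreduce(grads.data(), reduced.data(), grads.size(),
                   ccl::reduction::sum, comm).wait();
}
```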

@mshiryaev our use case is machine translation models based on the transformer architecture. Our input sequences are long, our models are big, our computational dependencies are strictly sequential, and our matrices are not large enough to benefit from OMP parallelism. Typically, for a single mini-batch a worker spends multiple seconds (or even minutes with very large mini-batches) computing the forward and backward passes, and then only a fraction of a second communicating the gradients. For this scenario it is wasteful not to allocate every single CPU core to an individual worker.

XapaJIaMnu avatar Mar 18 '21 13:03 XapaJIaMnu