
[CUDA] Multi-GPU and distributed training for new CUDA version.

Open shiyu1994 opened this issue 3 years ago • 1 comment

Summary

Add multi-GPU and distributed training support for the new CUDA version, as mentioned in https://github.com/microsoft/LightGBM/pull/4630#discussion_r825489849.

shiyu1994 • Mar 15 '22 07:03

Hi @shiyu1994, I see that the latest 4.5.0 still has no support for multi-GPU training: if users try it, they get the error "Currently cuda version only supports training on a single GPU". I then found the nccl-dev branch you're working on. When do you expect to release it? In any case, I tried to build and run it in my environment (a minimal sketch of my training call follows the details below):

GPU: NVIDIA Tesla P100, Driver Version: 470.82.01, CUDA Version: 11.4
NCCL: nccl_2.11.4-1+cuda11.4_x86_64
OS: Debian bullseye
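
For reference, here is roughly how I invoke training. This is a minimal sketch with synthetic stand-in data; num_gpus is the parameter from the nccl-dev branch, the rest are standard LightGBM parameters:

import numpy as np
import lightgbm as lgb

# Synthetic data, just enough to exercise the trainer.
X = np.random.rand(100_000, 50).astype(np.float32)
y = np.random.rand(100_000)

params = {
    "objective": "regression",
    "device_type": "cuda",  # the new CUDA implementation
    "num_gpus": 2,          # nccl-dev branch parameter; 1 works, 2 hangs
    "verbosity": 1,
}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)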

When I set num_gpus = 1, it works well. But when I set it to 2, both GPUs allocate memory, yet only one runs at 100% utilization, and training never seems to finish (a small utilization-polling sketch follows the nccl-test output below). Any clue what's going on? From the output of nccl-test, the GPUs are connected fine:

$ build/all_reduce_perf -b 2m -e 100m -f 10 -g2
# nThread 1 nGpus 2 minBytes 2097152 maxBytes 104857600 step: 10(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid    294 on dlc1ovtssn2jsl15-master-0 device  0 [0x00] Tesla P100-PCIE-16GB
#  Rank  1 Group  0 Pid    294 on dlc1ovtssn2jsl15-master-0 device  1 [0x00] Tesla P100-PCIE-16GB
NCCL version 2.11.4+cuda11.4
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
     2097152        524288     float     sum      -1    539.7    3.89    3.89      0    537.7    3.90    3.90      0
    20971520       5242880     float     sum      -1   5473.1    3.83    3.83      0   5470.2    3.83    3.83      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 3.86278 
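
To confirm the utilization imbalance, I polled both GPUs from a separate process while training was stuck. A quick sketch using pynvml (assuming device indices 0 and 1 map to the two P100s above):

import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(2)]
for _ in range(30):
    # nvmlDeviceGetUtilizationRates reports compute utilization in percent.
    gpu_utils = [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles]
    print("GPU utilization:", gpu_utils)  # one GPU stays near 100%, the other at 0%
    time.sleep(1)
pynvml.nvmlShutdown()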

And here are the NCCL logs:

[0] NCCL INFO Bootstrap : Using eth0:10.224.144.56<0>
[0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[0] NCCL INFO Failed to open libibverbs.so[.1]
[0] NCCL INFO NET/Socket : Using [0]eth0:10.224.144.56<0> [1]eth1:10.252.7.41<0>
[0] NCCL INFO Using network Socket
[0] NCCL INFO NCCL version 2.11.4+cuda11.4
[0] NCCL INFO Channel 00/02 : 0 1
[0] NCCL INFO Channel 01/02 : 0 1
[1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
[0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
[0] NCCL INFO Channel 00 : 0[80] -> 1[90] via direct shared memory
[1] NCCL INFO Channel 00 : 1[90] -> 0[80] via direct shared memory
[0] NCCL INFO Channel 01 : 0[80] -> 1[90] via direct shared memory
[1] NCCL INFO Channel 01 : 1[90] -> 0[80] via direct shared memory
[0] NCCL INFO Connected all rings
[1] NCCL INFO Connected all rings
[0] NCCL INFO Connected all trees
[1] NCCL INFO Connected all trees
[1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
[1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
[0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
[0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
[0] NCCL INFO comm 0x7f2d44054230 rank 0 nranks 2 cudaDev 0 busId 80 - Init COMPLETE
[1] NCCL INFO comm 0x7f2d24056d60 rank 1 nranks 2 cudaDev 1 busId 90 - Init COMPLETE
[0] NCCL INFO Launch mode Parallel

And the full stack trace of the hung process looks like this:

#0  0x00007ffff7fd0abc in clock_gettime ()
#1  0x00007ffff7981121 in __GI___clock_gettime (clock_id=4, tp=0x7fffffffcd10) at ../sysdeps/unix/sysv/linux/clock_gettime.c:38
#2  0x00007fff0aa0b0af in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007fff0a9310a3 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007fff0a8d31cf in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5  0x00007fff0a8d4818 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#6  0x00007fff0a9c096a in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#7  0x00007fff855517c9 in __cudart1044 () from /usr/local/lib/python3.11/site-packages/lightgbm/lib/lib_lightgbm.so
#8  0x00007fff855862e5 in cudaDeviceSynchronize () from /usr/local/lib/python3.11/site-packages/lightgbm/lib/lib_lightgbm.so
#9  0x00007fff8532a92b in LightGBM::SynchronizeCUDADevice (file=0x7fff855ea2d8 "/mnt/workspace/lgbm-cuda-install/LightGBM/src/io/cuda/cuda_tree.cu", line=454)
    at /mnt/workspace/lgbm-cuda-install/LightGBM/src/cuda/cuda_utils.cpp:13
#10 0x00007fff853469d3 in LightGBM::CUDATree::LaunchAddPredictionToScoreKernel (this=0x7ffef4fdb8b0, data=0x55555c62caa0, used_data_indices=0x0, num_data=100, score=0x7ffeb3f64200)
    at /mnt/workspace/lgbm-cuda-install/LightGBM/src/io/cuda/cuda_tree.cu:454
#11 0x00007fff85344c90 in LightGBM::CUDATree::AddPredictionToScore (this=0x7ffef4fdb8b0, data=0x55555c62caa0, num_data=100, score=0x7ffeb3f64200)
    at /mnt/workspace/lgbm-cuda-install/LightGBM/src/io/cuda/cuda_tree.cpp:256
#12 0x00007fff85316f21 in LightGBM::CUDAScoreUpdater::AddScore (this=0x555569b3ea50, tree=0x7ffef4fdb8b0, cur_tree_id=0)
    at /mnt/workspace/lgbm-cuda-install/LightGBM/src/boosting/cuda/cuda_score_updater.cpp:58
#13 0x00007fff8531ca43 in LightGBM::NCCLGBDT<LightGBM::GBDT>::TrainOneIter(float const*, float const*)::{lambda(LightGBM::NCCLGBDTComponent*)#4}::operator()(LightGBM::NCCLGBDTComponent*) const (
    this=0x55555b24c200, thread_data=0x5555582d6cc0) at /mnt/workspace/lgbm-cuda-install/LightGBM/src/boosting/cuda/nccl_gbdt.cpp:129
#14 0x00007fff85322216 in std::__invoke_impl<void, LightGBM::NCCLGBDT<LightGBM::GBDT>::TrainOneIter(float const*, float const*)::{lambda(LightGBM::NCCLGBDTComponent*)#4}&, LightGBM::NCCLGBDTComponent*>(std::__invoke_other, LightGBM::NCCLGBDT<LightGBM::GBDT>::TrainOneIter(float const*, float const*)::{lambda(LightGBM::NCCLGBDTComponent*)#4}&, LightGBM::NCCLGBDTComponent*&&) (__f=...)
    at /usr/include/c++/10/bits/invoke.h:60
#15 0x00007fff85320dc6 in std::__invoke_r<void, LightGBM::NCCLGBDT<LightGBM::GBDT>::TrainOneIter(float const*, float const*)::{lambda(LightGBM::NCCLGBDTComponent*)#4}&, LightGBM::NCCLGBDTComponent*>(LightGBM::NCCLGBDT<LightGBM::GBDT>::TrainOneIter(float const*, float const*)::{lambda(LightGBM::NCCLGBDTComponent*)#4}&, LightGBM::NCCLGBDTComponent*&&) (__fn=...) at /usr/include/c++/10/bits/invoke.h:153
#16 0x00007fff8531f6b7 in std::_Function_handler<void (LightGBM::NCCLGBDTComponent*), LightGBM::NCCLGBDT<LightGBM::GBDT>::TrainOneIter(float const*, float const*)::{lambda(LightGBM::NCCLGBDTComponent*)#4}>::_M_invoke(std::_Any_data const&, LightGBM::NCCLGBDTComponent*&&) (__functor=..., __args#0=@0x7fffffffd4d0: 0x5555582d6cc0) at /usr/include/c++/10/bits/std_function.h:291
#17 0x00007fff8531f161 in std::function<void (LightGBM::NCCLGBDTComponent*)>::operator()(LightGBM::NCCLGBDTComponent*) const (this=0x7fffffffd670, __args#0=0x5555582d6cc0)
    at /usr/include/c++/10/bits/std_function.h:622
#18 0x00007fff8531e1f5 in LightGBM::NCCLTopology::RunOnMasterDevice<LightGBM::NCCLGBDTComponent, void>(std::vector<std::unique_ptr<LightGBM::NCCLGBDTComponent, std::default_delete<LightGBM::NCCLGBDTComponent> >, std::allocator<std::unique_ptr<LightGBM::NCCLGBDTComponent, std::default_delete<LightGBM::NCCLGBDTComponent> > > > const&, std::function<void (LightGBM::NCCLGBDTComponent*)> const&) (this=0x555558d53cb0, 
    objs=std::vector of length 2, capacity 2 = {...}, func=...) at /mnt/workspace/lgbm-cuda-install/LightGBM/include/LightGBM/cuda/cuda_nccl_topology.hpp:185
#19 0x00007fff8531b30b in LightGBM::NCCLGBDT<LightGBM::GBDT>::TrainOneIter (this=0x55555b24c200, gradients=0x0, hessians=0x0) at /mnt/workspace/lgbm-cuda-install/LightGBM/src/boosting/cuda/nccl_gbdt.cpp:124
#20 0x00007fff84ce7b4e in LightGBM::Booster::TrainOneIter (this=0x55555b6fe030) at /mnt/workspace/lgbm-cuda-install/LightGBM/src/c_api.cpp:407
#21 0x00007fff84cd3f70 in LGBM_BoosterUpdateOneIter (handle=0x55555b6fe030, is_finished=0x7fff82b14130) at /mnt/workspace/lgbm-cuda-install/LightGBM/src/c_api.cpp:2070
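
For completeness, dropping back to a single GPU is the only change needed to make training complete again in the same environment (same sketch as above, only num_gpus changed):

params = {
    "objective": "regression",
    "device_type": "cuda",
    "num_gpus": 1,  # single-GPU CUDA training works fine; 2 hangs as described above
    "verbosity": 1,
}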

flybywind • Sep 20 '24 10:09