
[Bug]: Multi GPU training not working

[Open] mhciwnzoq opened this issue 3 months ago • 8 comments

Project Version

3.5.0

Platform and OS Version

Linux Ubuntu 22.0

Affected Devices

Tried it on Linux only

Existing Issues

No response

What happened?

Multi-GPU training doesn't work. Single-GPU training works just fine.

GPUs: 2 × RTX 5090 32 GB
CPU: 48 cores
RAM: 128 GB
Python: 3.12.3

Steps to reproduce

  1. Train a model, starting from a pretrained one, using multiple GPUs
  2. Start training. It crashes.
  3. Switch to a single GPU. Start training. It works.

Expected behavior

Multi-GPU training works.

Attachments

No response

Screenshots or Videos

No response

Additional Information

No response

mhciwnzoq · Sep 16 '25 13:09

> It crashes.

Details?

AznamirWoW · Sep 16 '25 13:09

Unfortunately, I no longer have access to that environment. The dataset I was training on was large. I've now spun up a new environment and tried 2 GPUs with a small file. The training never starts; it's stuck like this forever. If I change the GPU config from 0-1 to 0, it starts training. Same GPUs as in the ticket.

[Screenshot: training stuck before starting]

mhciwnzoq · Sep 16 '25 18:09

Alright, it took a while, but it eventually happened with the small dataset as well. Here it is:

Initializing the generator with 109 speakers.
Initializing the generator with 109 speakers.
Using HiFi-GAN vocoder
Using HiFi-GAN vocoder
Using AdamW optimizer
Using AdamW optimizer
Using Single-Scale Mel loss function
Using Single-Scale Mel loss function
Process Process-1:
Process Process-2:
Traceback (most recent call last):
  File "/root/.local/share/uv/python/cpython-3.11.13-linux-x86_64-gnu/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/root/.local/share/uv/python/cpython-3.11.13-linux-x86_64-gnu/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace/Applio/rvc/train/train.py", line 467, in run
    net_g = DDP(net_g, device_ids=[device_id])
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/Applio/.venv/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 850, in __init__
    self._ddp_init_helper(
  File "/workspace/Applio/.venv/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1201, in _ddp_init_helper
    self.reducer = dist.Reducer(
                   ^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Traceback (most recent call last):
  File "/root/.local/share/uv/python/cpython-3.11.13-linux-x86_64-gnu/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/root/.local/share/uv/python/cpython-3.11.13-linux-x86_64-gnu/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace/Applio/rvc/train/train.py", line 467, in run
    net_g = DDP(net_g, device_ids=[device_id])
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/Applio/.venv/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 850, in __init__
    self._ddp_init_helper(
  File "/workspace/Applio/.venv/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1201, in _ddp_init_helper
    self.reducer = dist.Reducer(
                   ^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[rank1]:[E916 18:57:50.718714126 ProcessGroupNCCL.cpp:1899] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7576f84c75e8 in /workspace/Applio/.venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7576f845c4a2 in /workspace/Applio/.venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7576f85a5422 in /workspace/Applio/.venv/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7575e8d405a6 in /workspace/Applio/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7575e8d50840 in /workspace/Applio/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x782 (0x7575e8d523d2 in /workspace/Applio/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7575e8d53fdd in /workspace/Applio/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xecdb4 (0x75765326edb4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x7576fd21caa4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: __clone + 0x44 (0x7576fd2a9a34 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7576f84c75e8 in /workspace/Applio/.venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7576f845c4a2 in /workspace/Applio/.venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7576f85a5422 in /workspace/Applio/.venv/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7575e8d405a6 in /workspace/Applio/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7575e8d50840 in /workspace/Applio/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x782 (0x7575e8d523d2 in /workspace/Applio/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7575e8d53fdd in /workspace/Applio/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xecdb4 (0x75765326edb4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x7576fd21caa4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: __clone + 0x44 (0x7576fd2a9a34 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1905 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7576f84c75e8 in /workspace/Applio/.venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xcc7b9e (0x7575e8d22b9e in /workspace/Applio/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x9165ed (0x7575e89715ed in /workspace/Applio/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xecdb4 (0x75765326edb4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #4: <unknown function> + 0x9caa4 (0x7576fd21caa4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #5: __clone + 0x44 (0x7576fd2a9a34 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank0]:[E916 18:57:50.729571251 ProcessGroupNCCL.cpp:1899] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7333547785e8 in /workspace/Applio/.venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x73335470d4a2 in /workspace/Applio/.venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x733355071422 in /workspace/Applio/.venv/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x733244f405a6 in /workspace/Applio/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x733244f50840 in /workspace/Applio/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x782 (0x733244f523d2 in /workspace/Applio/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x733244f53fdd in /workspace/Applio/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xecdb4 (0x7332af46edb4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x733359475aa4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: __clone + 0x44 (0x733359502a34 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7333547785e8 in /workspace/Applio/.venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x73335470d4a2 in /workspace/Applio/.venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x733355071422 in /workspace/Applio/.venv/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x733244f405a6 in /workspace/Applio/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x733244f50840 in /workspace/Applio/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x782 (0x733244f523d2 in /workspace/Applio/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x733244f53fdd in /workspace/Applio/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xecdb4 (0x7332af46edb4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x733359475aa4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: __clone + 0x44 (0x733359502a34 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1905 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7333547785e8 in /workspace/Applio/.venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xcc7b9e (0x733244f22b9e in /workspace/Applio/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x9165ed (0x733244b715ed in /workspace/Applio/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xecdb4 (0x7332af46edb4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #4: <unknown function> + 0x9caa4 (0x733359475aa4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #5: __clone + 0x44 (0x733359502a34 in /usr/lib/x86_64-linux-gnu/libc.so.6)

mhciwnzoq · Sep 16 '25 19:09

Since it crashes before training even starts, just by attempting to wrap the generator model in DDP, it is not an Applio issue.

https://forums.developer.nvidia.com/t/cudamemset-illegal-memory-access-with-rtx5090-with-570-86-16/323770/21?page=2
https://github.com/pytorch/pytorch/issues/152780
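
To confirm this outside Applio, a minimal standalone DDP sketch along these lines could be used (this is only an illustration, not Applio code; if it also hits the illegal memory access on the 2× 5090 box, the fault is in the driver/NCCL stack rather than in rvc/train/train.py):

```python
# Hypothetical standalone repro, independent of Applio: wrapping a tiny model
# in DDP exercises the same reducer setup that crashes at train.py line 467.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(64, 64).cuda(rank)
    ddp_model = DDP(model, device_ids=[rank])  # the step that fails in the report

    # One forward/backward so the reducer actually communicates across GPUs.
    ddp_model(torch.randn(8, 64, device=rank)).sum().backward()
    print(f"rank {rank}: DDP wrap and backward OK")
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```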

AznamirWoW · Sep 16 '25 19:09

Try `python -m pip install "nvidia-nccl-cu12>2.26.2"` (quote the spec so the shell doesn't treat `>` as a redirect), as recommended in the PyTorch ticket.
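
To check which NCCL build PyTorch actually loads after that install (assuming a standard pip environment), something like this should do:

```python
# Print the torch/CUDA/NCCL versions and the visible devices, to confirm the
# NCCL upgrade really took effect in the training environment.
import torch

print("torch:", torch.__version__)
print("cuda:", torch.version.cuda)
print("nccl:", torch.cuda.nccl.version())
print("devices:", [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])
```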

AznamirWoW · Sep 16 '25 19:09

I tried multi-GPU training on Kaggle 2 weeks ago and everything was fine.

blaisewf · Sep 16 '25 19:09

Tried installing the nightly CUDA build.

Tried `python -m pip install "nvidia-nccl-cu12>2.26.2"` as well.

The issue persisted.

The funny thing is that extraction works fine on dual GPUs; training for some reason doesn't.

@blaisewf I'm on the latest version, which was released 3 days ago. Could be something introduced in 3.5.

mhciwnzoq · Sep 16 '25 20:09

Extraction works because it does not use NCCL; training fails because something is broken in the NVIDIA stack.
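
A bare NCCL collective (no model, no Applio) can narrow this down further. The script below is only a sketch, and `NCCL_P2P_DISABLE=1` is a diagnostic toggle that sometimes helps with illegal-memory-access errors on consumer cards, not a confirmed fix for this setup:

```python
# NCCL-only sanity check: if this fails the same way, the driver/NCCL/P2P layer
# is broken and no training code change in Applio will help.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    # os.environ["NCCL_P2P_DISABLE"] = "1"  # uncomment to rule out peer-to-peer issues
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    t = torch.ones(1024, device=rank)
    dist.all_reduce(t)  # same NCCL path DDP's reducer uses
    assert int(t[0].item()) == world_size
    print(f"rank {rank}: all_reduce OK")
    dist.destroy_process_group()


if __name__ == "__main__":
    n = torch.cuda.device_count()
    mp.spawn(worker, args=(n,), nprocs=n)
```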

AznamirWoW · Sep 17 '25 03:09