Retrieval-based-Voice-Conversion-WebUI
[Bug] RuntimeError: CUDA error: an illegal instruction was encountered
Environment: Ubuntu, 2x RTX 3060 (12 GB each), latest version of RVC (commit c4a1810).
During training, after a few epochs complete, a CUDA error is thrown:
INFO:user-test-3:====> Epoch: 1
INFO:user-test-3:Train Epoch: 2 [11%]
INFO:user-test-3:[200, 9.99875e-05]
INFO:user-test-3:loss_disc=3.124, loss_gen=2.644, loss_fm=8.702,loss_mel=19.773, loss_kl=1.555
INFO:user-test-3:====> Epoch: 2
INFO:user-test-3:Train Epoch: 3 [22%]
INFO:user-test-3:[400, 9.99750015625e-05]
INFO:user-test-3:loss_disc=3.009, loss_gen=2.687, loss_fm=8.580,loss_mel=19.066, loss_kl=1.653
INFO:user-test-3:====> Epoch: 3
INFO:user-test-3:Train Epoch: 4 [33%]
INFO:user-test-3:[600, 9.996250468730469e-05]
INFO:user-test-3:loss_disc=3.033, loss_gen=2.489, loss_fm=7.798,loss_mel=18.964, loss_kl=1.770
INFO:user-test-3:====> Epoch: 4
INFO:user-test-3:Train Epoch: 5 [44%]
INFO:user-test-3:[800, 9.995000937421877e-05]
INFO:user-test-3:loss_disc=2.957, loss_gen=2.745, loss_fm=7.675,loss_mel=18.730, loss_kl=1.756
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal instruction was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f37fab9e4d7 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f37fab6836b in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f38008b6fa8 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0xdf9d4e (0x7f378a7f9d4e in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x4ccea6 (0x7f37c90ccea6 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x3ee77 (0x7f37fab83e77 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1be (0x7f37fab7c69e in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f37fab7c7b9 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x752458 (0x7f37c9352458 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x305 (0x7f37c93527e5 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x12c1dc (0x55db69db61dc in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #11: <unknown function> + 0x154b6f (0x55db69ddeb6f in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #12: <unknown function> + 0x167367 (0x55db69df1367 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #13: <unknown function> + 0x167394 (0x55db69df1394 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #14: <unknown function> + 0x167394 (0x55db69df1394 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #15: <unknown function> + 0x171a2c (0x55db69dfba2c in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #16: <unknown function> + 0x132719 (0x55db69dbc719 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #17: <unknown function> + 0x272015 (0x55db69efc015 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #18: _PyEval_EvalFrameDefault + 0x5ae7 (0x55db69dd79e7 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #19: _PyFunction_Vectorcall + 0x79 (0x55db69de7ff9 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #20: _PyEval_EvalFrameDefault + 0x8c2 (0x55db69dd27c2 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #21: _PyFunction_Vectorcall + 0x79 (0x55db69de7ff9 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #22: _PyEval_EvalFrameDefault + 0x6d0 (0x55db69dd25d0 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #23: _PyFunction_Vectorcall + 0x79 (0x55db69de7ff9 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #24: _PyEval_EvalFrameDefault + 0x197b (0x55db69dd387b in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #25: <unknown function> + 0x144cb4 (0x55db69dcecb4 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #26: PyEval_EvalCode + 0x86 (0x55db69ebb266 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #27: <unknown function> + 0x25d497 (0x55db69ee7497 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #28: <unknown function> + 0x25645e (0x55db69ee045e in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #29: PyRun_StringFlags + 0x81 (0x55db69ed8a71 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #30: PyRun_SimpleStringFlags + 0x3c (0x55db69ed894c in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #31: Py_RunMain + 0x377 (0x55db69ed7ae7 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #32: Py_BytesMain + 0x2b (0x55db69eaf38b in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #33: <unknown function> + 0x23510 (0x7f38a3823510 in /lib/x86_64-linux-gnu/libc.so.6)
frame #34: __libc_start_main + 0x89 (0x7f38a38235c9 in /lib/x86_64-linux-gnu/libc.so.6)
frame #35: _start + 0x25 (0x55db69eaf285 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
Traceback (most recent call last):
File "/home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/train_nsf_sim_cache_sid_load_pretrain.py", line 534, in <module>
main()
File "/home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/train_nsf_sim_cache_sid_load_pretrain.py", line 50, in main
mp.spawn(
File "/home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 239, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/train_nsf_sim_cache_sid_load_pretrain.py", line 202, in run
train_and_evaluate(
File "/home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/train_nsf_sim_cache_sid_load_pretrain.py", line 389, in train_and_evaluate
wave = commons.slice_segments(
File "/home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/infer_pack/commons.py", line 49, in slice_segments
ret[i] = x[i, :, idx_str:idx_end]
RuntimeError: CUDA error: an illegal instruction was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
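Since the message above warns that the stack trace may be inaccurate and suggests CUDA_LAUNCH_BLOCKING=1, here is a minimal sketch of launching a script with that variable set. The helper names `build_debug_env` and `build_command` are hypothetical, and the training script's command-line arguments are not shown in this thread, so they are left out.

```python
import os
import sys


def build_debug_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the Python
    # traceback is more likely to point at the call that actually failed
    # (training runs slower while it is set).
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def build_command(script, extra_args=()):
    """Build the argv list for running a script with the current interpreter."""
    return [sys.executable, script, *extra_args]


if __name__ == "__main__":
    # Hypothetical invocation: the script name is taken from the traceback,
    # but its real arguments are unknown here and therefore omitted.
    cmd = build_command("train_nsf_sim_cache_sid_load_pretrain.py")
    env = build_debug_env()
    print("Would run:", " ".join(cmd))
    print("CUDA_LAUNCH_BLOCKING =", env["CUDA_LAUNCH_BLOCKING"])
    # To actually launch: subprocess.run(cmd, env=env, check=True)
```

The variable has to be in the environment before the process initializes CUDA, which is why the sketch sets it for a child process rather than inside the training code.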
After a little digging, I found that other people online have hit this bug with dual-GPU setups. I disabled dual-GPU training (training on one GPU only), and that seemed to work until it hit the error associated with #167:
INFO:user-test-3:[14000, 9.976276699833672e-05]
INFO:user-test-3:loss_disc=2.972, loss_gen=3.147, loss_fm=8.040,loss_mel=17.520, loss_kl=0.971
INFO:user-test-3:Saving model and optimizer state at epoch 20 to ./logs/user-test-3/G_14060.pth
INFO:user-test-3:Saving model and optimizer state at epoch 20 to ./logs/user-test-3/D_14060.pth
INFO:user-test-3:====> Epoch: 20
INFO:user-test-3:Training is done. The program is closed.
INFO:user-test-3:saving final ckpt:Success.
Traceback (most recent call last):
File "/home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/train_nsf_sim_cache_sid_load_pretrain.py", line 534, in
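One generic way to try the single-GPU workaround described above is to hide the second card from the process with CUDA_VISIBLE_DEVICES. This is only a sketch: the helper `single_gpu_env` is hypothetical, and it assumes the training script does not set CUDA_VISIBLE_DEVICES itself, which this thread does not confirm.

```python
import os


def single_gpu_env(gpu_index, base_env=None):
    """Return a copy of the environment that exposes only one GPU to CUDA."""
    if gpu_index < 0:
        raise ValueError("gpu_index must be non-negative")
    env = dict(os.environ if base_env is None else base_env)
    # CUDA_VISIBLE_DEVICES limits which physical devices the process can see;
    # the selected card then appears inside the process as device 0.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


if __name__ == "__main__":
    env = single_gpu_env(0)
    print("CUDA_VISIBLE_DEVICES =", env["CUDA_VISIBLE_DEVICES"])
    # Pass env=env to subprocess.run(...) when launching the training script.
```

Running once with index 0 and once with index 1 may also help tell whether the crash follows one particular card.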
Which version of PyTorch are you using? This bug was found in 1.5.0.
I am using a venv, so it should be using whatever was installed by the initial git setup instructions and the pip requirements install.
I checked with: python -c "import torch; print(torch.__version__)"
Output: 2.0.0+cu117
https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/issues/167#issuecomment-1528941884
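For reports like this one, it can help to post more than the PyTorch version. The following is a minimal sketch that gathers the PyTorch build, the CUDA and cuDNN versions, and the visible GPUs in one go. The function name `collect_env_report` is made up for illustration; the `torch` calls are standard PyTorch APIs.

```python
import platform


def collect_env_report():
    """Gather version and GPU details useful for a CUDA bug report."""
    report = {"python": platform.python_version()}
    try:
        import torch
    except ImportError:
        report["torch"] = "not installed"
        return report

    report["torch"] = torch.__version__
    report["cuda_build"] = torch.version.cuda  # None on CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["cudnn"] = torch.backends.cudnn.version()
        report["gpus"] = [
            {
                "index": i,
                "name": torch.cuda.get_device_name(i),
                "capability": torch.cuda.get_device_capability(i),
            }
            for i in range(torch.cuda.device_count())
        ]
    return report


if __name__ == "__main__":
    for key, value in collect_env_report().items():
        print(f"{key}: {value}")
```

Run it inside the same venv that the training script uses, so the report reflects the packages that are actually loaded.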
Which of these GPUs did this occur on?
#167 occurs in both of my environments: one is a dual RTX 3060 system (12 GB each) and the other is a single GTX 1660 Ti. I usually try things on the lower-powered card first and, if they don't work, move to the system with more memory and try there.
It looks like the 1660 Ti skips several steps due to lack of resources, which is fine, but it crashes when it reaches the file-handling step. The same crash happens on the dual 3060 system, which can complete all of the processing and training steps (apart from the dual-GPU training problem described in this issue, #215) but then fails when writing files.
Dual GPUs still broken on current build
INFO:test6:Train Epoch: 6 [56%]
INFO:test6:[1000, 9.993751562304699e-05]
INFO:test6:loss_disc=2.822, loss_gen=3.128, loss_fm=8.556,loss_mel=19.719, loss_kl=1.550
INFO:test6:====> Epoch: 6
INFO:test6:Train Epoch: 7 [67%]
INFO:test6:[1200, 9.99250234335941e-05]
INFO:test6:loss_disc=3.213, loss_gen=2.712, loss_fm=6.322,loss_mel=16.615, loss_kl=1.119
INFO:test6:====> Epoch: 7
[W CUDAGuardImpl.h:124] Warning: CUDA warning: unspecified launch failure (function destroyEvent)
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f711e4634d7 in /home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f711e42d36b in /home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f711e507fa8 in /home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0xdf9d4e (0x7f70a83f9d4e in /home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x4ccea6 (0x7f70e6cccea6 in /home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x3ee77 (0x7f711e448e77 in /home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1be (0x7f711e44169e in /home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f711e4417b9 in /home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x53b5163 (0x7f70d2bb5163 in /home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #9: c10d::ProcessGroupGloo::runLoop(int) + 0x2fe (0x7f70d2bbeb0e in /home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0xdc3a3 (0x7f716e6dc3a3 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #11: <unknown function> + 0x90402 (0x7f71c1490402 in /lib/x86_64-linux-gnu/libc.so.6)
frame #12: <unknown function> + 0x11f590 (0x7f71c151f590 in /lib/x86_64-linux-gnu/libc.so.6)
Traceback (most recent call last):
File "/home/user/Retrieval-based-Voice-Conversion-WebUI/train_nsf_sim_cache_sid_load_pretrain.py", line 534, in <module>
main()
File "/home/user/Retrieval-based-Voice-Conversion-WebUI/train_nsf_sim_cache_sid_load_pretrain.py", line 50, in main
mp.spawn(
File "/home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 239, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/home/user/Retrieval-based-Voice-Conversion-WebUI/train_nsf_sim_cache_sid_load_pretrain.py", line 202, in run
train_and_evaluate(
File "/home/user/Retrieval-based-Voice-Conversion-WebUI/train_nsf_sim_cache_sid_load_pretrain.py", line 415, in train_and_evaluate
scaler.scale(loss_gen_all).backward()
File "/home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Encountering the same issue on a single RTX 4090.