Retrieval-based-Voice-Conversion-WebUI

Model training stops after "INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration."

Open SpringtrapISZ opened this issue 2 years ago • 11 comments

I tried standard model training several times. Each time it gets to this point and just stops, and then eventually times out. Here are the entire contents of the command module from the point where I started model training:

write filelist done
use gpus: 0
runtime\python.exe train_nsf_sim_cache_sid_load_pretrain.py -e "Lecturer" -sr 40k -f0 1 -bs 2 -g 0 -te 250 -se 10 -pg pretrained_v2/f0G40k.pth -pd pretrained_v2/f0D40k.pth -l 1 -c 0 -sw 1 -v v2
NO GPU DETECTED: falling back to CPU - this may take a while
INFO:Lecturer:{'train': {'log_interval': 200, 'seed': 1234, 'epochs': 20000, 'learning_rate': 0.0001, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 2, 'fp16_run': False, 'lr_decay': 0.999875, 'segment_size': 12800, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0}, 'data': {'max_wav_value': 32768.0, 'sampling_rate': 40000, 'filter_length': 2048, 'hop_length': 400, 'win_length': 2048, 'n_mel_channels': 125, 'mel_fmin': 0.0, 'mel_fmax': None, 'training_files': './logs\Lecturer/filelist.txt'}, 'model': {'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [10, 10, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 4, 4], 'use_spectral_norm': False, 'gin_channels': 256, 'spk_embed_dim': 109}, 'model_dir': './logs\Lecturer', 'experiment_dir': './logs\Lecturer', 'save_every_epoch': 10, 'name': 'Lecturer', 'total_epoch': 250, 'pretrainG': 'pretrained_v2/f0G40k.pth', 'pretrainD': 'pretrained_v2/f0D40k.pth', 'version': 'v2', 'gpus': '0', 'sample_rate': '40k', 'if_f0': 1, 'if_latest': 1, 'save_every_weights': '1', 'if_cache_data_in_gpu': 0}
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
gin_channels: 256 self.spk_embed_dim: 109
INFO:Lecturer:loaded pretrained pretrained_v2/f0G40k.pth <All keys matched successfully>
INFO:Lecturer:loaded pretrained pretrained_v2/f0D40k.pth <All keys matched successfully>
C:\Users\JOSEP\OneDrive\Desktop\RVC0813AMD_Intel\RVC0813AMD_Intel\runtime\lib\site-packages\torch\functional.py:641: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error. Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\SpectralOps.cpp:867.)
  return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
C:\Users\JOSEP\OneDrive\Desktop\RVC0813AMD_Intel\RVC0813AMD_Intel\runtime\lib\site-packages\torch\functional.py:641: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error. Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\SpectralOps.cpp:867.)
  return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
C:\Users\JOSEP\OneDrive\Desktop\RVC0813AMD_Intel\RVC0813AMD_Intel\runtime\lib\site-packages\torch\functional.py:641: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error. Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\SpectralOps.cpp:867.)
  return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
C:\Users\JOSEP\OneDrive\Desktop\RVC0813AMD_Intel\RVC0813AMD_Intel\runtime\lib\site-packages\torch\functional.py:641: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error. Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\SpectralOps.cpp:867.)
  return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
C:\Users\JOSEP\OneDrive\Desktop\RVC0813AMD_Intel\RVC0813AMD_Intel\runtime\lib\site-packages\torch\functional.py:641: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error. Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\SpectralOps.cpp:867.)
  return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
C:\Users\JOSEP\OneDrive\Desktop\RVC0813AMD_Intel\RVC0813AMD_Intel\runtime\lib\site-packages\torch\autograd\__init__.py:200: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [64, 1, 4], strides() = [4, 1, 1]
bucket_view.sizes() = [64, 1, 4], strides() = [4, 4, 1] (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\reducer.cpp:337.)
  Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
INFO:Lecturer:Train Epoch: 1 [0%]
INFO:Lecturer:[0, 0.0001]
INFO:Lecturer:loss_disc=3.692, loss_gen=5.756, loss_fm=20.651,loss_mel=42.291, loss_kl=9.000
DEBUG:matplotlib:matplotlib data path: C:\Users\JOSEP\OneDrive\Desktop\RVC0813AMD_Intel\RVC0813AMD_Intel\runtime\lib\site-packages\matplotlib\mpl-data
DEBUG:matplotlib:CONFIGDIR=C:\Users\JOSEP\.matplotlib
DEBUG:matplotlib:interactive is False
DEBUG:matplotlib:platform is win32
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.

I also tried the Nvidia version before and discovered my GPU was not an Nvidia one, so I switched to the AMD/Intel version. Here's a screenshot of the bottom of my GUI if it helps any. Not entirely sure what to do. Screenshot (73)

SpringtrapISZ • Aug 26 '23 04:08
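The short flags in the training command can be lined up with the keys in the config dump that the trainer prints right after it, by matching values (for example `-bs 2` with `'batch_size': 2`, and `-te 250` with `'total_epoch': 250`). The sketch below does that in Python. The table is inferred from this one log rather than taken from the trainer's source, so treat it as an assumption; `FLAG_TO_CONFIG_KEY` and `explain_flags` are hypothetical names used only for illustration.

```python
import shlex

# Inferred by matching flag values against the config dump in the log above.
# -f0, -l and -sw all carry the value 1, so those three are matched by name.
FLAG_TO_CONFIG_KEY = {
    "-e": "name",
    "-sr": "sample_rate",
    "-f0": "if_f0",
    "-bs": "batch_size",
    "-g": "gpus",
    "-te": "total_epoch",
    "-se": "save_every_epoch",
    "-pg": "pretrainG",
    "-pd": "pretrainD",
    "-l": "if_latest",
    "-c": "if_cache_data_in_gpu",
    "-sw": "save_every_weights",
    "-v": "version",
}


def explain_flags(arg_string):
    """Turn '-flag value -flag value ...' into {config_key: value}."""
    tokens = shlex.split(arg_string)
    settings = {}
    # In this command every flag is followed by exactly one value.
    for flag, value in zip(tokens[0::2], tokens[1::2]):
        settings[FLAG_TO_CONFIG_KEY.get(flag, flag)] = value
    return settings


# Argument part of the command from the log (everything after the script name).
ARGS = (
    '-e "Lecturer" -sr 40k -f0 1 -bs 2 -g 0 -te 250 -se 10 '
    "-pg pretrained_v2/f0G40k.pth -pd pretrained_v2/f0D40k.pth "
    "-l 1 -c 0 -sw 1 -v v2"
)

if __name__ == "__main__":
    for key, value in explain_flags(ARGS).items():
        print(f"{key} = {value}")
```

Read this way, the run asks for 250 epochs at batch size 2, and the log line "NO GPU DETECTED: falling back to CPU" shows that all of it runs on the CPU.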

Nevermind I had no idea what the fuck I was doing. Think I've got it now

SpringtrapISZ • Aug 26 '23 06:08

I have the same problem as you. Can you explain how you did it? Thx

Leynoxxx • Sep 05 '23 11:09

I have run into the same problem. Can you explain how you did it?

mannequin74 • Sep 24 '23 11:09

Nevermind I had no idea what the fuck I was doing. Think I've got it now

maybe you could tell us how you solved this problem?

tzwel • Sep 27 '23 06:09

This is due to an incompatibility among TensorFlow-related packages: tensorboard, tensorflow-estimator and keras. I tried this command:

pip3 install --upgrade keras tensorboard tensorflow-estimator

I found an alternative fix for the package compatibility by going to the RVC folder:

cd Retrieval-based-Voice-Conversion-WebUI and pip3 install -r requirements.txt

or, if you have AMD: pip3 install -r requirements-amd.txt

If that last command reports incompatibilities among many packages, this is the cause.

enokseth • Sep 28 '23 21:09
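Before upgrading or reinstalling anything, it can help to see which versions of those three packages are actually installed in the environment that runs the trainer. A minimal sketch using only the Python standard library follows; `installed_versions` is a hypothetical helper name, and it must be run with the same interpreter the WebUI uses (in the log above that is `runtime\python.exe`).

```python
from importlib import metadata

# The packages named in the comment above.
PACKAGES = ["keras", "tensorboard", "tensorflow-estimator"]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'not installed'}")
```

pip also ships a `pip check` command that reports installed packages whose declared dependencies are missing or conflicting, which is a quick way to confirm whether a version mismatch is really present.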

Any idea how you solved it?

mouneero • Sep 29 '23 16:09

Any idea how you solved it?

I mean it's due to the venv. I'm looking for a CLI implementation of this process and I'll be back. My laptop is a 2011 MacBook Pro with 4 GB of RAM, and the last training run I finished took 4 hours because my laptop is so slow.

enokseth • Sep 30 '23 15:09

I have the same problem as you. Can you explain how you did it? Thx

I have run into the same problem. Can you explain how you did it?

Got the same. It's just that the PC you're using is a bit slow. Try lower values for training. My first epoch appeared after about 8 minutes on v2:

INFO:Test:====> Epoch: 1 [2023-10-22 15:15:29] | (0:07:41.536756)

P.S. To install all the requirements, I had to install a base version of Visual Studio.

Valyev • Oct 22 '23 12:10

I'm creating my first model and was worried too. It doesn't stop; your PC just isn't fast enough. Just have patience.

Then a new line will appear, like: ====> Epoch: 1 [2023-10-26 22:56:11] | (0:03:51.914353)

and after a few more intense minutes: ====> Epoch: 2 [2023-10-26 22:59:35] | (0:03:24.230118)

and I guess it goes on until it reaches the number of epochs you've chosen. I guess the best time to do this is before going to sleep.

deadniell • Oct 27 '23 03:10
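The per-epoch timings in those lines are enough for a rough estimate of how long a full run will take. The Python sketch below parses the `====> Epoch:` lines quoted above and extrapolates, assuming future epochs take about as long as the ones already finished. The total of 250 epochs is borrowed from the `-te 250` in the original command purely for illustration, and `parse_epoch_durations` and `estimate_remaining` are hypothetical helper names.

```python
import re
from datetime import timedelta

# Matches e.g. "====> Epoch: 2 [2023-10-26 22:59:35] | (0:03:24.230118)"
EPOCH_LINE = re.compile(
    r"====>\s*Epoch:\s*(\d+)\s*\[[^\]]*\]\s*\|\s*"
    r"\((\d+):(\d{2}):(\d{2}(?:\.\d+)?)\)"
)


def parse_epoch_durations(log_text):
    """Return {epoch_number: duration} for every epoch summary line found."""
    durations = {}
    for match in EPOCH_LINE.finditer(log_text):
        epoch, hours, minutes, seconds = match.groups()
        durations[int(epoch)] = timedelta(
            hours=int(hours), minutes=int(minutes), seconds=float(seconds)
        )
    return durations


def estimate_remaining(durations, total_epochs):
    """Estimate time left, assuming future epochs take the mean of past ones."""
    if not durations:
        raise ValueError("no epoch lines found yet")
    mean = sum(durations.values(), timedelta()) / len(durations)
    remaining_epochs = max(total_epochs - max(durations), 0)
    return mean * remaining_epochs


# The two lines quoted in the comment above.
SAMPLE_LOG = """
====> Epoch: 1 [2023-10-26 22:56:11] | (0:03:51.914353)
====> Epoch: 2 [2023-10-26 22:59:35] | (0:03:24.230118)
"""

if __name__ == "__main__":
    done = parse_epoch_durations(SAMPLE_LOG)
    # 250 is an assumed target, not deadniell's actual setting.
    print("estimated time left:", estimate_remaining(done, total_epochs=250))
```

At roughly three and a half minutes per epoch, a 250-epoch run on that machine works out to many hours, which fits the advice to start it before going to sleep.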

Just wait for it for a while.

hixuanyu • Dec 17 '23 14:12

Yes, I'm struggling with this too. I have 2 Radeon graphics cards but, unfortunately, not a single nVidia one. I keep trying to get it running in DirectML mode, but my programming knowledge is next to zero. I liked the Applio-RVC-Fork version (not the third version). Its interface has a useful feature: it stops if there is no progress for 100 epochs. I used to run on it and everything went quickly, but it turned out that the program does the extraction incorrectly. By sheer luck I found a trick in the code: if you don't specify a GPU number and instead just enter a minus sign "-", the extraction runs in GPU mode without problems and the DML f0 extraction module kicks in. But the program is unfinished, because in the end DML was never implemented for the model training itself. In the newer 3.0.x versions DML isn't even installed at all, apparently because the developer himself gave up on it.

SystemFaifure • Mar 25 '24 10:03

This issue was closed because it has been inactive for 15 days since being marked as stale.

github-actions[bot] • May 11 '24 04:05