Model training stops after "INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration."
I tried standard model training several times. Each time it gets to this point, just stops, and eventually times out. Here is the entire console output from the point where I started model training:
write filelist done
use gpus: 0
runtime\python.exe train_nsf_sim_cache_sid_load_pretrain.py -e "Lecturer" -sr 40k -f0 1 -bs 2 -g 0 -te 250 -se 10 -pg pretrained_v2/f0G40k.pth -pd pretrained_v2/f0D40k.pth -l 1 -c 0 -sw 1 -v v2
NO GPU DETECTED: falling back to CPU - this may take a while
INFO:Lecturer:{'train': {'log_interval': 200, 'seed': 1234, 'epochs': 20000, 'learning_rate': 0.0001, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 2, 'fp16_run': False, 'lr_decay': 0.999875, 'segment_size': 12800, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0}, 'data': {'max_wav_value': 32768.0, 'sampling_rate': 40000, 'filter_length': 2048, 'hop_length': 400, 'win_length': 2048, 'n_mel_channels': 125, 'mel_fmin': 0.0, 'mel_fmax': None, 'training_files': './logs\Lecturer/filelist.txt'}, 'model': {'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [10, 10, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 4, 4], 'use_spectral_norm': False, 'gin_channels': 256, 'spk_embed_dim': 109}, 'model_dir': './logs\Lecturer', 'experiment_dir': './logs\Lecturer', 'save_every_epoch': 10, 'name': 'Lecturer', 'total_epoch': 250, 'pretrainG': 'pretrained_v2/f0G40k.pth', 'pretrainD': 'pretrained_v2/f0D40k.pth', 'version': 'v2', 'gpus': '0', 'sample_rate': '40k', 'if_f0': 1, 'if_latest': 1, 'save_every_weights': '1', 'if_cache_data_in_gpu': 0}
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
gin_channels: 256 self.spk_embed_dim: 109
INFO:Lecturer:loaded pretrained pretrained_v2/f0G40k.pth
<All keys matched successfully>
INFO:Lecturer:loaded pretrained pretrained_v2/f0D40k.pth
<All keys matched successfully>
C:\Users\JOSEP\OneDrive\Desktop\RVC0813AMD_Intel\RVC0813AMD_Intel\runtime\lib\site-packages\torch\functional.py:641: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error. Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\SpectralOps.cpp:867.)
  return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
C:\Users\JOSEP\OneDrive\Desktop\RVC0813AMD_Intel\RVC0813AMD_Intel\runtime\lib\site-packages\torch\functional.py:641: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error. Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\SpectralOps.cpp:867.)
  return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
C:\Users\JOSEP\OneDrive\Desktop\RVC0813AMD_Intel\RVC0813AMD_Intel\runtime\lib\site-packages\torch\functional.py:641: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error. Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\SpectralOps.cpp:867.)
  return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
C:\Users\JOSEP\OneDrive\Desktop\RVC0813AMD_Intel\RVC0813AMD_Intel\runtime\lib\site-packages\torch\functional.py:641: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error. Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\SpectralOps.cpp:867.)
  return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
C:\Users\JOSEP\OneDrive\Desktop\RVC0813AMD_Intel\RVC0813AMD_Intel\runtime\lib\site-packages\torch\functional.py:641: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error. Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\SpectralOps.cpp:867.)
  return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
C:\Users\JOSEP\OneDrive\Desktop\RVC0813AMD_Intel\RVC0813AMD_Intel\runtime\lib\site-packages\torch\autograd\__init__.py:200: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [64, 1, 4], strides() = [4, 1, 1]
bucket_view.sizes() = [64, 1, 4], strides() = [4, 4, 1] (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\reducer.cpp:337.)
  Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
INFO:Lecturer:Train Epoch: 1 [0%]
INFO:Lecturer:[0, 0.0001]
INFO:Lecturer:loss_disc=3.692, loss_gen=5.756, loss_fm=20.651,loss_mel=42.291, loss_kl=9.000
DEBUG:matplotlib:matplotlib data path: C:\Users\JOSEP\OneDrive\Desktop\RVC0813AMD_Intel\RVC0813AMD_Intel\runtime\lib\site-packages\matplotlib\mpl-data
DEBUG:matplotlib:CONFIGDIR=C:\Users\JOSEP\.matplotlib
DEBUG:matplotlib:interactive is False
DEBUG:matplotlib:platform is win32
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
I also tried the Nvidia version before and discovered my GPU was not Nvidia brand, so I switched to the AMD/Intel version. Here's a screenshot of the bottom of my GUI if it helps any. Not entirely sure what to do.
Nevermind I had no idea what the fuck I was doing. Think I've got it now
I have the same problem as you. Can you explain how you fixed it? Thanks.
I ran into the same problem. Can you explain how you fixed it?
Maybe you could tell us how you solved this problem?
This is due to an incompatibility between TensorFlow-related packages:
tensorboard, tensorflow-estimator and keras. I tried this command:
pip3 install --upgrade keras tensorboard tensorflow-estimator
I found an alternative fix for the package compatibility by going to the RVC folder:
cd Retrieval-based-Voice-Conversion-WebUI
pip3 install -r requirements.txt
or, if you have AMD:
pip3 install -r requirements-amd.txt
Use this if the previous command reports incompatibilities between many packages.
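Before upgrading or reinstalling anything, it may help to see which versions of these packages are actually installed. This is only a minimal sketch using the Python standard library. The helper name `installed_versions` is made up for illustration, and the package list is just the three names mentioned in the comment above.

```python
from importlib.metadata import PackageNotFoundError, version

# The three packages named in the comment above; extend the tuple as needed.
PACKAGES = ("keras", "tensorboard", "tensorflow-estimator")


def installed_versions(names):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(PACKAGES).items():
        print(f"{name}: {ver or 'not installed'}")
```

Run it with the same interpreter that launches training (in the log at the top of this thread that is `runtime\python.exe`), otherwise it reports on a different environment. `pip check` can then list any dependency conflicts pip itself detects.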
Any idea how you solved it?
I mean it's due to the venv. I'm looking for a CLI implementation of this process and I'll be back. My laptop is a 2011 MacBook Pro with 4 GB of RAM, and the last training run I finished took 4 hours because of its low performance.
Got the same. It's just that the PC you're using is a bit slow. Try lower values for training. My first epoch appeared after 8 minutes on v2: INFO:Test:====> Epoch: 1 [2023-10-22 15:15:29] | (0:07:41.536756)
P.S. To install all the requirements, I had to install a base version of Visual Studio.
I'm creating my first model and was worried too. It doesn't stop; your PC is just not fast enough. Just have patience,
then a new line will appear like ====> Epoch: 1 [2023-10-26 22:56:11] | (0:03:51.914353)
and after more intense minutes... ====> Epoch: 2 [2023-10-26 22:59:35] | (0:03:24.230118)
and I guess it goes on until it reaches the number of epochs you've chosen. I guess the best time for doing this is before going to sleep.
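For anyone who wants a rough idea of how long the whole run will take, the epoch lines quoted above carry enough information. This is a minimal sketch that assumes the log format shown in this thread, and that every remaining epoch takes about as long as the average so far. The function names are made up for illustration and are not part of RVC.

```python
import re
from datetime import timedelta

# Matches lines such as:
# ====> Epoch: 1 [2023-10-26 22:56:11] | (0:03:51.914353)
EPOCH_LINE = re.compile(
    r"====>\s*Epoch:\s*(\d+)\s*\[[^\]]*\]\s*\|\s*\((\d+):(\d{2}):(\d{2}(?:\.\d+)?)\)"
)


def epoch_durations(log_text):
    """Return {epoch_number: duration} for every epoch summary line found."""
    durations = {}
    for match in EPOCH_LINE.finditer(log_text):
        epoch, hours, minutes, seconds = match.groups()
        durations[int(epoch)] = timedelta(
            hours=int(hours), minutes=int(minutes), seconds=float(seconds)
        )
    return durations


def estimate_remaining(durations, total_epochs):
    """Estimate time left, assuming later epochs take the mean time so far."""
    if not durations:
        return None
    mean = sum(durations.values(), timedelta()) / len(durations)
    return mean * (total_epochs - max(durations))


if __name__ == "__main__":
    sample = (
        "====> Epoch: 1 [2023-10-26 22:56:11] | (0:03:51.914353)\n"
        "====> Epoch: 2 [2023-10-26 22:59:35] | (0:03:24.230118)\n"
    )
    done = epoch_durations(sample)
    print(estimate_remaining(done, total_epochs=250))
```

Paste the console output into `sample` and set `total_epochs` to the value you chose in the GUI (250 in the original post). The estimate is crude, since the first epoch is often slower than later ones.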
Just give it some time.
Yes, I'm struggling with this too. I have 2 Radeon graphics cards but, unfortunately, not a single nVidia one. I keep trying to get it running in DirectML mode, but my programming knowledge is close to zero. I liked the Applio-RVC-Fork version (not version 3). Its interface has a useful feature: it stops if there is no progress for 100 epochs. I used to run training on it and everything went quickly, but it turned out the program does the extraction incorrectly. By sheer luck I found a trick in the code: if you don't specify a GPU number and instead just enter a minus sign "-", extraction runs in GPU mode without problems, and the DML f0 extraction module kicks in. But the program is unfinished, because DML was never implemented for the model training itself. In the new 3.0.x versions DML isn't even installed, apparently because the developer himself gave up on it.
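The "stop if there is no progress for 100 epochs" idea mentioned above is usually called patience-based early stopping. The sketch below only illustrates the general technique and is not Applio-RVC-Fork's actual code. The class name, the `min_delta` parameter and the sample loss values are assumptions made for this example.

```python
class NoProgressStopper:
    """Signal a stop when the tracked loss has not improved for `patience` epochs."""

    def __init__(self, patience=100, min_delta=0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # smallest decrease that counts as progress
        self.best = float("inf")
        self.stale_epochs = 0

    def update(self, loss):
        """Record one epoch's loss; return True when training should stop."""
        if loss < self.best - self.min_delta:
            self.best = loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience


if __name__ == "__main__":
    # Made-up loss values, with a short patience so the stop is easy to see.
    stopper = NoProgressStopper(patience=3)
    for epoch, loss in enumerate([42.3, 40.1, 40.5, 40.2, 40.4, 39.0], start=1):
        if stopper.update(loss):
            print(f"stopping at epoch {epoch}")
            break
```

Which loss to track (for example `loss_mel` from the training log) is a judgment call. On a slow CPU-only machine, 100 epochs of patience can mean many hours, so a smaller value may be more practical.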
This issue was closed because it has been inactive for 15 days since being marked as stale.