speechbrain
Unable to continue training of MetricGAN+ from the saved checkpoint
Hi,
I'm training MetricGAN+ on my own dataset. Training ran fine for more than 200 epochs but then failed for some reason. I tried to resume from the latest saved checkpoint but got a strange error. When I tried to resume from an earlier checkpoint (the best one with respect to PESQ), the same error occurred.
The error log is below:
.... speechbrain.utils.epoch_loop - Going into epoch 203
Discriminator training by current data...
0%| | 0/100 [00:00<?, ?it/s]
speechbrain.core - Exception:
Traceback (most recent call last):
  File "train_v2m.py", line 616, in <module>
    se_brain.fit(
  File "/mnt/asr/home/korenevsky/anaconda3/envs/py38/lib/python3.8/site-packages/speechbrain/core.py", line 1129, in fit
    self._fit_train(train_set=train_set, epoch=epoch, enable=enable)
  File "/mnt/asr/home/korenevsky/anaconda3/envs/py38/lib/python3.8/site-packages/speechbrain/core.py", line 960, in _fit_train
    self.on_stage_start(Stage.TRAIN, epoch)
  File "train_v2m.py", line 374, in on_stage_start
    self.train_discriminator()
  File "train_v2m.py", line 389, in train_discriminator
    self.fit(
  File "/mnt/asr/home/korenevsky/anaconda3/envs/py38/lib/python3.8/site-packages/speechbrain/core.py", line 1129, in fit
    self._fit_train(train_set=train_set, epoch=epoch, enable=enable)
  File "/mnt/asr/home/korenevsky/anaconda3/envs/py38/lib/python3.8/site-packages/speechbrain/core.py", line 985, in _fit_train
    loss = self.fit_batch(batch)
  File "train_v2m.py", line 318, in fit_batch
    self.d_optimizer.step()
  File "/mnt/asr/home/korenevsky/anaconda3/envs/py38/lib/python3.8/site-packages/torch/optim/optimizer.py", line 109, in wrapper
    return func(*args, **kwargs)
  File "/mnt/asr/home/korenevsky/anaconda3/envs/py38/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/asr/home/korenevsky/anaconda3/envs/py38/lib/python3.8/site-packages/torch/optim/adam.py", line 157, in step
    adam(params_with_grad,
  File "/mnt/asr/home/korenevsky/anaconda3/envs/py38/lib/python3.8/site-packages/torch/optim/adam.py", line 213, in adam
    func(params,
  File "/mnt/asr/home/korenevsky/anaconda3/envs/py38/lib/python3.8/site-packages/torch/optim/adam.py", line 255, in _single_tensor_adam
    assert not step_t.is_cuda, "If capturable=False, state_steps should not be CUDA tensors."
AssertionError: If capturable=False, state_steps should not be CUDA tensors.
Please advise how to overcome this and continue training.
P.S. My code is identical to the original recipe except for the data preparation.
This looks like an issue introduced by PyTorch 1.12; see https://github.com/pytorch/pytorch/issues/80809. To verify, could you try downgrading to PyTorch 1.11 and starting again?
If the problem was Adam-related, it has been resolved in PyTorch 1.12.1.
Thank you very much, upgrading PyTorch to 1.12.1 helped!
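For anyone who hits the same assertion when resuming from a checkpoint, here is a minimal sketch of a version guard that could be placed at the top of a training script. It relies only on what was reported in this thread (the error appears with PyTorch 1.12.0 and is gone in 1.12.1); the helper names `parse_release` and `is_affected` are hypothetical and not part of SpeechBrain or PyTorch.

```python
import re

# Assumption taken from this thread: the assertion shows up with PyTorch 1.12.0
# and is resolved in 1.12.1. Other releases are not covered by this check.
AFFECTED_RELEASE = (1, 12, 0)


def parse_release(version: str) -> tuple:
    """Return (major, minor, patch) from a string such as '1.12.0+cu113'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def is_affected(version: str) -> bool:
    """True if the version is the release reported as broken in this thread."""
    return parse_release(version) == AFFECTED_RELEASE


if __name__ == "__main__":
    try:
        import torch  # optional: only used to read the installed version
    except ImportError:
        torch = None
    if torch is not None and is_affected(torch.__version__):
        print(
            "PyTorch 1.12.0 detected: resuming Adam from a checkpoint may fail; "
            "consider upgrading to 1.12.1 or downgrading to 1.11."
        )
```

Running the script prints a warning only when the installed PyTorch is exactly 1.12.0; it does not modify the checkpoint or the optimizer state.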