speechbrain Unable to continue training of MetricGAN+ from the saved checkpoint

Hi,

I'm training MetricGAN+ on my own dataset. Training ran fine for more than 200 epochs but then failed for some reason. I tried to resume from the latest saved checkpoint but got a strange error. When I tried to resume from an earlier saved checkpoint (the best one with respect to PESQ), the same error occurred.

The error log is below:

....
speechbrain.utils.epoch_loop - Going into epoch 203
Discriminator training by current data...
0%| | 0/100 [00:00<?, ?it/s]
speechbrain.core - Exception:
Traceback (most recent call last):
  File "train_v2m.py", line 616, in <module>
    se_brain.fit(
  File "/mnt/asr/home/korenevsky/anaconda3/envs/py38/lib/python3.8/site-packages/speechbrain/core.py", line 1129, in fit
    self._fit_train(train_set=train_set, epoch=epoch, enable=enable)
  File "/mnt/asr/home/korenevsky/anaconda3/envs/py38/lib/python3.8/site-packages/speechbrain/core.py", line 960, in _fit_train
    self.on_stage_start(Stage.TRAIN, epoch)
  File "train_v2m.py", line 374, in on_stage_start
    self.train_discriminator()
  File "train_v2m.py", line 389, in train_discriminator
    self.fit(
  File "/mnt/asr/home/korenevsky/anaconda3/envs/py38/lib/python3.8/site-packages/speechbrain/core.py", line 1129, in fit
    self._fit_train(train_set=train_set, epoch=epoch, enable=enable)
  File "/mnt/asr/home/korenevsky/anaconda3/envs/py38/lib/python3.8/site-packages/speechbrain/core.py", line 985, in _fit_train
    loss = self.fit_batch(batch)
  File "train_v2m.py", line 318, in fit_batch
    self.d_optimizer.step()
  File "/mnt/asr/home/korenevsky/anaconda3/envs/py38/lib/python3.8/site-packages/torch/optim/optimizer.py", line 109, in wrapper
    return func(*args, **kwargs)
  File "/mnt/asr/home/korenevsky/anaconda3/envs/py38/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/asr/home/korenevsky/anaconda3/envs/py38/lib/python3.8/site-packages/torch/optim/adam.py", line 157, in step
    adam(params_with_grad,
  File "/mnt/asr/home/korenevsky/anaconda3/envs/py38/lib/python3.8/site-packages/torch/optim/adam.py", line 213, in adam
    func(params,
  File "/mnt/asr/home/korenevsky/anaconda3/envs/py38/lib/python3.8/site-packages/torch/optim/adam.py", line 255, in _single_tensor_adam
    assert not step_t.is_cuda, "If capturable=False, state_steps should not be CUDA tensors."
AssertionError: If capturable=False, state_steps should not be CUDA tensors.

Please advise how to overcome this and proceed with training.

P.S. My code is identical to the original recipe except for data preparation.

Aug 05 '22 06:08 kfmn

This looks like an issue introduced in PyTorch 1.12; see https://github.com/pytorch/pytorch/issues/80809. To verify, could you try downgrading to PyTorch 1.11 and starting again?

Aug 10 '22 08:08 TParcollet
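For readers who cannot change their PyTorch version right away, here is a minimal sketch of a possible workaround. It is not taken from this thread: it assumes the failure comes from the Adam `step` state landing on the GPU when the checkpoint is loaded, which is what the assertion message suggests. The helper name `move_adam_steps_to_cpu` is hypothetical; it only relies on the optimizer exposing a `state` mapping, as `torch.optim.Optimizer` does.

```python
def move_adam_steps_to_cpu(optimizer) -> int:
    """Move per-parameter "step" counters that live on CUDA back to the CPU.

    Hypothetical helper, not part of SpeechBrain or PyTorch. It expects an
    object with a `state` mapping of per-parameter dicts (the layout used by
    torch.optim.Optimizer) and returns how many counters were moved.
    """
    moved = 0
    for state in optimizer.state.values():
        step = state.get("step")
        # Plain Python numbers (older PyTorch) have no `is_cuda` attribute
        # and are skipped; CPU tensors report is_cuda == False.
        if getattr(step, "is_cuda", False):
            state["step"] = step.cpu()
            moved += 1
    return moved
```

Under these assumptions it would be called once after the checkpoint has been recovered and before the first optimizer step, for example `move_adam_steps_to_cpu(self.d_optimizer)` for the discriminator optimizer named in the traceback above.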

If the problem was Adam-related, it has been resolved in PyTorch 1.12.1.

On Wed, Aug 10, 2022, 4:56 AM Parcollet Titouan wrote:

This looks like an issue created by Pytorch 1.12 see https://github.com/pytorch/pytorch/issues/80809. As a verification, could you try to downgrade to PyTorch 1.11 and start again ?

Aug 10 '22 12:08 mravanelli

Thank you very much, upgrading PyTorch to 1.12.1 helped!

Aug 22 '22 08:08 kfmn
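The outcome of the thread can be sketched as a small version check. It takes the thread at its word that the problem appears in PyTorch 1.12 and is resolved in 1.12.1, so only 1.12.0 is flagged. The function names `parse_release` and `is_affected` are made up for this illustration, and treating development builds of 1.12.0 as affected is an assumption.

```python
import re


def parse_release(version: str) -> tuple:
    """Return (major, minor, patch) from strings such as '1.12.0+cu113'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    # A missing patch component (e.g. '1.12') is treated as patch 0.
    return int(major), int(minor), int(patch or 0)


def is_affected(version: str) -> bool:
    """True for 1.12.0, the release this thread associates with the error."""
    return parse_release(version) == (1, 12, 0)
```

With PyTorch installed, `is_affected(torch.__version__)` could be checked before resuming from a checkpoint; if it returns True, upgrading to 1.12.1 (or downgrading to 1.11) is what resolved the problem here.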