AdamW optimizer error: If capturable=False, state_steps should not be CUDA tensors.
Hi, congratulations on your excellent work! I would really appreciate it if you could help me with this. I ran
PYTHONWARNINGS="ignore" cvnets-train --common.config-file config/classification/imagenet/mobilevit_v2.yaml --common.results-loc mobilevitv2_results/width_1_0_0 --common.override-kwargs scheduler.cosine.max_lr=0.0075 scheduler.cosine.min_lr=0.00075 optim.weight_decay=0.013 model.classification.mitv2.width_multiplier=1.00 --common.tensorboard-logging --common.accum-freq 4 --common.auto-resume
with auto-resume enabled to continue my previous training run, and this error occurred:
2022-07-03 06:06:18 - LOGS - Exception occurred that interrupted the training. If capturable=False, state_steps should not be CUDA tensors.
If capturable=False, state_steps should not be CUDA tensors.
Traceback (most recent call last):
File "/home/yu/projects/mobilevit/ml-cvnets/engine/training_engine.py", line 682, in run
train_loss, train_ckpt_metric = self.train_epoch(epoch)
File "/home/yu/projects/mobilevit/ml-cvnets/engine/training_engine.py", line 353, in train_epoch
self.gradient_scalar.step(optimizer=self.optimizer)
File "/home/yu/anaconda3/envs/mobilevit/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 338, in step
retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
File "/home/yu/anaconda3/envs/mobilevit/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 285, in _maybe_opt_step
retval = optimizer.step(*args, **kwargs)
File "/home/yu/anaconda3/envs/mobilevit/lib/python3.8/site-packages/torch/optim/optimizer.py", line 109, in wrapper
return func(*args, **kwargs)
File "/home/yu/anaconda3/envs/mobilevit/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/yu/anaconda3/envs/mobilevit/lib/python3.8/site-packages/torch/optim/adamw.py", line 161, in step
adamw(params_with_grad,
File "/home/yu/anaconda3/envs/mobilevit/lib/python3.8/site-packages/torch/optim/adamw.py", line 218, in adamw
func(params,
File "/home/yu/anaconda3/envs/mobilevit/lib/python3.8/site-packages/torch/optim/adamw.py", line 259, in _single_tensor_adamw
assert not step_t.is_cuda, "If capturable=False, state_steps should not be CUDA tensors."
I am 100% sure that cuDNN is enabled and all GPUs are available; nothing went wrong when I first started this training run.
Another question: do you have any idea why the training process is slow? Thanks so much!
My versions:
PyTorch version: 1.12.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A
OS: Ubuntu 16.04.7 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~16.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.23
Python version: 3.9.12 (main, Jun 1 2022, 11:38:51) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-4.4.0-210-generic-x86_64-with-glibc2.23
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: NVIDIA TITAN Xp
GPU 1: NVIDIA TITAN Xp
GPU 2: NVIDIA TITAN Xp
GPU 3: NVIDIA TITAN Xp
Nvidia driver version: 465.19.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.0
[pip3] pytorchvideo==0.1.5
[pip3] torch==1.12.0
[pip3] torchvision==0.13.0
[conda] numpy 1.23.0 pypi_0 pypi
[conda] pytorchvideo 0.1.5 pypi_0 pypi
[conda] torch 1.12.0 pypi_0 pypi
[conda] torchvision 0.13.0 pypi_0 pypi
I have now updated CUDA to 11.3, but the result is unchanged.
@yqi19 It seems that the training fails when trying to load the optimizer states. Could you set capturable=True flag in AdamW optimizer and see if that resolves the issue?
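A minimal sketch of this suggestion, under the assumption that the flag has to be applied after the checkpoint is restored: `Optimizer.load_state_dict` takes its param groups from the checkpoint, so a `capturable=True` passed to the constructor before resuming may be overwritten. The helper name `set_capturable` is hypothetical and not part of PyTorch or cvnets; it only relies on the optimizer exposing `param_groups`, as `torch.optim.Optimizer` does.

```python
def set_capturable(optimizer, value=True):
    """Set the 'capturable' flag on every param group of an optimizer.

    Hypothetical helper: it works on any object with a `param_groups`
    list of dicts, which is how torch.optim.Optimizer stores its options.
    """
    for group in optimizer.param_groups:
        # Overwrite whatever value the checkpoint restored for this group.
        group["capturable"] = value
    return optimizer
```

In a resume path this would be called right after `optimizer.load_state_dict(...)` and before the first `optimizer.step()`. Whether `capturable=True` is acceptable for a given run is an assumption to check, since that mode expects both the parameters and the step counters to live on CUDA.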
I have the same problem. I tried setting the capturable=True flag in the AdamW optimizer, but nothing changed. I still get this error: "AssertionError: If capturable=False, state_steps should not be CUDA tensors.".
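If the flag has no effect, one alternative that might be worth trying is to leave capturable=False and instead satisfy the assertion from the traceback (`assert not step_t.is_cuda`) by moving the per-parameter step counters back to the CPU after resuming. The sketch below is an untested assumption, not a confirmed fix from the maintainers. The helper name `move_steps_to_cpu` is hypothetical; it only relies on the optimizer exposing a `state` mapping whose entries may hold a `"step"` value, as `torch.optim.AdamW` does.

```python
def move_steps_to_cpu(optimizer):
    """Move CUDA 'step' counters in optimizer.state to the CPU.

    Hypothetical helper; returns how many step counters were moved.
    """
    moved = 0
    for param_state in optimizer.state.values():
        step = param_state.get("step")
        # Plain ints (older checkpoints) and CPU tensors are left untouched;
        # only objects reporting is_cuda=True are moved.
        if getattr(step, "is_cuda", False):
            param_state["step"] = step.cpu()
            moved += 1
    return moved
```

As with the flag, this would be called once after `optimizer.load_state_dict(...)` and before training continues. A return value of 0 would suggest the step counters were not on CUDA in the first place.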