Traceback (most recent call last):
File "/home/ps/ZW/pyskl/tools/train.py", line 177, in <module>
main()
File "/home/ps/ZW/pyskl/tools/train.py", line 169, in main
train_model(model, datasets, cfg, validate=args.validate, test=test_option, timestamp=timestamp, meta=meta)
File "/home/ps/ZW/pyskl/pyskl/apis/train.py", line 153, in train_model
runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
File "/home/ps/anaconda3/envs/pyskl/lib/python3.10/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/ps/anaconda3/envs/pyskl/lib/python3.10/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
self.call_hook('after_train_iter')
File "/home/ps/anaconda3/envs/pyskl/lib/python3.10/site-packages/mmcv/runner/base_runner.py", line 309, in call_hook
getattr(hook, fn_name)(self)
File "/home/ps/anaconda3/envs/pyskl/lib/python3.10/site-packages/mmcv/runner/hooks/optimizer.py", line 59, in after_train_iter
grad_norm = self.clip_grads(runner.model.parameters())
File "/home/ps/anaconda3/envs/pyskl/lib/python3.10/site-packages/mmcv/runner/hooks/optimizer.py", line 50, in clip_grads
return clip_grad.clip_grad_norm_(params, **self.grad_clip)
File "/home/ps/anaconda3/envs/pyskl/lib/python3.10/site-packages/torch/nn/utils/clip_grad.py", line 76, in clip_grad_norm_
torch._foreach_mul_(grads, clip_coef_clamped.to(device)) # type: ignore[call-overload]
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
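The error message suggests rerunning with CUDA_LAUNCH_BLOCKING=1 so that the failing kernel is reported at its real call site rather than at a later API call. A minimal sketch of how that could be set up from Python is shown here; the helper name `debug_env` is hypothetical, and it assumes the variable must be in the environment before the training process makes its first CUDA call.

```python
import os


def debug_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled.

    `debug_env` is a hypothetical helper for illustration, not part of pyskl,
    mmcv, or PyTorch.
    """
    # Copy so the caller's mapping (or os.environ) is not modified.
    env = dict(os.environ if base_env is None else base_env)
    # With this set, kernel launches block until completion, so the traceback
    # should point at the operation that actually failed.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


if __name__ == "__main__":
    print(debug_env()["CUDA_LAUNCH_BLOCKING"])
```

The resulting mapping can be passed as `env=` to `subprocess.run` when launching the training script. Setting the variable in the shell before the usual training command should have the same effect. Training will run slower while it is enabled.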
Training crashes abruptly with this CUDA error during gradient clipping, and I have not been able to locate the cause.
Please help.