CUDA kernel error

phongnhhn92 opened this issue 2 years ago

Hello, I have been trying to train a model on the Jumping sequence using your code, but I get this CUDA kernel error:

python train.py   --dataset_name monocular --root_dir $ROOT_DIR   --img_wh 512 288 --start_end 0 30   --N_samples 128 --N_importance 0 --encode_t --use_viewdir   --num_epochs 50 --batch_size 512   --optimizer adam --lr 5e-4 --lr_scheduler cosine   --exp_name exp
Global seed set to 42
/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py:20: LightningDeprecationWarning: The `pl.plugins.training_type.ddp.DDPPlugin` is deprecated in v1.6 and will be removed in v1.8. Use `pl.strategies.ddp.DDPStrategy` instead.
  rank_zero_deprecation(
/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:312: LightningDeprecationWarning: Passing <pytorch_lightning.plugins.training_type.ddp.DDPPlugin object at 0x7f74534b47c0> `strategy` to the `plugins` flag in Trainer has been deprecated in v1.5 and will be removed in v1.7. Use `Trainer(strategy=<pytorch_lightning.plugins.training_type.ddp.DDPPlugin object at 0x7f74534b47c0>)` instead.
  rank_zero_deprecation(
/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/callback_connector.py:96: LightningDeprecationWarning: Setting `Trainer(progress_bar_refresh_rate=1)` is deprecated in v1.5 and will be removed in v1.7. Please pass `pytorch_lightning.callbacks.progress.TQDMProgressBar` with `refresh_rate` directly to the Trainer's `callbacks` argument instead. Or, to disable the progress bar pass `enable_progress_bar = False` to the Trainer.
  rank_zero_deprecation(
/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/callback_connector.py:171: LightningDeprecationWarning: Setting `Trainer(weights_summary=None)` is deprecated in v1.5 and will be removed in v1.7. Please set `Trainer(enable_model_summary=False)` instead.
  rank_zero_deprecation(
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:154: LightningDeprecationWarning: The `LightningModule.get_progress_bar_dict` method was deprecated in v1.5 and will be removed in v1.7. Please use the `ProgressBarBase.get_metrics` instead.
  rank_zero_deprecation(
Global seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  /opt/conda/conda-bld/pytorch_1646755903507/work/aten/src/ATen/native/TensorShape.cpp:2228.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Epoch 0:   1%| | 24/3539 [00:03<09:04,  6.46it/s, loss=0.208, train/col_l=0.0248, train/disp_l=0.0415, train/entropy_l=0.00239, train/cross_entropy_l=0.000, train/flow_fw_l=0.0434, train/flow_bw_l=0.0416, train/pho_l=0.0495, train/

/opt/conda/conda-bld/pytorch_1646755903507/work/aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [11,0,0], thread: [0,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1646755903507/work/aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [11,0,0], thread: [1,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1646755903507/work/aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [11,0,0], thread: [2,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1646755903507/work/aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [11,0,0], thread: [3,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1646755903507/work/aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [11,0,0], thread: [4,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1646755903507/work/aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [11,0,0], thread: [5,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1646755903507/work/aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [11,0,0], thread: [6,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1646755903507/work/aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [11,0,0], thread: [7,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1646755903507/work/aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [11,0,0], thread: [8,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1646755903507/work/aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [11,0,0], thread: [9,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

Traceback (most recent call last):
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 719, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1234, in _run
    results = self._run_stage()
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1321, in _run_stage
    return self._run_train()
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1351, in _run_train
    self.fit_loop.run()
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 268, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 208, in advance
    batch_output = self.batch_loop.run(batch, batch_idx)
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(split_batch, optimizers, batch_idx)
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 203, in advance
    result = self._run_optimization(
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 256, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 369, in _optimizer_step
    self.trainer._call_lightning_module_hook(
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1593, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1644, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 168, in step
    step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 278, in optimizer_step
    optimizer_output = super().optimizer_step(optimizer, opt_idx, closure, model, **kwargs)
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 193, in optimizer_step
    return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 155, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/torch/optim/optimizer.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/torch/optim/adam.py", line 100, in step
    loss = closure()
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 140, in _wrap_closure
    closure_result = closure()
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 148, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 134, in closure
    step_output = self._step_fn()
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 427, in _training_step
    training_step_output = self.trainer._call_strategy_hook("training_step", *step_kwargs.values())
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1763, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 341, in training_step
    return self.model(*args, **kwargs)
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 963, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/overrides/base.py", line 82, in forward
    output = self.module.training_step(*inputs, **kwargs)
  File "train.py", line 187, in training_step
    loss_d = self.loss(results, batch, **kwargs)
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/phong/data/Nvidia inter/Code/nsff_pl/losses.py", line 117, in forward
    if valid_geo_fw.any():
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 316, in <module>
    main(hparams)
  File "train.py", line 300, in main
    trainer.fit(system)
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 768, in fit
    self._call_and_handle_interrupt(
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 736, in _call_and_handle_interrupt
    self._teardown()
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1298, in _teardown
    self.strategy.teardown()
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 447, in teardown
    super().teardown()
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/strategies/parallel.py", line 134, in teardown
    super().teardown()
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 444, in teardown
    optimizers_to_device(self.optimizers, torch.device("cpu"))
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/utilities/optimizer.py", line 27, in optimizers_to_device
    optimizer_to_device(opt, device)
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/utilities/optimizer.py", line 33, in optimizer_to_device
    optimizer.state[p] = apply_to_collection(v, torch.Tensor, move_data_to_device, device)
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 107, in apply_to_collection
    v = apply_to_collection(
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 99, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 354, in move_data_to_device
    return apply_to_collection(batch, dtype=dtype, function=batch_to)
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 99, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/home/phong/miniconda3/envs/deep/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 347, in batch_to
    data_output = data.to(device, **kwargs)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
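
In case it helps to localize the failing index, the snippet below is only a rough, standalone sketch of the debugging route the error message itself suggests: setting CUDA_LAUNCH_BLOCKING=1 and validating the indices on CPU before the indexing op. All names in it are made up for illustration and are not taken from losses.py.

```python
# Not a fix, just a way to localize the device-side assert. The tensor/function
# names below are hypothetical, NOT from losses.py: the same "index out of bounds"
# assert fires on CUDA for bad fancy indexing, while checking the indices on CPU
# first gives a readable report of exactly which entries are out of range.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before the first CUDA call

import torch

def report_bad_indices(values: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
    """Check idx against values.shape[0] and print any out-of-range entries."""
    bad = (idx < 0) | (idx >= values.shape[0])
    if bad.any():
        print("out-of-bounds indices:", idx[bad].tolist())
        idx = idx.clamp(0, values.shape[0] - 1)  # illustrative only
    return values[idx]

if __name__ == "__main__":
    vals = torch.arange(10)
    print(report_bad_indices(vals, torch.tensor([0, 3, 9])))   # all valid
    print(report_bad_indices(vals, torch.tensor([0, 3, 12])))  # reports 12
```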


phongnhhn92 · May 29 '22

I also ran into this problem. Have you solved it?

yaoyuan13 · Jul 27 '22

After a few tries, it just works, bro ✌🏼

phongnhhn92 · Mar 08 '23