ptlflow
Issue with RPKNet: unable to train
I have gone through your paper; it presents a very elegant way to estimate optical flow.
I tried to reproduce the results using the setup given in your project.
My machine has two Titan RTX GPUs with 24 GB of VRAM each. The command I used:
(ptlflow) anil@anil-gpu2:/media/anil/New Volume/nihal/ptlflow$ python train.py rpknet --random_seed 1234 --gradient_clip_val 1.0 --lr 2.5e-4 --wdecay 1e-4 --gamma 0.8 --gpus 2 --train_dataset chairs --train_batch_size 10 --max_epochs 35 --pyramid_ranges 32 8 --iters 12 --corr_mode allpairs --not_cache_pkconv_weights
/home/anil/miniconda3/envs/ptlflow/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 40 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
[the same PossibleUserWarning is printed for val_dataloader 1 and val_dataloader 2]
07/23/2024 13:32:09 - INFO: Loading 22872 samples from FlyingChairs dataset.
07/23/2024 13:32:10 - INFO: Loading 22872 samples from FlyingChairs dataset.
Epoch 0: 0%| | 0/2286 [00:00<?, ?it/s]
/home/anil/miniconda3/envs/ptlflow/lib/python3.10/site-packages/lightning/pytorch/utilities/data.py:83: UserWarning: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 10. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.
  warning_cache.warn(
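As far as I can tell, this num_workers warning is only about dataloading throughput and is unrelated to the crash further down. For completeness, here is a minimal sketch of what Lightning is suggesting, using the plain torch.utils.data.DataLoader API (the dataset and worker count below are placeholders, not the actual ptlflow loaders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    # Tiny placeholder dataset standing in for the FlyingChairs loaders built by train.py.
    dataset = TensorDataset(
        torch.randn(16, 3, 64, 64),  # stand-in image tensors
        torch.randn(16, 2, 64, 64),  # stand-in flow targets
    )

    # The warning only asks for more worker processes so that batches are prepared
    # in parallel with the GPU step; Lightning suggested up to 40 on this 40-CPU machine.
    loader = DataLoader(dataset, batch_size=10, shuffle=True, num_workers=8, pin_memory=True)

    for images, flows in loader:
        pass  # the Lightning training step consumes batches like this
```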
/home/anil/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/autograd/__init__.py:251: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [576, 128, 1, 1], strides() = [128, 1, 128, 128]
bucket_view.sizes() = [576, 128, 1, 1], strides() = [128, 1, 1, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:320.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[the same UserWarning is printed once more by the second DDP rank]
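The "Grad strides do not match bucket view strides" warning also looks performance-only, as the message itself says. In case it is useful, here is a small standalone sketch (not RPKNet code) of how a non-contiguous gradient on a 1x1-conv weight of that same shape can be spotted by printing gradient strides:

```python
import torch
import torch.nn as nn

# Toy 1x1 conv with the same weight shape as in the warning ([576, 128, 1, 1]),
# only to illustrate how to inspect gradient strides; it is not RPKNet code.
conv = nn.Conv2d(128, 576, kernel_size=1, bias=False)

# A channels-last input can make the backend return a channels-last weight grad,
# which is one way to end up with strides like (128, 1, 128, 128).
x = torch.randn(2, 128, 32, 32).to(memory_format=torch.channels_last)
conv(x).sum().backward()

for name, p in conv.named_parameters():
    print(name, tuple(p.grad.shape),
          "strides:", p.grad.stride(),
          "contiguous:", p.grad.is_contiguous())
```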
Epoch 0: 17%|███▏ | 400/2286 [26:41<2:05:51, 4.00s/it, loss=38.8, v_num=0, epe_step=11.00]
File "/media/anil/New Volume/nihal/ptlflow/ptlflow/models/rpknet/rpknet.py", line 310, in forward
flow_predictions, flow_small, flow_up = self.predict(image1, image2, flow_init)
File "/media/anil/New Volume/nihal/ptlflow/ptlflow/models/rpknet/rpknet.py", line 326, in predict
x1_pyramid = self.fnet(x1_raw)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/media/anil/New Volume/nihal/ptlflow/ptlflow/models/rpknet/pkconv_slk_encoder.py", line 192, in forward
x = self.rec_stage(h, out_ch).contiguous()
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/media/anil/New Volume/nihal/ptlflow/ptlflow/models/rpknet/pkconv_slk.py", line 342, in forward
x = blk(x, out_ch=out_ch)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/media/anil/New Volume/nihal/ptlflow/ptlflow/models/rpknet/pkconv_slk.py", line 220, in forward
* self.attn(self.norm1(x))
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/media/anil/New Volume/nihal/ptlflow/ptlflow/models/rpknet/pkconv_slk.py", line 156, in forward
x = self.spatial_gating_unit(x, out_ch=out_ch)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/media/anil/New Volume/nihal/ptlflow/ptlflow/models/rpknet/pkconv_slk.py", line 128, in forward
y = y + self.conv1_branches[0](y, out_ch=out_ch)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/media/anil/New Volume/nihal/ptlflow/ptlflow/models/rpknet/pkconv.py", line 208, in forward
res = pkconv2d(
File "/media/anil/New Volume/nihal/ptlflow/ptlflow/models/rpknet/pkconv.py", line 79, in pkconv2d
w = weight[slice_idx, : x.shape[1] // bounded_groups]
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[W CUDAGuardImpl.h:115] Warning: CUDA warning: an illegal memory access was encountered (function destroyEvent)
Aborted (core dumped)
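The failing line is the weight slicing in pkconv.py (w = weight[slice_idx, : x.shape[1] // bounded_groups]), and an asynchronous illegal memory access is what an out-of-bounds GPU index often looks like. I am not sure that is the real cause, but this is the kind of bounds check I would insert locally to get a readable error instead of a core dump (names copied from the traceback; the check itself is just my debugging sketch, not ptlflow code):

```python
import torch

def checked_weight_slice(weight, slice_idx, x, bounded_groups):
    """Same slicing as pkconv.py line 79, plus explicit bounds checks so that an
    out-of-range index raises a readable error instead of a CUDA abort.
    Debugging aid only; not part of ptlflow."""
    in_ch = x.shape[1] // bounded_groups
    if in_ch > weight.shape[1]:
        raise ValueError(
            f"x has {x.shape[1]} channels / {bounded_groups} groups = {in_ch}, "
            f"but weight.shape[1] is only {weight.shape[1]}"
        )
    # Assuming slice_idx is an int or an index tensor; adjust if it is a Python slice.
    idx = slice_idx if torch.is_tensor(slice_idx) else torch.as_tensor(slice_idx)
    if idx.numel() > 0 and (int(idx.min()) < 0 or int(idx.max()) >= weight.shape[0]):
        raise IndexError(
            f"slice_idx range [{int(idx.min())}, {int(idx.max())}] is out of bounds "
            f"for weight.shape[0] = {weight.shape[0]}"
        )
    return weight[slice_idx, :in_ch]
```

With a guard like this (plus CUDA_LAUNCH_BLOCKING=1), an out-of-range slice_idx or a channel mismatch should fail with a clear message at the offending call instead of the opaque core dump.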
I have tried most things: debugging with CUDA_LAUNCH_BLOCKING=1 and TORCH_USE_CUDA_DSA=1, running on a single GPU, and reducing the batch size to 4.
None of this worked, so I am giving up on debugging it on my own and raising the issue here.
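One more experiment I can try if it would help: a single forward pass on the CPU, where an out-of-bounds index raises an ordinary IndexError with the exact values instead of a CUDA abort. This is only a rough sketch, assuming the get_model and "images"-dict input convention from the ptlflow README; the resolution is arbitrary:

```python
import torch
import ptlflow

# Build RPKNet without pretrained weights and keep everything on the CPU.
model = ptlflow.get_model("rpknet", pretrained_ckpt=None)
model.eval()

# ptlflow models take a dict with an "images" tensor of shape [B, 2, 3, H, W]
# (two frames per sample); 384x512 is just a placeholder resolution that is
# divisible by the model's pyramid strides.
inputs = {"images": torch.randn(1, 2, 3, 384, 512)}

with torch.no_grad():
    preds = model(inputs)

print({k: tuple(v.shape) for k, v in preds.items() if torch.is_tensor(v)})
```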