ptlflow
Issue with RPKNet: unable to train
I have gone through your paper; it presents a very elegant way to estimate optical flow.
I tried to reproduce the results using the setup given in your project.
My machine has two Titan RTX GPUs with 24 GB of VRAM each. The command I used:
(ptlflow) anil@anil-gpu2:/media/anil/New Volume/nihal/ptlflow$ python train.py rpknet --random_seed 1234 --gradient_clip_val 1.0 --lr 2.5e-4 --wdecay 1e-4 --gamma 0.8 --gpus 2 --train_dataset chairs --train_batch_size 10 --max_epochs 35 --pyramid_ranges 32 8 --iters 12 --corr_mode allpairs --not_cache_pkconv_weights
/home/anil/miniconda3/envs/ptlflow/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 40 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
[the same PossibleUserWarning is printed for val_dataloader 1 and val_dataloader 2]
07/23/2024 13:32:09 - INFO: Loading 22872 samples from FlyingChairs dataset.
07/23/2024 13:32:10 - INFO: Loading 22872 samples from FlyingChairs dataset.
Epoch 0: 0%| | 0/2286 [00:00<?, ?it/s]
/home/anil/miniconda3/envs/ptlflow/lib/python3.10/site-packages/lightning/pytorch/utilities/data.py:83: UserWarning: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 10. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.
  warning_cache.warn(
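As far as I can tell, this num_workers warning is only about dataloading throughput and is unrelated to the crash further down. For completeness, here is a minimal sketch of what Lightning is suggesting, using the plain torch.utils.data.DataLoader API (the dataset and worker count below are placeholders, not the actual ptlflow loaders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    # Tiny placeholder dataset standing in for the FlyingChairs loaders built by train.py.
    dataset = TensorDataset(
        torch.randn(16, 3, 64, 64),  # stand-in image tensors
        torch.randn(16, 2, 64, 64),  # stand-in flow targets
    )

    # The warning only asks for more worker processes so that batches are prepared
    # in parallel with the GPU step; Lightning suggested up to 40 on this 40-CPU machine.
    loader = DataLoader(dataset, batch_size=10, shuffle=True, num_workers=8, pin_memory=True)

    for images, flows in loader:
        pass  # the Lightning training step consumes batches like this
```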
/home/anil/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/autograd/__init__.py:251: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [576, 128, 1, 1], strides() = [128, 1, 128, 128]
bucket_view.sizes() = [576, 128, 1, 1], strides() = [128, 1, 1, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:320.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[the same UserWarning is printed once more by the second DDP rank]
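The "Grad strides do not match bucket view strides" warning also looks performance-only, as the message itself says. In case it is useful, here is a small standalone sketch (not RPKNet code) of how a non-contiguous gradient on a 1x1-conv weight of that same shape can be spotted by printing gradient strides:

```python
import torch
import torch.nn as nn

# Toy 1x1 conv with the same weight shape as in the warning ([576, 128, 1, 1]),
# only to illustrate how to inspect gradient strides; it is not RPKNet code.
conv = nn.Conv2d(128, 576, kernel_size=1, bias=False)

# A channels-last input can make the backend return a channels-last weight grad,
# which is one way to end up with strides like (128, 1, 128, 128).
x = torch.randn(2, 128, 32, 32).to(memory_format=torch.channels_last)
conv(x).sum().backward()

for name, p in conv.named_parameters():
    print(name, tuple(p.grad.shape),
          "strides:", p.grad.stride(),
          "contiguous:", p.grad.is_contiguous())
```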
Epoch 0: 17%|███▏ | 400/2286 [26:41<2:05:51, 4.00s/it, loss=38.8, v_num=0, epe_step=11.00]
File "/media/anil/New Volume/nihal/ptlflow/ptlflow/models/rpknet/rpknet.py", line 310, in forward
flow_predictions, flow_small, flow_up = self.predict(image1, image2, flow_init)
File "/media/anil/New Volume/nihal/ptlflow/ptlflow/models/rpknet/rpknet.py", line 326, in predict
x1_pyramid = self.fnet(x1_raw)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/media/anil/New Volume/nihal/ptlflow/ptlflow/models/rpknet/pkconv_slk_encoder.py", line 192, in forward
x = self.rec_stage(h, out_ch).contiguous()
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/media/anil/New Volume/nihal/ptlflow/ptlflow/models/rpknet/pkconv_slk.py", line 342, in forward
x = blk(x, out_ch=out_ch)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/media/anil/New Volume/nihal/ptlflow/ptlflow/models/rpknet/pkconv_slk.py", line 220, in forward
* self.attn(self.norm1(x))
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/media/anil/New Volume/nihal/ptlflow/ptlflow/models/rpknet/pkconv_slk.py", line 156, in forward
x = self.spatial_gating_unit(x, out_ch=out_ch)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/media/anil/New Volume/nihal/ptlflow/ptlflow/models/rpknet/pkconv_slk.py", line 128, in forward
y = y + self.conv1_branches[0](y, out_ch=out_ch)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/media/anil/New Volume/nihal/ptlflow/ptlflow/models/rpknet/pkconv.py", line 208, in forward
res = pkconv2d(
File "/media/anil/New Volume/nihal/ptlflow/ptlflow/models/rpknet/pkconv.py", line 79, in pkconv2d
w = weight[slice_idx, : x.shape[1] // bounded_groups]
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[W CUDAGuardImpl.h:115] Warning: CUDA warning: an illegal memory access was encountered (function destroyEvent)
Aborted (core dumped)
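The failing line is the weight slicing in pkconv.py (w = weight[slice_idx, : x.shape[1] // bounded_groups]), and an asynchronous illegal memory access is what an out-of-bounds GPU index often looks like. I am not sure that is the real cause, but this is the kind of bounds check I would insert locally to get a readable error instead of a core dump (names copied from the traceback; the check itself is just my debugging sketch, not ptlflow code):

```python
import torch

def checked_weight_slice(weight, slice_idx, x, bounded_groups):
    """Same slicing as pkconv.py line 79, plus explicit bounds checks so that an
    out-of-range index raises a readable error instead of a CUDA abort.
    Debugging aid only; not part of ptlflow."""
    in_ch = x.shape[1] // bounded_groups
    if in_ch > weight.shape[1]:
        raise ValueError(
            f"x has {x.shape[1]} channels / {bounded_groups} groups = {in_ch}, "
            f"but weight.shape[1] is only {weight.shape[1]}"
        )
    # Assuming slice_idx is an int or an index tensor; adjust if it is a Python slice.
    idx = slice_idx if torch.is_tensor(slice_idx) else torch.as_tensor(slice_idx)
    if idx.numel() > 0 and (int(idx.min()) < 0 or int(idx.max()) >= weight.shape[0]):
        raise IndexError(
            f"slice_idx range [{int(idx.min())}, {int(idx.max())}] is out of bounds "
            f"for weight.shape[0] = {weight.shape[0]}"
        )
    return weight[slice_idx, :in_ch]
```

With a guard like this (plus CUDA_LAUNCH_BLOCKING=1), an out-of-range slice_idx or a channel mismatch should fail with a clear message at the offending call instead of the opaque core dump.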
I have tried most things: debugging with CUDA_LAUNCH_BLOCKING=1 and TORCH_USE_CUDA_DSA=1, running on a single GPU, and reducing the batch size to 4.
None of this worked, so I am giving up on debugging it on my own and raising the issue here.
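One more experiment I can try if it would help: a single forward pass on the CPU, where an out-of-bounds index raises an ordinary IndexError with the exact values instead of a CUDA abort. This is only a rough sketch, assuming the get_model and "images"-dict input convention from the ptlflow README; the resolution is arbitrary:

```python
import torch
import ptlflow

# Build RPKNet without pretrained weights and keep everything on the CPU.
model = ptlflow.get_model("rpknet", pretrained_ckpt=None)
model.eval()

# ptlflow models take a dict with an "images" tensor of shape [B, 2, 3, H, W]
# (two frames per sample); 384x512 is just a placeholder resolution that is
# divisible by the model's pyramid strides.
inputs = {"images": torch.randn(1, 2, 3, 384, 512)}

with torch.no_grad():
    preds = model(inputs)

print({k: tuple(v.shape) for k, v in preds.items() if torch.is_tensor(v)})
```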