pytorch-image-models

updated reader_factory to correct extra folders

RorryB opened this issue 9 months ago • 0 comments

The main train.py does not properly instantiate the ImageFolder dataset when --train-split, --val-split, and --num-classes are specified. This appears to happen only when the data-dir contains folders other than data/train and data/val; in my case the extra folder was data/test.
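For illustration, here is a minimal sketch of why an extra folder breaks the label space (a simplified stand-in, not timm's actual reader code): folder readers derive labels from directory paths relative to the scan root, so scanning data/ instead of data/train/ picks up test/* (and val/*) directories as additional classes.

import os

def build_class_map(root):
    # Simplified sketch, not timm's reader: treat any directory that
    # directly holds image files as a class, named by its path relative
    # to the scan root.
    classes = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        if any(f.lower().endswith(('.jpg', '.jpeg', '.png')) for f in filenames):
            classes.add(os.path.relpath(dirpath, root))
    return {name: idx for idx, name in enumerate(sorted(classes))}

# With root='data' the map contains train/*, val/*, and test/* entries,
# e.g. 600 classes for a 200-class dataset, so label indices can reach
# 599 even though the model head only has --num-classes 200 outputs.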

run 1 (original)

CUDA_VISIBLE_DEVICES=0 python train.py --data-dir data --model seresnet34 --sched cosine --epochs 150 --warmup-epochs 5 --lr 0.4 --reprob 0.5 --remode pixel --batch-size 256 --amp -j 4 --train-split data/train/ --val-split data/val/ --num-classes 200

Training with a single process on 1 device (cuda).
Model seresnet34 created, param count:21550016
Data processing configuration for current model + dataset:
    input_size: (3, 224, 224)
    interpolation: bicubic
    mean: (0.485, 0.456, 0.406)
    std: (0.229, 0.224, 0.225)
    crop_pct: 0.875
    crop_mode: center
Created SGD (sgd) optimizer: lr: 0.4, momentum: 0.9, dampening: 0, weight_decay: 2e-05, nesterov: True, maximize: False, foreach: None, differentiable: False, fused: None
Using native Torch AMP. Training in mixed precision.
Scheduled epochs: 150 (epochs + cooldown_epochs). Warmup within epochs when warmup_prefix=False. LR stepped per epoch.
Traceback (most recent call last):
  File "/pai/train.py", line 1235, in <module>
    main()
  File "/pai/train.py", line 888, in main
    train_metrics = train_one_epoch(
  File "/pai/train.py", line 1083, in train_one_epoch
    loss = _forward()
  File "/pai/train.py", line 1051, in _forward
    loss = loss_fn(output, target)
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/pai/timm/loss/cross_entropy.py", line 22, in forward
    nll_loss = -logprobs.gather(dim=-1, index=target.unsqueeze(1))
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [70,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
(the same assertion repeats for threads [84,0,0], [23,0,0], [100,0,0], [104,0,0], [116,0,0], [119,0,0], [55,0,0], [58,0,0], and [63,0,0])
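The assert and the failing gather in timm/loss/cross_entropy.py both point at target indices beyond the 200-way output. A minimal CPU reproduction of the failure mode, with made-up label values rather than anything taken from the run above:

import torch

num_classes = 200
logprobs = torch.log_softmax(torch.randn(4, num_classes), dim=-1)
target = torch.tensor([3, 17, 450, 599])  # 450 and 599 exceed num_classes - 1

# On CPU this raises a readable RuntimeError immediately; on CUDA the
# same call trips the asynchronous device-side assert shown above.
nll_loss = -logprobs.gather(dim=-1, index=target.unsqueeze(1))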

---- making the edit: nano timm/data/readers/reader_factory.py ----
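The patch itself is not shown in this report; purely as a guess at its shape (resolve_split_root is a hypothetical name, not a timm function), the idea would be to root the folder reader at the split directory so sibling folders such as data/test are never scanned for class names:

import os

def resolve_split_root(root: str, split: str) -> str:
    # Hypothetical sketch, not the actual patch: prefer root/<split>
    # when it exists so only that split's class folders are indexed,
    # ignoring siblings such as data/test.
    split_root = os.path.join(root, os.path.basename(split.rstrip('/')))
    return split_root if os.path.isdir(split_root) else root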

run 2

CUDA_VISIBLE_DEVICES=0 python train.py --data-dir data --model seresnet34 --sched cosine --epochs 150 --warmup-epochs 5 --lr 0.4 --reprob 0.5 --remode pixel --batch-size 256 --amp -j 4 --train-split data/train/ --val-split data/val/ --num-classes 200

Training with a single process on 1 device (cuda).
Model seresnet34 created, param count:21550016
Data processing configuration for current model + dataset:
    input_size: (3, 224, 224)
    interpolation: bicubic
    mean: (0.485, 0.456, 0.406)
    std: (0.229, 0.224, 0.225)
    crop_pct: 0.875
    crop_mode: center
Created SGD (sgd) optimizer: lr: 0.4, momentum: 0.9, dampening: 0, weight_decay: 2e-05, nesterov: True, maximize: False, foreach: None, differentiable: False, fused: None
Using native Torch AMP. Training in mixed precision.
Scheduled epochs: 150 (epochs + cooldown_epochs). Warmup within epochs when warmup_prefix=False. LR stepped per epoch.
Train: 0 [   0/390 (  0%)]  Loss: 5.36 (5.36)  Time: 1.552s, 164.95/s (1.552s, 164.95/s)  LR: 1.000e-05  Data: 0.517 (0.517)
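For anyone who hits the same assert before applying a fix, a fail-fast CPU-side check on the targets makes the cause obvious without CUDA_LAUNCH_BLOCKING=1. A hypothetical helper (check_targets is not part of train.py):

import torch

def check_targets(target: torch.Tensor, num_classes: int) -> None:
    # Hypothetical debug helper: surface bad labels with a clear message
    # instead of an asynchronous CUDA device-side assert.
    bad = (target < 0) | (target >= num_classes)
    if bad.any():
        raise ValueError(
            f"targets out of range for {num_classes} classes: "
            f"{target[bad].unique().tolist()}"
        )

Calling check_targets(target, args.num_classes) just before loss_fn(output, target) would point straight at the inflated class map.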

RorryB · Mar 03 '25 03:03