D-FINE icon indicating copy to clipboard operation
D-FINE copied to clipboard

单卡训练

Open suyu-star opened this issue 1 year ago • 5 comments

你好,我想转变成单卡训练,我修改脚本为 CUDA_VISIBLE_DEVICES=0 torchrun --master_port=7777 --nproc_per_node=1 train.py -c configs/dfine/dfine_hgnetv2_l_coco.yml --use-amp --seed=0 但是会出现如下报错 k: [10,0,0], thread: [21,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [10,0,0], thread: [22,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [10,0,0], thread: [26,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [10,0,0], thread: [27,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [10,0,0], ator(): block: [3,0,0], thread: [122,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [123,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [127,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [1,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [2,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [3,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [4,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [5,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [6,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [10,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [11,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [12,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [13,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [14,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [15,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [19,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [20,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [21,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [22,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [23,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [24,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [28,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [29,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [30,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [31,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. Traceback (most recent call last): File "/home/lee/yuechu/D-FINE/train.py", line 84, in main(args) File "/home/lee/yuechu/D-FINE/train.py", line 54, in main solver.fit() File "/home/lee/yuechu/D-FINE/src/solver/det_solver.py", line 63, in fit train_stats = train_one_epoch( ^^^^^^^^^^^^^^^^ File "/home/lee/yuechu/D-FINE/src/solver/det_engine.py", line 78, in train_one_epoch loss_dict = criterion(outputs, targets, **metas) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/lee/anaconda3/envs/yolo/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/lee/yuechu/D-FINE/src/zoo/dfine/dfine_criterion.py", line 236, in forward indices = self.matcher(outputs_without_aux, targets)['indices'] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/lee/anaconda3/envs/yolo/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/lee/anaconda3/envs/yolo/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/home/lee/yuechu/D-FINE/src/zoo/dfine/matcher.py", line 102, in forward cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/lee/yuechu/D-FINE/src/zoo/dfine/box_ops.py", line 54, in generalized_box_iou assert (boxes1[:, 2:] >= boxes1[:, :2]).all() RuntimeError: CUDA error: device-side assert triggered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

suyu-star avatar Nov 27 '24 09:11 suyu-star

Same issue +1

prashant-dn avatar Dec 09 '24 07:12 prashant-dn

I solved it this way.

  1. My train json was truncated... had to re-generate it.
  2. The issue says that some tensors (in gpu) can't be accessed hence the assertion couldn't be made. If you simple use a debugger to print the tgt_bbox etc, you will get same error. You can copy them to CPU to print but its not possible to do this for other tensors. The only solution is to use
export TORCH_USE_CUDA_DSA=1 && \
export CUDA_LAUNCH_BLOCKING=1 && \
clear && export model=m && CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 /envs/dfine/bin/python3 -m torch.distributed.run \
    --master_port=7777 \
    --nproc_per_node=8 \
    train.py -c configs/dfine/custom/dfine_hgnetv2_${model}_custom.yml \
    --use-amp --seed=0 --summary-dir tboard

prashant-dn avatar Dec 09 '24 11:12 prashant-dn

@prashant-dn How did you solve it? I don't understand your solution

nikky4D avatar Dec 15 '24 05:12 nikky4D

@nikky4D first ensure that your annotation json is a valid json. Second run the python3 command using the flags TORCH_USE_CUDA_DSA=1 and CUDA_LAUNCH_BLOCKING=1 as shown in above comment.

prashant-dn avatar Dec 16 '24 04:12 prashant-dn

@nikky4D Actually the problem is the labels - The class ids in my COCO annotation json file were not 0-indexed i.e., the first class was 1 and not 0, while D-Fine expected them to be 0-indexed. That is why you are getting this error too.

Rather do this to fix : In file, D-FINE/src/zoo/dfine/matcher.py, make the following edit Screenshot 2025-01-03 at 3 17 23 PM

prashant-dn avatar Jan 03 '25 09:01 prashant-dn