单卡训练
你好,我想转变成单卡训练,我修改脚本为
CUDA_VISIBLE_DEVICES=0 torchrun --master_port=7777 --nproc_per_node=1 train.py -c configs/dfine/dfine_hgnetv2_l_coco.yml --use-amp --seed=0
但是会出现如下报错
k: [10,0,0], thread: [21,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [10,0,0], thread: [22,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [10,0,0], thread: [26,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [10,0,0], thread: [27,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [10,0,0], ator(): block: [3,0,0], thread: [122,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [123,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [127,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [1,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [2,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [3,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [4,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [5,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [6,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [10,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [11,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [12,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [13,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [14,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [15,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [19,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [20,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [21,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [22,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [23,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [24,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [28,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [29,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [30,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1682343970094/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [31,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
Traceback (most recent call last):
File "/home/lee/yuechu/D-FINE/train.py", line 84, in TORCH_USE_CUDA_DSA to enable device-side assertions.
Same issue +1
I solved it this way.
- My train json was truncated... had to re-generate it.
- The issue says that some tensors (in gpu) can't be accessed hence the assertion couldn't be made. If you simple use a debugger to print the
tgt_bboxetc, you will get same error. You can copy them to CPU to print but its not possible to do this for other tensors. The only solution is to use
export TORCH_USE_CUDA_DSA=1 && \
export CUDA_LAUNCH_BLOCKING=1 && \
clear && export model=m && CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 /envs/dfine/bin/python3 -m torch.distributed.run \
--master_port=7777 \
--nproc_per_node=8 \
train.py -c configs/dfine/custom/dfine_hgnetv2_${model}_custom.yml \
--use-amp --seed=0 --summary-dir tboard
@prashant-dn How did you solve it? I don't understand your solution
@nikky4D first ensure that your annotation json is a valid json. Second run the python3 command using the flags TORCH_USE_CUDA_DSA=1 and CUDA_LAUNCH_BLOCKING=1 as shown in above comment.
@nikky4D Actually the problem is the labels - The class ids in my COCO annotation json file were not 0-indexed i.e., the first class was 1 and not 0, while D-Fine expected them to be 0-indexed. That is why you are getting this error too.
Rather do this to fix :
In file, D-FINE/src/zoo/dfine/matcher.py, make the following edit