
Problem encountered during training on 4090

Open · ZhenshengWu opened this issue 10 months ago · 1 comment

CUDA version:

```
Thu Mar 28 13:09:21 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA Graphics...   On  | 00000000:17:00.0 Off |                  Off |
| 66%   28C    P8    25W / 450W |      1MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA Graphics...   On  | 00000000:18:00.0 Off |                  Off |
| 66%   33C    P8    27W / 450W |      1MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA Graphics...   On  | 00000000:31:00.0 Off |                  Off |
| 66%   29C    P8    23W / 450W |      1MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA Graphics...   On  | 00000000:32:00.0 Off |                  Off |
| 65%   29C    P8    18W / 450W |      1MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA Graphics...   On  | 00000000:4B:00.0 Off |                  Off |
| 68%   28C    P8    22W / 450W |      1MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA Graphics...   On  | 00000000:67:00.0 Off |                  Off |
| 66%   32C    P8    29W / 450W |      1MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA Graphics...   On  | 00000000:98:00.0 Off |                  Off |
| 63%   35C    P8    17W / 450W |      1MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA Graphics...   On  | 00000000:E3:00.0 Off |                  Off |
| 66%   32C    P8    25W / 450W |      1MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
```

Run command:

```
python3 -m torch.distributed.run --nproc_per_node 8 --master_port 29519 train.py --sync-bn --cfg cfg_fire_and_smoke/yolov7_fire_smoke.yaml --data cfg_fire_and_smoke/fire_smoke_data.yaml --img-size 640 --batch-size 16 --weights '' --device 0,1,2,3,4,5,6,7
```

The same command runs fine on another GPU.
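(Editor's note, not part of the original report: before launching the full DDP job, a minimal sketch like the one below can confirm that every GPU visible to PyTorch accepts a small allocation. It is only a diagnostic aid under the assumption that the eight cards shown above are idle; the tensor size is an arbitrary choice, and the snippet is not part of the yolov7 code.)

```python
# Diagnostic sketch (not from the yolov7 repo): check that each visible GPU
# can receive a small tensor before starting distributed training.
import torch

assert torch.cuda.is_available(), "CUDA is not available in this environment"

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    # A tiny allocation; an OOM here on an idle 24 GB card would point away
    # from the training batch size as the cause.
    x = torch.zeros(1024, 1024, device=f"cuda:{i}")
    print(f"cuda:{i} {props.name}: "
          f"{torch.cuda.memory_allocated(i) // 2**20} MiB allocated of "
          f"{props.total_memory // 2**20} MiB total")
    del x
    torch.cuda.empty_cache()
```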

I got this error. The `torch.meshgrid` UserWarning and the traceback below are printed by each of the eight ranks, so they appear repeatedly and interleaved in the raw output; only one copy of each is shown here:

```
/root/anaconda3/envs/test/lib/python3.8/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1678402421473/work/aten/src/ATen/native/TensorShape.cpp:3483.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]

Model Summary: 415 layers, 37201950 parameters, 37201950 gradients, 105.1 GFLOPS

Traceback (most recent call last):
  File "train.py", line 616, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 95, in train
    model = Model(opt.cfg, ch=3, nc=nc, anchors=hyp.get('anchors')).to(device)  # create
  File "/root/anaconda3/envs/test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/root/anaconda3/envs/test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/root/anaconda3/envs/test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/root/anaconda3/envs/test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/root/anaconda3/envs/test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/root/anaconda3/envs/test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```

The launcher then reports:

```
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 13378) of binary: /root/anaconda3/envs/test/bin/python3
Traceback (most recent call last):
  File "/root/anaconda3/envs/test/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/anaconda3/envs/test/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/anaconda3/envs/test/lib/python3.8/site-packages/torch/distributed/run.py", line 798, in <module>
    main()
  File "/root/anaconda3/envs/test/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/test/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/root/anaconda3/envs/test/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/anaconda3/envs/test/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/test/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
[1]:
  time       : 2024-03-28_13:15:41
  host       : root123
  rank       : 1 (local_rank: 1)
  exitcode   : 1 (pid: 13379)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time       : 2024-03-28_13:15:41
  host       : root123
  rank       : 2 (local_rank: 2)
  exitcode   : 1 (pid: 13380)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time       : 2024-03-28_13:15:41
  host       : root123
  rank       : 3 (local_rank: 3)
  exitcode   : 1 (pid: 13381)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time       : 2024-03-28_13:15:41
  host       : root123
  rank       : 4 (local_rank: 4)
  exitcode   : 1 (pid: 13382)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time       : 2024-03-28_13:15:41
  host       : root123
  rank       : 5 (local_rank: 5)
  exitcode   : 1 (pid: 13383)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time       : 2024-03-28_13:15:41
  host       : root123
  rank       : 6 (local_rank: 6)
  exitcode   : 1 (pid: 13384)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time       : 2024-03-28_13:15:41
  host       : root123
  rank       : 7 (local_rank: 7)
  exitcode   : 1 (pid: 13385)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time       : 2024-03-28_13:15:41
  host       : root123
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 13378)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```

I do not know why I got this error. How should I solve this problem? Thanks very much.

ZhenshengWu · Mar 28 '24 13:03

The error occurs because you ran out of memory on your GPU. One way to solve it is to reduce the batch size until your code runs without this error.
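In this run that would mean lowering `--batch-size` from 16 in the launch command and retrying. As a rough illustration only (not code from the yolov7 repo; the `find_max_batch_size` name and the halving strategy are my own choices), one way to probe for a batch size that fits is to catch CUDA out-of-memory errors on a dummy forward pass:

```python
# Illustrative sketch: halve the batch size until a dummy forward pass
# fits in GPU memory. Not part of yolov7's train.py.
import torch

def find_max_batch_size(model, img_size=640, start=16, device="cuda:0"):
    model = model.to(device).eval()
    batch = start
    while batch >= 1:
        dummy = None
        try:
            dummy = torch.zeros(batch, 3, img_size, img_size, device=device)
            with torch.no_grad():
                model(dummy)
            return batch  # this batch size fits on the device
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise  # some other failure, re-raise it
            dummy = None               # drop any partial allocation
            torch.cuda.empty_cache()   # release cached blocks before retrying
            batch //= 2                # halve and try again
    raise RuntimeError("even batch size 1 does not fit in GPU memory")
```

Keep in mind that a real training step (gradients, optimizer state, data augmentation) needs noticeably more memory than this eval-mode forward pass, so treat any value found this way as an upper bound and leave headroom.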

MayeshMohapatra · May 15 '24 17:05