GP-VTON

CUDA Error

Open philz1337x opened this issue 2 years ago • 2 comments

I run into CUDA errors when trying to run test_tryon.py:

Traceback (most recent call last):
  File "test_tryon.py", line 25, in <module>
    torch.cuda.set_device(opt.local_rank)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/cuda/__init__.py", line 311, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

(The same traceback is printed seven times in total, once for each failing process.)
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 75518 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 75519) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
test_tryon.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-06-23_09:26:10
  host      : 192-9-146-177
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 75520)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-06-23_09:26:10
  host      : 192-9-146-177
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 75521)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2023-06-23_09:26:10
  host      : 192-9-146-177
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 75522)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2023-06-23_09:26:10
  host      : 192-9-146-177
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 75523)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2023-06-23_09:26:10
  host      : 192-9-146-177
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 75524)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2023-06-23_09:26:10
  host      : 192-9-146-177
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 75525)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-06-23_09:26:10
  host      : 192-9-146-177
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 75519)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
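
One possible reading of this log, offered as a sketch rather than a confirmed diagnosis: ranks 1 to 7 fail with `invalid device ordinal` while the one remaining process (pid 75518) is only sent SIGTERM. That pattern is what one would expect if the launcher started more processes than the machine has visible GPUs, since `torch.cuda.set_device(opt.local_rank)` is then called with an index that does not exist. The helper names `visible_gpu_count` and `failing_ranks` below are hypothetical, and the process count of 8 and GPU count of 1 are assumptions inferred from the rank numbers, not values stated in the issue.

```python
from typing import List


def visible_gpu_count() -> int:
    """Return the number of CUDA devices PyTorch can see (0 if torch is missing)."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


def failing_ranks(num_processes: int, device_count: int) -> List[int]:
    """List the local ranks for which set_device(local_rank) has no matching GPU.

    Valid device ordinals are 0 .. device_count - 1, so every rank at or
    above device_count would hit 'invalid device ordinal'.
    """
    return [rank for rank in range(num_processes) if rank >= device_count]


# Assumed scenario: 8 processes launched on a machine with 1 visible GPU.
# Ranks 1..7 have no device, which matches the ranks reported in the log.
assumed_failures = failing_ranks(num_processes=8, device_count=1)
```

If `visible_gpu_count()` returns fewer devices than the number of processes requested from the launcher, lowering the process count to match is a reasonable first thing to try.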

philz1337x, Jun 23 '23 09:06

Is your error resolved? @philz1337x

rudrapsc, Jul 28 '23 12:07

I ran into a problem similar to yours. In my case CUDA was not installed. I reinstalled CUDA together with torch and torchvision versions that match it.
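
A quick way to check whether the installed PyTorch build and the CUDA setup agree, as a minimal sketch. The function name `cuda_report` is hypothetical; `torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()` and `torch.cuda.device_count()` are standard PyTorch attributes and calls.

```python
from typing import Any, Dict


def cuda_report() -> Dict[str, Any]:
    """Collect basic facts about the local PyTorch / CUDA setup."""
    try:
        import torch
    except ImportError:
        # torch itself is missing, so nothing else can be queried.
        return {"torch_installed": False}

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version the torch build was compiled against;
        # None means a CPU-only build of torch is installed.
        "built_with_cuda": torch.version.cuda,
        # False if no usable driver or GPU is found at runtime.
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


report = cuda_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

A `built_with_cuda` value of `None`, or `cuda_available` being `False`, would point toward the kind of mismatch described in this comment.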

shaoke317, Dec 17 '24 11:12