
Multi-gpu on a single node

Open Arrebol2020 opened this issue 3 years ago • 4 comments

Hello, I can successfully run 'Single gpu on a single node', but when I try 'Multi-gpu on a single node', I get the following error:

Traceback (most recent call last):
  File "/home/sevati/anaconda3/envs/cosypose/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/sevati/anaconda3/envs/cosypose/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/sevati/anaconda3/envs/cosypose/lib/python3.7/site-packages/cosypose-1.0.0-py3.7-linux-x86_64.egg/cosypose/scripts/run_cosypose_eval.py", line 16, in <module>
    from cosypose.config import EXP_DIR, MEMORY, RESULTS_DIR, LOCAL_DATA_DIR
  File "/home/sevati/anaconda3/envs/cosypose/lib/python3.7/site-packages/cosypose-1.0.0-py3.7-linux-x86_64.egg/cosypose/config.py", line 33, in <module>
    assert LOCAL_DATA_DIR.exists()
AssertionError
Setting OMP and MKL num threads to 1.

Why is LOCAL_DATA_DIR being checked in python3.7/site-packages/cosypose-1.0.0-py3.7-linux-x86_64.egg/cosypose/config.py and not in projects/cosypose/cosypose?
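A quick way to confirm which copy of the package Python actually imports is to ask the import system for the module's file location. The sketch below is only a diagnostic aid; `module_origin` is a hypothetical helper name, not part of cosypose. If it prints a path inside `site-packages/...egg`, the package was probably installed as a copy (an assumption based on the egg path in the traceback), so `config.py` resolves its data directory relative to the egg and not the project checkout. Reinstalling in development mode (for example `pip install -e .` from the checkout) is the usual way to make imports point back at the source tree.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be loaded from, or None.

    Uses the standard import machinery without executing the module itself.
    """
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # "cosypose" is the package from the traceback; None means it is not importable.
    print(module_origin("cosypose"))
```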

Arrebol2020, Jul 20 '21 07:07

Now the error has changed to:

RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled cuda error, NCCL version 2.7.8
Setting OMP and MKL num threads to 1.
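"unhandled cuda error" on its own says little about the cause. NCCL reads the `NCCL_DEBUG` environment variable, and setting it to `INFO` makes each process log its initialization steps, which usually narrows things down. A minimal sketch of preparing such an environment for a launch command follows; `nccl_debug_env` is a hypothetical helper, and the command you pass to `subprocess.run` is whatever multi-GPU command you were already using.

```python
import os


def nccl_debug_env(base_env=None, level="INFO"):
    """Return a copy of the environment with NCCL debug logging enabled.

    The caller's environment is not modified; the copy can be passed to
    subprocess.run(..., env=...).
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = level
    return env


if __name__ == "__main__":
    env = nccl_debug_env()
    print("NCCL_DEBUG =", env["NCCL_DEBUG"])
    # Then launch the existing command with it, e.g.:
    # subprocess.run(your_launch_command, env=env, check=True)
```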

Arrebol2020, Jul 21 '21 07:07

@Arrebol2020 Hello, have you solved it?

anxiaomi, Nov 29 '21 02:11

@Arrebol2020 Hello, have you solved it?

I didn't solve it, so I implemented DDP myself, and it seems to work.
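The code from that reimplementation is not posted in this thread, so the following is not it. It is only a sketch of one step that any hand-rolled multi-process evaluation needs: giving each process a disjoint slice of the dataset. It assumes the launcher exports `RANK` and `WORLD_SIZE` as environment variables (as `torch.distributed.launch` does) and uses a simple strided split with no padding; `shard_indices` and `rank_and_world_size` are hypothetical names.

```python
import os


def rank_and_world_size(env=None):
    """Read this process's rank and the total process count.

    Falls back to a single-process setup when the variables are missing.
    """
    env = os.environ if env is None else env
    return int(env.get("RANK", 0)), int(env.get("WORLD_SIZE", 1))


def shard_indices(num_items, rank, world_size):
    """Return the dataset indices handled by `rank` (strided, no padding)."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    return list(range(rank, num_items, world_size))


if __name__ == "__main__":
    rank, world_size = rank_and_world_size()
    print(rank, shard_indices(10, rank, world_size))
```

Because the shards do not overlap and together cover every index, per-process results can be gathered and concatenated afterwards without deduplication.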

Arrebol2020, Nov 30 '21 12:11

@Arrebol2020 Could you share your work?

kochsebastian, Dec 13 '21 12:12