
Platform::WindowlessEglApplication::tryCreateContext(): unable to find EGL device for CUDA device 0 WindowlessContext: Unable to create windowless context

Open zhangyu0110 opened this issue 10 months ago • 1 comments

Hello, I am using habitat-sim 0.1.7 in a Docker container. When I train with one 3090 GPU, everything works fine, but when I use two GPUs, the following error occurs. Could you please help me understand why?

CUDA_VISIBLE_DEVICES=0,1 bash run_r2r/main.bash train 2333

train mode

/root/miniconda3/envs/vlnce/lib/python3.6/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torch.distributed.run. Note that --use_env is set by default in torch.distributed.run. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions

  FutureWarning,
WARNING:torch.distributed.run:*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
2025-01-02 13:50:51,734 Initializing dataset VLN-CE-v1
2025-01-02 13:50:51,734 Initializing dataset VLN-CE-v1
2025-01-02 13:50:52,398 SPLTI: train, NUMBER OF SCENES: 61
2025-01-02 13:50:52,398 SPLTI: train, NUMBER OF SCENES: 61
2025-01-02 13:50:55,648 Initializing dataset VLN-CE-v1
2025-01-02 13:50:55,650 Initializing dataset VLN-CE-v1
2025-01-02 13:50:55,717 Initializing dataset VLN-CE-v1
2025-01-02 13:50:55,720 Initializing dataset VLN-CE-v1
2025-01-02 13:50:55,727 Initializing dataset VLN-CE-v1
2025-01-02 13:50:55,731 Initializing dataset VLN-CE-v1
2025-01-02 13:50:55,731 Initializing dataset VLN-CE-v1
2025-01-02 13:50:55,731 Initializing dataset VLN-CE-v1
2025-01-02 13:50:56,349 initializing sim Sim-v1
2025-01-02 13:50:56,351 initializing sim Sim-v1
Platform::WindowlessEglApplication::tryCreateContext(): unable to find EGL device for CUDA device 0
WindowlessContext: Unable to create windowless context
Platform::WindowlessEglApplication::tryCreateContext(): unable to find EGL device for CUDA device 1
WindowlessContext: Unable to create windowless context
2025-01-02 13:50:56,430 initializing sim Sim-v1
2025-01-02 13:50:56,432 initializing sim Sim-v1
2025-01-02 13:50:56,436 initializing sim Sim-v1
2025-01-02 13:50:56,440 initializing sim Sim-v1
2025-01-02 13:50:56,443 initializing sim Sim-v1
2025-01-02 13:50:56,444 initializing sim Sim-v1
Platform::WindowlessEglApplication::tryCreateContext(): unable to find EGL device for CUDA device 0
WindowlessContext: Unable to create windowless context
Platform::WindowlessEglApplication::tryCreateContext(): unable to find EGL device for CUDA device 1
WindowlessContext: Unable to create windowless context
Platform::WindowlessEglApplication::tryCreateContext(): unable to find EGL device for CUDA device 0
WindowlessContext: Unable to create windowless context
Platform::WindowlessEglApplication::tryCreateContext(): unable to find EGL device for CUDA device 1
WindowlessContext: Unable to create windowless context
Traceback (most recent call last):
  File "run.py", line 113, in <module>
Platform::WindowlessEglApplication::tryCreateContext(): unable to find EGL device for CUDA device 1
WindowlessContext: Unable to create windowless context
    main()
  File "run.py", line 49, in main
Platform::WindowlessEglApplication::tryCreateContext(): unable to find EGL device for CUDA device 0
WindowlessContext: Unable to create windowless context
    run_exp(**vars(args))
  File "run.py", line 106, in run_exp
    trainer.train()
  File "/home/ETPNav/vlnce_baselines/ss_trainer_ETP.py", line 451, in train
    observation_space, action_space = self._init_envs()
  File "/home/ETPNav/vlnce_baselines/ss_trainer_ETP.py", line 168, in _init_envs
    auto_reset_done=False
  File "/home/ETPNav/vlnce_baselines/common/env_utils.py", line 122, in construct_envs
    workers_ignore_signals=workers_ignore_signals,
  File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 194, in __init__
    read_fn() for read_fn in self._connection_read_fns
  File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 194, in <listcomp>
    read_fn() for read_fn in self._connection_read_fns
  File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 97, in __call__
    res = self.read_fn()
  File "/home/ETPNav/habitat-lab/habitat/utils/pickle5_multiprocessing.py", line 68, in recv
    buf = self.recv_bytes()
  File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Traceback (most recent call last):
  File "run.py", line 113, in <module>
    main()
  File "run.py", line 49, in main
    run_exp(**vars(args))
  File "run.py", line 106, in run_exp
    trainer.train()
  File "/home/ETPNav/vlnce_baselines/ss_trainer_ETP.py", line 451, in train
    observation_space, action_space = self._init_envs()
  File "/home/ETPNav/vlnce_baselines/ss_trainer_ETP.py", line 168, in _init_envs
    auto_reset_done=False
  File "/home/ETPNav/vlnce_baselines/common/env_utils.py", line 122, in construct_envs
    workers_ignore_signals=workers_ignore_signals,
  File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 194, in __init__
    read_fn() for read_fn in self._connection_read_fns
  File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 194, in <listcomp>
    read_fn() for read_fn in self._connection_read_fns
  File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 97, in __call__
    res = self.read_fn()
  File "/home/ETPNav/habitat-lab/habitat/utils/pickle5_multiprocessing.py", line 68, in recv
    buf = self.recv_bytes()
  File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Exception ignored in: <bound method VectorEnv.__del__ of <habitat.core.vector_env.VectorEnv object at 0x7fec60fd9358>>
Traceback (most recent call last):
  File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 588, in __del__
    self.close()
  File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 456, in close
    read_fn()
  File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 97, in __call__
    res = self.read_fn()
  File "/home/ETPNav/habitat-lab/habitat/utils/pickle5_multiprocessing.py", line 68, in recv
    buf = self.recv_bytes()
  File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError:
Exception ignored in: <bound method VectorEnv.__del__ of <habitat.core.vector_env.VectorEnv object at 0x7f8dac95c358>>
Traceback (most recent call last):
  File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 588, in __del__
    self.close()
  File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 456, in close
    read_fn()
  File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 97, in __call__
    res = self.read_fn()
  File "/home/ETPNav/habitat-lab/habitat/utils/pickle5_multiprocessing.py", line 68, in recv
    buf = self.recv_bytes()
  File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError:
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2299562) of binary: /root/miniconda3/envs/vlnce/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/vlnce/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/root/miniconda3/envs/vlnce/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/vlnce/lib/python3.6/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/root/miniconda3/envs/vlnce/lib/python3.6/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/root/miniconda3/envs/vlnce/lib/python3.6/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/root/miniconda3/envs/vlnce/lib/python3.6/site-packages/torch/distributed/run.py", line 692, in run
    )(*cmd_args)
  File "/root/miniconda3/envs/vlnce/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/vlnce/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:


         run.py FAILED             

=======================================
Root Cause:
[0]:
  time: 2025-01-02_13:50:58
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 2299562)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"

Other Failures:
[1]:
  time: 2025-01-02_13:50:58
  rank: 1 (local_rank: 1)
  exitcode: 1 (pid: 2299563)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
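
Because the output of several worker processes is interleaved, it is hard to see at a glance which GPUs are affected. A minimal sketch for tallying the EGL failures per CUDA device from a captured log is below. The function name `failed_egl_devices` and the sample string are made up for illustration and are not part of habitat-sim; the regular expression is copied from the error text in the log above.

```python
import re
from collections import Counter

# Matches the error line printed by habitat-sim in the log above and
# captures the CUDA device index it mentions.
EGL_ERROR = re.compile(r"unable to find EGL device for CUDA device (\d+)")


def failed_egl_devices(log_text):
    """Return a Counter mapping CUDA device index -> number of EGL failures."""
    return Counter(int(index) for index in EGL_ERROR.findall(log_text))


# Hypothetical sample: two error lines in the same format as the log.
sample = (
    "Platform::WindowlessEglApplication::tryCreateContext(): "
    "unable to find EGL device for CUDA device 0\n"
    "Platform::WindowlessEglApplication::tryCreateContext(): "
    "unable to find EGL device for CUDA device 1\n"
)

for device, count in sorted(failed_egl_devices(sample).items()):
    print("CUDA device {}: {} failed context creation(s)".format(device, count))
```

Running this over the full log saved to a file (read it with `open(path).read()`) shows whether only the second GPU fails or both do, which helps narrow down where the problem lies.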


zhangyu0110 avatar Jan 03 '25 02:01 zhangyu0110