
EOF Error after force stop training

Open HuskyKingdom opened this issue 2 years ago • 6 comments

I was running the provided baseline code with the following command: python3 -u habitat_baselines/run.py --exp-config habitat_baselines/config/pointnav/ppo_pointnav_example.yaml --run-type train. Training ran fine for a few hours until I stopped it with Ctrl+C in the terminal. When I tried to rerun it, I got an EOFError while the environments were being constructed, as shown below:


Process ForkServerProcess-1:
Traceback (most recent call last):
  File "/Users/topsofter/opt/miniconda3/envs/habitat/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/Users/topsofter/opt/miniconda3/envs/habitat/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/topsofter/opt/miniconda3/envs/habitat/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/Users/topsofter/Desktop/PhD/Embodied_AI/habitat-lab/habitat/core/vector_env.py", line 233, in _worker_env
    env = env_fn(*env_fn_args)
  File "/Users/topsofter/Desktop/PhD/Embodied_AI/habitat-lab/habitat/utils/env_utils.py", line 32, in make_env_fn
    env = env_class(config=config, dataset=dataset)
TypeError: 'NoneType' object is not callable
Traceback (most recent call last):
  File "habitat_baselines/run.py", line 81, in <module>
    main()
  File "habitat_baselines/run.py", line 40, in main
    run_exp(**vars(args))
  File "habitat_baselines/run.py", line 77, in run_exp
    execute_exp(config, run_type)
  File "habitat_baselines/run.py", line 60, in execute_exp
    trainer.train()
  File "/Users/topsofter/opt/miniconda3/envs/habitat/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/Users/topsofter/Desktop/PhD/Embodied_AI/habitat-lab/habitat_baselines/rl/ppo/ppo_trainer.py", line 729, in train
    self._init_train()
  File "/Users/topsofter/Desktop/PhD/Embodied_AI/habitat-lab/habitat_baselines/rl/ppo/ppo_trainer.py", line 259, in _init_train
    self._init_envs()
  File "/Users/topsofter/Desktop/PhD/Embodied_AI/habitat-lab/habitat_baselines/rl/ppo/ppo_trainer.py", line 206, in _init_envs
    workers_ignore_signals=is_slurm_batch_job(),
  File "/Users/topsofter/Desktop/PhD/Embodied_AI/habitat-lab/habitat/utils/env_utils.py", line 116, in construct_envs
    workers_ignore_signals=workers_ignore_signals,
  File "/Users/topsofter/Desktop/PhD/Embodied_AI/habitat-lab/habitat/core/vector_env.py", line 194, in __init__
    read_fn() for read_fn in self._connection_read_fns
  File "/Users/topsofter/Desktop/PhD/Embodied_AI/habitat-lab/habitat/core/vector_env.py", line 194, in <listcomp>
    read_fn() for read_fn in self._connection_read_fns
  File "/Users/topsofter/Desktop/PhD/Embodied_AI/habitat-lab/habitat/core/vector_env.py", line 97, in __call__
    res = self.read_fn()
  File "/Users/topsofter/Desktop/PhD/Embodied_AI/habitat-lab/habitat/utils/pickle5_multiprocessing.py", line 67, in recv
    buf = self.recv_bytes()
  File "/Users/topsofter/opt/miniconda3/envs/habitat/lib/python3.7/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/Users/topsofter/opt/miniconda3/envs/habitat/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/Users/topsofter/opt/miniconda3/envs/habitat/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
Exception ignored in: <function VectorEnv.__del__ at 0x7fe898c9f320>
Traceback (most recent call last):
  File "/Users/topsofter/Desktop/PhD/Embodied_AI/habitat-lab/habitat/core/vector_env.py", line 592, in __del__
    self.close()
  File "/Users/topsofter/Desktop/PhD/Embodied_AI/habitat-lab/habitat/core/vector_env.py", line 460, in close
    read_fn()
  File "/Users/topsofter/Desktop/PhD/Embodied_AI/habitat-lab/habitat/core/vector_env.py", line 97, in __call__
    res = self.read_fn()
  File "/Users/topsofter/Desktop/PhD/Embodied_AI/habitat-lab/habitat/utils/pickle5_multiprocessing.py", line 67, in recv
    buf = self.recv_bytes()
  File "/Users/topsofter/opt/miniconda3/envs/habitat/lib/python3.7/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/Users/topsofter/opt/miniconda3/envs/habitat/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/Users/topsofter/opt/miniconda3/envs/habitat/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError:


Could I get some help on how to fix this? Thanks.

HuskyKingdom avatar Nov 18 '22 17:11 HuskyKingdom

I have encountered the same error. Have you resolved it? Thank you.

SepMJ avatar Apr 18 '23 14:04 SepMJ

I also encountered this when using multi-node-slurm.sh for DDPPO. How can I solve it?

YinpeiDai avatar May 20 '23 00:05 YinpeiDai

I have encountered the same error. How can I solve it? Thank you.

jinxin-zhu avatar Aug 25 '23 02:08 jinxin-zhu

I have encountered the same error. How can I solve it?

Moon-heart avatar Dec 14 '23 13:12 Moon-heart

I have encountered the same error. How can I solve it?

wu-jintao avatar Feb 25 '24 13:02 wu-jintao

Hey all, it would be helpful to know a bit more. Can you try re-running with the debug environment flag set (export HABITAT_ENV_DEBUG=1)? This should make errors that happen in the multiprocessing environment explicit.

aclegg3 avatar Feb 26 '24 17:02 aclegg3
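
For readers wondering why the last error reported is an EOFError: in the traceback posted above, the worker process (ForkServerProcess-1) fails first with TypeError: 'NoneType' object is not callable, and the EOFError is then raised in the main process while it reads from that worker's connection. Below is a minimal standard-library sketch of that pattern. It is not habitat-lab code: read_worker_reply is a hypothetical helper, and closing the worker end of the pipe merely stands in for a worker that exited before replying.

```python
from multiprocessing import Pipe


def read_worker_reply(reply=None):
    """Read one message from a pipe whose worker end is already closed.

    Hypothetical helper for illustration only; not part of habitat-lab.
    """
    parent_conn, worker_conn = Pipe()
    if reply is not None:
        # Healthy worker: it sends its reply before shutting down.
        worker_conn.send(reply)
    # Stand-in for the worker exiting, for example after an uncaught exception.
    worker_conn.close()
    try:
        return parent_conn.recv()
    except EOFError:
        # Nothing is left to read and the other end of the pipe is closed.
        return "EOFError"
    finally:
        parent_conn.close()


if __name__ == "__main__":
    print(read_worker_reply("ready"))  # worker replied before closing
    print(read_worker_reply())         # worker closed without replying
```

Under these assumptions, the EOFError only says that the worker went away without sending anything. The exception raised inside the worker (here the TypeError in make_env_fn) is the one that points at the actual failure.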