manipulathor
The X server sometimes works but sometimes stops.
Hi. Thanks for your work!
I followed your instructions and found that the X server sometimes works and sometimes doesn't. Specifically, the screen freezes and stops updating. This happens right after the initialization stage. The last output on the screen is as follows:
08/13 23:26:33 INFO: Starting 0-th SingleProcessVectorSampledTasks generator with args {'mp_ctx': <multiprocessing.context.ForkServerContext object at 0x7f7c403d73d0>, 'scenes': ['FloorPlan16_physics', 'FloorPlan17_physics', 'FloorPlan18_physics', 'FloorPlan19_physics', 'FloorPlan20_physics'], 'env_args': {'gridSize': 0.25, 'width': 224, 'height': 224, 'visibilityDistance': 1.0, 'agentMode': 'arm', 'fieldOfView': 100, 'agentControllerType': 'mid-level', 'server_class': <class 'ai2thor.fifo_server.FifoServer'>, 'useMassThreshold': True, 'massThreshold': 10, 'autoSimulation': False, 'autoSyncTransforms': True, 'renderDepthImage': True, 'x_display': '0.1'}, 'max_steps': 200, 'sensors': [<ithor_arm.ithor_arm_sensors.DepthSensorThor object at 0x7f7cccede370>, <allenact_plugins.ithor_plugin.ithor_sensors.RGBSensorThor object at 0x7f7c403c7ac0>, <ithor_arm.ithor_arm_sensors.RelativeAgentArmToObjectSensor object at 0x7f7c403c7c40>, <ithor_arm.ithor_arm_sensors.RelativeObjectToGoalSensor object at 0x7f7c403c7d90>, <ithor_arm.ithor_arm_sensors.PickedUpObjSensor object at 0x7f7c403d70a0>], 'action_space': Discrete(13), 'seed': 506456969, 'deterministic_cudnn': False, 'rewards_config': {'step_penalty': -0.01, 'goal_success_reward': 10.0, 'pickup_success_reward': 5.0, 'failed_stop_reward': 0.0, 'shaping_weight': 1.0, 'failed_action_penalty': -0.03}, 'scene_period': 'manual', 'sampler_mode': 'train', 'cap_training': None} [vector_sampled_tasks.py: 975]
After this, no further console output appears, and I confirmed that the entire system has stopped working. Do you know when this happens and how I can solve it?
When I terminated the process, the following error was printed:
Traceback (most recent call last):
  File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/process.py", line 318, in _bootstrap
    util._exit_function()
  File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/util.py", line 357, in _exit_function
    p.join()
  File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/popen_fork.py", line 47, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/popen_forkserver.py", line 65, in poll
    if not wait([self.sentinel], timeout):
  File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/home/im2/anaconda3/envs/gos/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt
Traceback (most recent call last):
  File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/process.py", line 318, in _bootstrap
    util._exit_function()
  File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/util.py", line 357, in _exit_function
    p.join()
  File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/popen_fork.py", line 47, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/popen_forkserver.py", line 65, in poll
    if not wait([self.sentinel], timeout):
  File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/home/im2/anaconda3/envs/gos/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/home/im2/anaconda3/envs/gos/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt
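The traceback above only shows processes waiting in p.join() and selector.select(), so it does not reveal where a worker is actually blocked. One way to get more information is Python's standard faulthandler module, which can dump the stack of every thread periodically or on a signal. The following is a minimal sketch, assuming you can call a helper at the start of each process you want to inspect; the function name enable_hang_diagnostics and its parameters are hypothetical and are not part of manipulathor or allenact.

```python
import faulthandler
import signal
import sys


def enable_hang_diagnostics(timeout_s=60.0, log_file=sys.stderr):
    """Hypothetical helper: make a silent hang leave stack traces behind.

    log_file must be a real file object with a file descriptor.
    """
    # Dump tracebacks on fatal errors such as segmentation faults.
    faulthandler.enable(file=log_file, all_threads=True)
    # Dump the stacks of all threads every timeout_s seconds until cancelled.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
    # On POSIX systems, also dump the stacks when the process receives SIGUSR1.
    if hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=log_file, all_threads=True)
```

With this enabled, sending SIGUSR1 to a stalled process (for example with kill -USR1 followed by its PID) should print the line it is currently executing, which is usually more informative than the KeyboardInterrupt trace.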
Lastly, I want to ask whether there are any upgrade plans for this framework. Compared to the allenact repository, this framework seems quite outdated (e.g., the ai2thor version is 0.0.1, but the current version is 5.0.0). I'd really appreciate it if this were taken into consideration.
Thank you.
Hi @JisuHann,
Can you try reducing the number of processes used during training (i.e. change this line) and see if this allows training to proceed?
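As a rough illustration of what reducing the process count means when several GPUs are in use, here is a minimal sketch that spreads a total number of sampler processes evenly over the available devices. The helper split_processes is hypothetical and is not the code in the line referenced above; it only shows the arithmetic.

```python
from typing import List


def split_processes(total_processes: int, num_gpus: int) -> List[int]:
    """Hypothetical helper: spread total_processes as evenly as possible over num_gpus."""
    if total_processes < 0 or num_gpus <= 0:
        raise ValueError("total_processes must be >= 0 and num_gpus must be > 0")
    base, extra = divmod(total_processes, num_gpus)
    # The first `extra` devices each take one additional process.
    return [base + (1 if i < extra else 0) for i in range(num_gpus)]


per_gpu = split_processes(total_processes=4, num_gpus=2)
```

Starting from a small total and increasing it step by step makes it easier to see whether the hang depends on the number of simulator instances per device.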
A newer version of this codebase can be found at https://github.com/allenai/disturb-free, which was a follow-up work. The version of ai2thor used by that work is still <5.0.0, but it is more recent.
Thank you for the quick response, @Lucaweihs!
First of all, I tried reducing the number of processes, but the hang happened again. The machine has 2 RTX A6000 GPUs and 100 CPU cores, and the num_processes values I experimented with range from 2 (1 per GPU) to 40 (20 per GPU). Even with a small num_processes, a run that does not hang happens only with very low probability.
To debug this, I tried to capture where the hang occurs by printing the point at which execution stops during each episode. It turns out that it stops at the yield statement: the statements before and after it work fine, but for some reason execution stops at the yield statement while handling the observation_space_command or action_space_command at every step.
I confirmed that the command and res objects are printed correctly. For example, command would be observation_space_command or action_space_command, and res can be Dict(depth_lowres:Box(-2.0, 18.0, (224, 224, 1), float32), rgb_lowres:Box(-2.1179039478302, 2.640000104904175, (224, 224, 3), float32), relative_agent_arm_to_obj:Box(-100.0, 100.0, (3,), float32), relative_obj_to_goal:Box(-100.0, 100.0, (3,), float32), pickedup_object:Box(0.0, 1.0, (1,), float32)).
So my question is: have you experienced execution getting stuck at the yield statement, and if so, how can I solve this?
Second, I tried the disturb-free repository that you recommended, and the same problem happens again (even with a smaller number of processes). By the way, I also tried another machine with 4 GeForce RTX 3090 GPUs, but it does not work there either.
I attach the details of that machine below.
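One thing that may help when reading such prints: in Python, a generator suspends at every yield and resumes only when its caller asks for the next value with next() or send(). A process that looks stuck at a yield may therefore be waiting for whoever drives the generator, not failing inside the generator. The toy below is a sketch of that behavior only; command_loop is a hypothetical stand-in and is not the code in vector_sampled_tasks.py.

```python
import inspect


def command_loop():
    """Toy generator: each yield returns a result, then waits for the next command."""
    result = "started"
    while True:
        # Execution pauses here until the caller calls send() or next().
        command = yield result
        if command == "close":
            return
        result = f"handled {command}"


gen = command_loop()
first = next(gen)  # runs the generator up to its first yield
reply = gen.send("observation_space_command")
# Between calls the generator is parked at the yield, which is normal.
state = inspect.getgeneratorstate(gen)
```

If the real loop behaves like this, the place to look next is the code that consumes the generator, or whatever it is blocked on before sending the next command.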
- Platform: Linux-6.2.0-26-generic-x86_64-with-debian-bookworm-sid
- 4 GeForce RTX 3090, CUDA Version: 11.7
- Python version: 3.7.12
- PyTorch version: 1.13.0+cu117
- Tensorflow version: 2.7.4