TimeoutError: timed out
Hi, how can I solve this timed out error? I'm executing my Python code with the nohup command.
Here is the log:
2023-07-19 21:15:15.275057: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-19 21:15:15.840035: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/home/gti/miniconda3/envs/tensor-env/lib/python3.11/site-packages/stable_baselines3/common/vec_env/patch_gym.py:49: UserWarning: You provided an OpenAI Gym environment. We strongly recommend transitioning to Gymnasium environments. Stable-Baselines3 is automatically wrapping your environments in a compatibility layer, which could potentially cause issues.
warnings.warn(
Using cuda device
Wrapping the env with a Monitor wrapper.
Wrapping the env in a DummyVecEnv.
Wrapping the env in a VecTransposeImage.
Model created!
Starting learning!
Logging to ./ppo_minerl_tensorboard/PPO_1
Traceback (most recent call last):
File "/home/gti/TFG/PPO.py", line 79, in
Enable full logging (see the example in minerl.io/docs regarding the logging library). This will provide more info about what went wrong with MineRL. Make sure you have a valid display attached, or run your code with xvfb-run -a python ...
Hi Miffyli,
I am also facing a similar problem. I am attempting to train an agent, and at around the 40th episode, with 2000 steps per episode, I receive a socket timeout error. I've attached the error message below. It appears that I am running out of memory. I would greatly appreciate your insights and suggestions on how to overcome this challenge.
Environment Details:
- Operating System: Windows 11
- Terminal Java Version: java version "1.8.0_333"
- Terminal Java Compiler Version: javac 1.8.0_333
- Python Version in Conda Environment: Python 3.9.17
- Memory: 16 GB
- CPU: Intel i7 13th Gen
- GPU: Nvidia GeForce RTX 4050
Steps Taken:
- Decreased the complexity of my script
- Moved most processes onto my GPU
- Restarted my PC
Despite these attempts, the problem still persists, and I'm unsure about how to proceed. Any guidance or additional information you could provide would be immensely helpful.
Thank you for your time and help. Please let me know if any further information is needed.
MineRL is known to leak a bit of memory (sometimes it is not a problem, sometimes it is). The best remedy is to reboot the environment every now and then. I also wrap all reset and step calls in try-except, and reboot the environment if an error is encountered.
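A minimal sketch of that pattern (the env name and the broad Exception catch here are just illustrative; adapt to your setup):

import gym
import minerl  # importing registers the MineRL envs

ENV_NAME = "MineRLBasaltFindCave-v0"  # example env

def reboot(env):
    # Best-effort teardown of the old instance, then start a fresh one.
    try:
        env.close()
    except Exception:
        pass  # the Minecraft process may already be dead
    return gym.make(ENV_NAME)

env = gym.make(ENV_NAME)
obs = env.reset()
done = False
while not done:
    try:
        obs, reward, done, info = env.step(env.action_space.sample())
    except Exception as exc:
        print(f"step() crashed ({exc}); rebooting the environment")
        env = reboot(env)
        obs = env.reset()  # this reset may need the same try-except treatment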
I've tried what you suggested and I'm still facing the timed out error, always at around 130k steps out of a 5M-step run.
Whenever the timed out error appears, I also face this other error: "only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices". Based on my code, it's coming from getting the POV out of the obs dictionary, which suggests the obs is not being created correctly.
If you want to see the last logs:
[01:59:43] [Render thread/INFO]: Environment: authHost='https://authserver.mojang.com', accountsHost='https://api.mojang.com', sessionHost='https://sessionserver.mojang.com', servicesHost='https://api.minecraftservices.com', name='PROD'
2284 DEBUG:minerl.env.malmo.instance.84c473:[01:59:43] [Render thread/INFO]: Starting integrated minecraft server version 1.16.5
2285 DEBUG:minerl.env.malmo.instance.84c473:[01:59:43] [Render thread/INFO]: Generating keypair
2286 DEBUG:minerl.env.malmo.instance.84c473:[01:59:49] [Render thread/INFO]: Preparing start region for dimension minecraft:overworld
2287 DEBUG:minerl.env.malmo.instance.84c473:[02:00:02] [Render thread/INFO]: Changing view distance to 11, from 10
2288 DEBUG:minerl.env.malmo.instance.84c473:[02:00:02] [Render thread/INFO]: MineRLAgent0[local:E:8df182a0] logged in with entity id 159 at (-509.5, 67.0, -773.5)
2289 DEBUG:minerl.env.malmo.instance.84c473:[02:00:02] [Render thread/INFO]: MineRLAgent0 joined the game
2290 DEBUG:minerl.env.malmo.instance.84c473:[02:00:02] [Render thread/INFO]: Preparing spawn area: 0%
2291 DEBUG:minerl.env.malmo.instance.84c473:[02:00:02] [Render thread/INFO]: Time elapsed: 5 ms
2292 DEBUG:minerl.env.malmo.instance.84c473:[02:00:02] [Render thread/INFO]: [STDOUT]: Starting new video null
2293 DEBUG:minerl.env.malmo.instance.84c473:[02:00:02] [Render thread/INFO]: Saving and pausing game...
2294 DEBUG:minerl.env.malmo.instance.84c473:[02:00:02] [Render thread/INFO]: Saving chunks for level 'ServerLevel[mcpworlde9ccc189cdfd]'/minecraft:overworld
2295 DEBUG:minerl.env.malmo.instance.84c473:[02:00:03] [Render thread/INFO]: Saving chunks for level 'ServerLevel[mcpworlde9ccc189cdfd]'/minecraft:the_nether
2296 DEBUG:minerl.env.malmo.instance.84c473:[02:00:03] [Render thread/INFO]: Saving chunks for level 'ServerLevel[mcpworlde9ccc189cdfd]'/minecraft:the_end
2297 DEBUG:minerl.env.malmo.instance.84c473:[02:00:03] [Render thread/INFO]: Loaded 0 advancements
2298 DEBUG:minerl.env._multiagent:Peeking the clients.
2299 DEBUG:minerl.env._multiagent:Closing MineRL env...
2300 Encountered exception during reset: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices. Recreating environment.
2301 DEBUG:minerl.env.malmo.instance.84c473:[02:00:06] [EnvServerSocketHandler/INFO]: [STDOUT]: *** Stopping the replay, returning control to the inputs
2302 INFO:process_watcher:About to reap process tree of 352313:launchClient.sh: i zombie, owner 337663, printing process tree status in termination order:
2303 INFO:process_watcher: -352313:launchClient.sh: i zombie, owner 337663
2304 INFO:process_watcher:Trying to SIGTERM 352313:launchClient.sh: i zombie, owner 337663
2305 INFO:process_watcher:Process psutil.Popen(pid=352313, name='launchClient.sh', status='terminated', exitcode=0, started='01:59:14') terminated with exit code 0
2306 Traceback (most recent call last):
2307 File "/home/dsaneng/simple.py", line 59, in step
2308 self.reset()
2309 File "/home/dsaneng/simple.py", line 79, in reset
2310 raise RuntimeError("Too many exceptions during reset, creating environment, giving up")
2311 RuntimeError: Too many exceptions during reset, creating environment, giving up
2312 During handling of the above exception, another exception occurred:
2313 Traceback (most recent call last):
2314 File "/home/dsaneng/simple.py", line 158, in
By the way, I'm executing my program on a GPU too, as @Patrickjliu said. Maybe it has to do with that?
The error is a bit confusing indeed, but it is still tied to the environment crashing; the Python code expects proper replies but gets empty buffers, and thus crashes like this.
I still think wrapping step and reset with try-except should work, and if something is raised, deleting and recreating the environment. What does the code logic look like? Note that resetting the environment after a crash might also fail for a moment (e.g., some process still hangs around), so you might need to keep try-except-resetting the environment until it works.
Running MineRL on a GPU (along with your GPU code) should not really affect things much unless you completely run out of VRAM (might want to check that). If there were some hard conflict, you would not be able to run the code in the first place.
Could you provide me with some example code of what you actually mean by deleting and creating a new environment?
I think I've done that properly and still got the error.
Right now I've decided to reset the environment every 50k timesteps, once the episode has ended, to see if that helps, given that it always fails at around 130k.
The usual

env.close()
env = gym.make("MineRLBasaltFindCave-v0")

should be enough.
I would still add try-except checks around step and reset in addition to your regular resetting, just in case. If you systematically get a crash at a specific step count, you might also want to check whether the machine's memory use grows as training progresses (or whether VRAM use increases).
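To make that concrete, a rough sketch that combines scheduled recreation with crash recovery (env name, step counts, and the broad Exception catch are all illustrative):

import gym
import minerl  # importing registers the MineRL envs

ENV_NAME = "MineRLBasaltFindCave-v0"
RECREATE_EVERY = 50_000  # planned reboot interval in steps; tune as needed

def recreate(env):
    # The usual: close the old instance (best effort), make a fresh one.
    try:
        env.close()
    except Exception:
        pass  # the old Minecraft process may already be gone
    return gym.make(ENV_NAME)

env = gym.make(ENV_NAME)
obs = env.reset()
steps_since_reboot = 0

for _ in range(200_000):
    try:
        obs, reward, done, info = env.step(env.action_space.sample())
        steps_since_reboot += 1
    except Exception as exc:
        print(f"step() failed ({exc}); recreating the environment")
        env = recreate(env)
        done, steps_since_reboot = True, 0
    if done:
        if steps_since_reboot >= RECREATE_EVERY:
            env = recreate(env)
            steps_since_reboot = 0
        obs = env.reset()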
I've tried what you suggested and I'm still getting the same error:
1736 File "/usr/local/lib/python3.8/dist-packages/stable_baselines3/common/on_policy_algorithm.py", line 259, in learn
1737 continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps)
1738 File "/usr/local/lib/python3.8/dist-packages/stable_baselines3/common/on_policy_algorithm.py", line 178, in collect_rollouts
1739 new_obs, rewards, dones, infos = env.step(clipped_actions)
1740 File "/usr/local/lib/python3.8/dist-packages/stable_baselines3/common/vec_env/base_vec_env.py", line 197, in step
1741 return self.step_wait()
1742 File "/usr/local/lib/python3.8/dist-packages/stable_baselines3/common/vec_env/vec_transpose.py", line 95, in step_wait
1743 observations, rewards, dones, infos = self.venv.step_wait()
1744 File "/usr/local/lib/python3.8/dist-packages/stable_baselines3/common/vec_env/dummy_vec_env.py", line 70, in step_wait
1745 obs, self.reset_infos[env_idx] = self.envs[env_idx].reset()
1746 File "/usr/local/lib/python3.8/dist-packages/stable_baselines3/common/monitor.py", line 83, in reset
1747 return self.env.reset(**kwargs)
1748 File "/usr/local/lib/python3.8/dist-packages/shimmy/openai_gym_compatibility.py", line 244, in reset
1749 return self.gym_env.reset(), {}
1750 File "/usr/local/lib/python3.8/dist-packages/gym/wrappers/monitor.py", line 53, in reset
1751 observation = self.env.reset(**kwargs)
1752 File "PPO.py", line 70, in reset
1753 obs = self.env.reset(**kwargs)
1754 File "/usr/local/lib/python3.8/dist-packages/gym/wrappers/time_limit.py", line 27, in reset
1755 return self.env.reset(**kwargs)
1756 File "/usr/local/lib/python3.8/dist-packages/minerl/herobraine/env_specs/basalt_specs.py", line 78, in reset
1757 return self.env.reset()
1758 File "/usr/local/lib/python3.8/dist-packages/minerl/herobraine/env_specs/basalt_specs.py", line 57, in reset
1759 return super().reset()
1760 File "/usr/local/lib/python3.8/dist-packages/gym/core.py", line 251, in reset
1761 return self.env.reset(**kwargs)
1762 File "/usr/local/lib/python3.8/dist-packages/minerl/env/_singleagent.py", line 22, in reset
1763 multi_obs = super().reset()
1764 File "/usr/local/lib/python3.8/dist-packages/minerl/env/_multiagent.py", line 446, in reset
1765 self._send_mission(self.instances[0], agent_xmls[0], self._get_token(0, ep_uid)) # Master
1766 File "/usr/local/lib/python3.8/dist-packages/minerl/env/_multiagent.py", line 605, in _send_mission
1767 reply = comms.recv_message(instance.client_socket)
1768 File "/usr/local/lib/python3.8/dist-packages/minerl/env/comms.py", line 63, in recv_message
1769 lengthbuf = recvall(sock, 4)
1770 File "/usr/local/lib/python3.8/dist-packages/minerl/env/comms.py", line 73, in recvall
1771 newbuf = sock.recv(count)
1772 socket.timeout: timed out
1773 ERROR:minerl.env.malmo:Attempted to send kill command to minecraft process and failed with exception timed out
1774 INFO:process_watcher:About to reap process tree of 115:launchClient.sh:/usr/bin/bash i sleeping, owner 44, printing process tree status in termination order:
1775 INFO:process_watcher: -118:java:/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java i sleeping, owner 115
1776 INFO:process_watcher: -115:launchClient.sh:/usr/bin/bash i sleeping, owner 44
1777 INFO:process_watcher:Trying to SIGTERM 118:java:/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java i sleeping, owner 115
1778 INFO:process_watcher:Process 118 survived SIGTERM; trying SIGKILL on 118:java:/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java i sleeping, owner 115
1779 DEBUG:minerl.env.malmo.instance.59a811:/usr/local/lib/python3.8/dist-packages/minerl/env/../MCP-Reborn/launchClient.sh: line 52: 118 Killed java -Xmx$maxMem -jar $fatjar --envPort=$port
1780 INFO:process_watcher:Process psutil.Process(pid=118, name='java', status='terminated', started='16:40:35') terminated with exit code None
1781 INFO:process_watcher:Trying to SIGTERM 115:launchClient.sh:/usr/bin/bash i zombie, owner 44
1782 INFO:process_watcher:Process psutil.Popen(pid=115, name='launchClient.sh', status='terminated', exitcode=0, started='16:40:35') terminated with exit code 0.
Here you can see my code:
import gym
from time import sleep
from gym.spaces import Discrete
from gym.wrappers import Monitor
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import BaseCallback
import minerl
import wandb
import logging

logging.basicConfig(level=logging.DEBUG)

config = {
    "policy_type": "CnnPolicy",
    "total_timesteps": 5000000,
    "env_name": "MineRLBasaltFindCave-v0",
    "run_name": "PPO_MineRLBasaltFindCave-v0"
}

run = wandb.init(
    project="MineRL",
    entity="minerl_tfg",
    name=config["run_name"],
    config=config,
    sync_tensorboard=True,
    monitor_gym=True,
    save_code=True,
)

def make_env():
    env = gym.make(config["env_name"])
    env = BitMaskWrapper(env)  # Apply BitMaskWrapper first
    env = Monitor(env, directory="monitor_results", force=True)  # Then apply Monitor
    print("New environment created!!!")
    return env
class BitMaskWrapper(gym.Wrapper):
    def __init__(self, env):
        super(BitMaskWrapper, self).__init__(env)
        self.orig_action_space = self.action_space
        self.action_space = gym.spaces.Discrete(32)  # Modify the action space to Discrete(32)
        self.observation_space = self.observation_space['pov']
        self.noop_action = self.orig_action_space.noop()  # Pre-calculate no-op action

    def step(self, action):
        while True:  # Keep trying to step until successful
            try:
                assert 0 <= action < 32, "Invalid action"  # 5 bits -> 32 valid actions
                masked_action = self._apply_bit_mask(action)
                obs, reward, done, info = self.env.step(masked_action)
                if info:  # Print info only if it's not empty
                    print("Info dictionary:", info)
                obs = obs["pov"]
                obs = obs / 255.0
                return obs, reward, done, info
            except TimeoutError:
                print("A TimeoutError occurred. Trying to recreate the environment...\n")
                self.env.close()
                self.env = make_env()  # Recreate the environment
                sleep(1)  # Adding a delay to ensure proper cleanup

    def reset(self, **kwargs):
        while True:  # Keep trying to reset until successful
            try:
                obs = self.env.reset(**kwargs)
                obs = obs["pov"]
                obs = obs / 255.0
                return obs
            except TimeoutError:
                print("A TimeoutError occurred during reset. Trying to recreate the environment...")
                self.env.close()
                self.env = make_env()  # Recreate the environment
                sleep(1)  # Adding a delay to ensure proper cleanup

    def _apply_bit_mask(self, action):
        """Applies the bit mask to the action: one bit per movement key."""
        back_m = action & 1
        forward_m = (action >> 1) & 1
        left_m = (action >> 2) & 1
        right_m = (action >> 3) & 1
        sprint_m = (action >> 4) & 1
        action = self.noop_action.copy()
        action['sprint'] = sprint_m
        action['right'] = right_m
        action['left'] = left_m
        action['forward'] = forward_m
        action['back'] = back_m
        return action

    def get_action_meanings(self):
        # Override this method to reflect the modified action space
        return [str(i) for i in range(self.action_space.n)]

    def render(self, mode='human', **kwargs):
        # Override the render method if necessary
        return self.env.render(mode, **kwargs)

    def seed(self, seed=None):
        # Forward the seed call to the wrapped environment
        return self.env.seed(seed)
class WandbCallback(BaseCallback):
    def __init__(self, verbose=0):
        super(WandbCallback, self).__init__(verbose)
        self.last_logged_episode = -1

    def _on_rollout_end(self):
        env = self.training_env.envs[0].env
        rewards = env.get_episode_rewards()
        lengths = env.get_episode_lengths()
        if len(rewards) > self.last_logged_episode + 1:
            mean_reward = sum(rewards[self.last_logged_episode+1:]) / len(rewards[self.last_logged_episode+1:])
            mean_length = sum(lengths[self.last_logged_episode+1:]) / len(lengths[self.last_logged_episode+1:])
            total_timesteps = env.get_total_steps()  # Retrieve total steps from the Monitor wrapper
            wandb.log({
                'mean_reward': mean_reward,
                'mean_episode_length': mean_length,
                'total_timesteps': total_timesteps
            })
            self.last_logged_episode = len(rewards) - 1

    def _on_step(self):
        return True
# Create the BitMaskWrapper around the MineRL environment
env = make_env()
print("Environment created!\n")

# Create your model (e.g., PPO)
model = PPO(config["policy_type"], env, verbose=0, device="cuda")
print("PPO model created!\n")

# Train your model with the callback
model.learn(total_timesteps=config["total_timesteps"], callback=WandbCallback())
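Side note on the bit mask above: each Discrete(32) action index is just a 5-bit pattern over back/forward/left/right/sprint. A quick standalone check of the decoding (hypothetical snippet, not part of the training script):

# Decode action index 19 (0b10011) the same way _apply_bit_mask does
action = 19
keys = ['back', 'forward', 'left', 'right', 'sprint']
print({name: (action >> bit) & 1 for bit, name in enumerate(keys)})
# -> {'back': 1, 'forward': 1, 'left': 0, 'right': 0, 'sprint': 1}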
I haven't been able to solve this.
I've changed the code so that it catches every exception and tries to recreate the environment. Now, after the same 120-130k timesteps, I get this error over and over, even when recreating the environment as in the earlier comment.
Error:
1305 ----------------------------------------
1306 | rollout/ | |
1307 | ep_len_mean | 3.11e+03 |
1308 | ep_rew_mean | 0 |
1309 | time/ | |
1310 | fps | 17 |
1311 | iterations | 62 |
1312 | time_elapsed | 7087 |
1313 | total_timesteps | 126976 |
1314 | train/ | |
1315 | approx_kl | 0.24600112 |
1316 | clip_fraction | 0.726 |
1317 | clip_range | 0.2 |
1318 | entropy_loss | -3.34 |
1319 | explained_variance | -1.13 |
1320 | learning_rate | 0.0003 |
1321 | loss | -0.152 |
1322 | n_updates | 610 |
1323 | policy_gradient_loss | -0.117 |
1324 | value_loss | 0.00017 |
1325 ----------------------------------------
1326 Info dictionary: {'TimeLimit.truncated': False}
1327 The following error occurred during reset: timed out
1328 Trying to recreate the environment...
1329 Attempted to send kill command to minecraft process and failed with exception timed out
1330 New environment created!!!
1331 The following error occurred during reset: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
1332 Trying to recreate the environment...
1333 New environment created!!!
1334 The following error occurred during reset: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
1335 Trying to recreate the environment...
1336 New environment created!!!
1337 The following error occurred during reset: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
1338 Trying to recreate the environment...
1339 New environment created!!!
1340 The following error occurred during reset: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
1341 Trying to recreate the environment...
1342 New environment created!!!
I would add an additional try-except inside make_env, maybe with a loop that retries a few times (with a 60s delay), and if the environment then still fails to start properly, give up and crash. Apart from that, it's hard to say what is happening. It sounds like something goes wrong after training for long enough (e.g., running out of RAM), given that the environment crashes at a very specific point. I would investigate that further.
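Something along these lines, for instance (the retry count and delay are placeholders, and the reset-as-smoke-test is just one way to force the Minecraft instance to actually start):

import time
import gym
import minerl  # importing registers the MineRL envs

def make_env(env_name="MineRLBasaltFindCave-v0", retries=3, delay=60):
    # Try to build (and smoke-test) the env a few times before giving up.
    for attempt in range(1, retries + 1):
        try:
            env = gym.make(env_name)
            env.reset()  # forces the Minecraft instance to start
            return env
        except Exception as exc:
            print(f"make_env attempt {attempt}/{retries} failed: {exc}")
            time.sleep(delay)  # give lingering Minecraft processes time to die
    raise RuntimeError(f"Giving up on {env_name} after {retries} failed attempts")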
Hi, I'm back.
First of all, since you pointed out that it could be a memory issue, I've moved the executions to an HPC cluster to avoid those types of issues. For that I had to create a Singularity container based on the eg-docker repository that you have in the MineRL documentation. Maybe I should create a repository with the files to reproduce the container, so that anyone who wants to use this type of environment can copy it and have everything working.
On the other hand, once I got the Singularity container working I executed my Python code and I still got the same timed out error at the same timestep count (around 120k-130k), with the try-except as you suggested. Since I'm using Wandb to monitor the process, I have access to the system metrics: CPU usage was fine, disk usage reached 82%, only 7GB out of the 48GB of GPU VRAM were used, and the program used only 5GB of RAM. So I can assume the problem has nothing to do with the computer's resources.
It's also strange that both the other computer I was running on and the HPC cluster hit the same problem at the same timestep count. Maybe it has something to do with MineRL itself?
Heya.
Re docker: yup, sharing results and environments is always good and helps anyone having trouble with things! I would also happily accept a PR that adds a link to your repo to the MineRL docs, if you have the time to create one :)
Re timeouts: huh, that does sound like something that could be off with MineRL. Again, it does experience crashes regardless of the underlying system, but I did not expect them to be so regular across machines. One could spend time debugging this, but realistically, this might be intrinsic to Minecraft as well, since we are using it in very unintended ways (re-creating worlds many times). Finding the core issue could be a big effort (or not; hard to say without knowing where to even look 😅).
A quick remedy is to wrap things in try-except and/or recreate the environment at regular intervals.
Hi,
Re PR: I've opened the pull request with the additions to the documentation, check it out!
Re executions: Okay, I'm going to try recreating the environment at regular intervals, and if that doesn't work I'll try different things to see if I find a solution. Thanks for your support; I will comment on this issue if I find any solution :)
Hi,
I have good news!
I've finally solved the time out error by downgrading the MineRL library to the 0.4.4 version. More than a month for such an easy solution :_)
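(For anyone landing here later: the downgrade itself should just be pip install minerl==0.4.4 in the same environment; note that the 0.4.x API differs somewhat from 1.0, as mentioned below.)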
I've also found that training, with the same Python code, the same environment and the same computer, is now about 35% faster. With the 1.0.0 version, the 120k timesteps where I got the error took roughly 2 hours; with the 0.4.4 version, they took 1 hour and 20 minutes.
I hope this thread helped someone and at least it brought some good things such as the Singularity container.
Now I'm able to finish my Bachelor's Thesis and hopefully graduate :)
Feel free to answer and close the issue and thank you so much for your support, it's awesome that you are helping us every day <3
Nice work @Sanfee18! Do note that 0.4.4 is quite a bit different, but if it suffices for your work, then I'd recommend it indeed :). 0.3.7 might even be faster, if that still applies to your case.
Hi @Sanfee18, have you ever been able to train the PPO agent? I tried to use your code with minerl v1.0.1 but the reward remained zero for 110k steps.
Or, have you run into the reward issue here?
Hi @huangdi95, getting rewards with PPO was almost impossible. I got some, but I feel like it was because of the action mapping, which occasionally set the camera pointing towards a tree so it eventually broke some wood. That's all I can tell you.
I wouldn't bother trying to get rewards out of RL alone; I couldn't get any results.
Thanks! I see.