flingbot
When I run `python run_sim.py`, the worker died or was killed by an unexpected system error
When I run

python run_sim.py --eval --tasks flingbot-normal-rect-eval.hdf5 --load flingbot.pth --num_processes 1 --gui

the error shows:

ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
2021-11-22 15:10:23,194 WARNING worker.py:1228 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker.
RayTask ID: ffffffffffffffff341cd030556402df7c59625701000000
Worker ID: 4f72e151e496fac468e1c730556e291e00ec1cfb29882f51097186fd
Node ID: d4a9eb590967aeb63fe838e2eca52cf666565bf009207c0ec4a730e6
Worker IP address: 192.168.1.106
Worker port: 41747
Worker PID: 18687

I don't know why this issue occurs; could you please help me?
Also, there is no 'replay_buffer.hdf5' in the 'flingbot_eval_X' directory.
I met the same issue. I thought it was a Ray version problem, but after testing it turned out to be something else. Have you solved it yet?
Hey,
I got the same error. Looking through the Ray logs, it's because Ray can't find the GPU.
To fix this, there's a line in utils.setup_envs:
envs = [ray.remote(SimEnv).options(
    num_gpus=torch.cuda.device_count() / num_processes,
    num_cpus=0.1).remote(
        replay_buffer_path=dataset,
        get_task_fn=lambda: ray.get(task_loader.get_next_task.remote()),
        **kwargs)
    for _ in range(num_processes)]
The problem is that torch is installed as the CPU-only build, which gives torch.cuda.device_count() == 0 and consequently num_gpus=0. Hardcoding this to 1 (or however many GPUs you're using) fixes the problem!
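For reference, a minimal sketch of that workaround, replacing the torch.cuda.device_count() call with a hardcoded GPU count (assumed to be 1 here; adjust to your machine):

NUM_GPUS = 1  # hardcoded: torch.cuda.device_count() returns 0 on a CPU-only torch install
envs = [ray.remote(SimEnv).options(
    num_gpus=NUM_GPUS / num_processes,
    num_cpus=0.1).remote(
        replay_buffer_path=dataset,
        get_task_fn=lambda: ray.get(task_loader.get_next_task.remote()),
        **kwargs)
    for _ in range(num_processes)]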
Instead of hardcoding the number of GPUs: I found out that the PyTorch installation and the CUDA libraries installed via the flingbot.yml file are not set up properly. Notice that torch.cuda.is_available() returns False. You can solve this by re-installing PyTorch using pip, whose wheels already bundle a compatible CUDA runtime. Then verify that torch.cuda.is_available() is True and that the device count is correct.
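A quick sanity check after re-installing (plain PyTorch, nothing repo-specific):

import torch

# Should print True and a non-zero device count on a working CUDA install
print(torch.cuda.is_available())
print(torch.cuda.device_count())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))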
My problem is that when I run the evaluation command, the terminal stops at "Evaluating flingbot.pth: saving to flingbot_eval_X/replay_buffer.hdf5" once the animation finishes, and nothing else happens; replay_buffer.hdf5 never appears in the directory.
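In case it helps with debugging, here is a small sketch to check whether the replay buffer was actually written and what it contains. It assumes h5py is installed and uses the flingbot_eval_X path from the log message above; adjust the path to your run's output directory:

import os
import h5py

path = "flingbot_eval_X/replay_buffer.hdf5"  # example path, taken from the log line above

if not os.path.exists(path):
    print("replay buffer was never written:", path)
else:
    with h5py.File(path, "r") as f:
        # List the top-level groups/datasets to see whether any episodes were stored
        print("keys:", list(f.keys()))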
Have you successfully run the code from this repository?