flingbot
RayActorError: The actor died unexpectedly before finishing this task.
ID: fffffffffffffffffd5f641d1e7ed592e00f065c01000000
Worker ID: 20556ec92d7abbb0b2bf7fb0d363e865ab7587196b608cb3605c3f2f
Node ID: 75d7f57c65e988cab741a0c5412548fb08f8fcb7e118eb82fc38fc4d
Worker IP address: 172.17.0.2
Worker port: 37185
Worker PID: 1224
Worker exit type: SYSTEM_ERROR
Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Traceback (most recent call last):
File "run_sim.py", line 46, in
Have you solved the problem yet? I had the same problem.
I haven't solved the problem yet.
I haven't solved the problem yet.
Have you solved the problem yet?
I'm sorry I haven't solved the problem yet.
Have you solved the problem yet? I had the same problem.
Have you solved the problem yet? I had the same problem. Thank you!
I haven't solved the problem yet.
Have you solved the problem yet?
Have you solved the problem yet? I had the same problem.
I recommend tracking RAM usage, since it looks like your Ray actors are being killed by the OOM killer.
If RAM is indeed the issue, use a smaller number of parallel environments by setting `num_processes` accordingly.
If the OOM killer is not what's killing your Ray actors, run with a single environment (`num_processes=1`) and pass `local_mode=True` when you initialize Ray. This should give more informative error messages, and you can debug from there. For instance, the `pyflex.init` call could fail for many reasons, for which I would refer to the pyflex issues page.
Hope this helps!
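For reference, a minimal sketch of the debugging setup described above. `ray.init(local_mode=True)` is a standard Ray option; the RAM check via `psutil` is just one possible way to watch memory usage and is not part of the flingbot code:

```python
import ray
import psutil

# Report current system RAM usage, to check whether the OOM killer
# is a plausible cause of the dying actors.
mem = psutil.virtual_memory()
print(f"RAM used: {mem.percent}% "
      f"({mem.used / 1e9:.1f} GB of {mem.total / 1e9:.1f} GB)")

# Run Ray in local mode so actors execute in the driver process and
# failures (e.g. inside pyflex.init) surface as ordinary tracebacks
# instead of a generic RayActorError.
ray.init(local_mode=True)
```

With a single environment and local mode, throughput drops, but the actual exception from the crashed worker is no longer hidden behind the RayActorError.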
Hello, may I ask how many training steps it takes to complete the training?
Hey there! From the paper:
All policies are trained in simulation until convergence, which takes around 150,000 simulation steps, or 6 days on a machine with a GTX 1080 Ti