[Core] "The actor died unexpectedly before finishing this task." with AWS spot instances
What happened + What you expected to happen
What happened? Previously I was using Python 3.8 with Ray 2.7.0 to run a Python parallel-processing script on a cluster of AWS spot instances, and it ran fine.
Recently I upgraded to Python 3.11 and also updated some other dependencies not related to Ray. Here is the current list of dependencies I'm using:
python -m pip install \
    ipython==8.18.1 \
    scipy==1.11.4 \
    botocore==1.33.2 \
    urllib3==2.0.7 \
    boto3==1.33.2 \
    s3fs==2023.12.2 \
    pandas==2.1.3 \
    matplotlib==3.8.2 \
    prompt_toolkit==3.0.41 \
    protobuf==3.20.3 \
    elasticsearch==8.11.0 \
    "pydantic<2" \
    "ray[default]==2.7.0" \
    dataclasses-json==0.6.3 \
    cloudpickle==3.0.0 \
    ipyparallel==8.6.1 \
    orjson==3.9.10 \
    numba==0.58.1 \
    peewee==3.17.0 \
    PyMySQL==1.1.0 \
    async-timeout==4.0.3
With these dependencies, I get this error around 30% of the time when I run a Python script on a Ray cluster of AWS spot instances (I also tried on-demand instances, and there is no issue at all with on-demand instances). Another thing I noticed is that the issue happens more frequently when I leave the cluster running for a longer period of time, so I think it's due to spot instance loss.
[ec2-user@ip-10-130-5-241 ~]$ ray job submit --working-dir ~/ray_test -- python ray_debug.py
Job submission server address: http://10.130.5.241:8265
2024-02-12 17:01:39,892 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_21afd162505ebe90.zip.
2024-02-12 17:01:39,892 INFO packaging.py:518 -- Creating a file package for local directory '/home/ec2-user/ray_test'.
-------------------------------------------------------
Job 'raysubmit_TCpJyPxi2yppmKnL' submitted successfully
-------------------------------------------------------
Next steps
Query the logs of the job:
ray job logs raysubmit_TCpJyPxi2yppmKnL
Query the status of the job:
ray job status raysubmit_TCpJyPxi2yppmKnL
Request the job to be stopped:
ray job stop raysubmit_TCpJyPxi2yppmKnL
Tailing logs until the job exits (disable with --no-wait):
---------------------------------------
Job 'raysubmit_TCpJyPxi2yppmKnL' failed
---------------------------------------
Status message: Unexpected error occurred: The actor died unexpectedly before finishing this task.
class_name: JobSupervisor
actor_id: 64ef60b8270d87cae4b61a7501000000
name: _ray_internal_job_actor_raysubmit_TCpJyPxi2yppmKnL
namespace: SUPERVISOR_ACTOR_RAY_NAMESPACE
The actor is dead because its node has died. Node Id: 2f01240d7514bb758df373378ba943452a8af7a7071d4b9863f3d4dc
The actor never ran - it was cancelled before it started running.
The actual script I use is confidential, but I think this issue occurs regardless of which Python script I'm running, so I reproduced it with the dummy script below.
I also tried to manually configure --system-config='{"num_heartbeats_timeout":3000, "heartbeat_timeout_milliseconds":10000}', but I think those options are no longer supported.
What I expected to happen: I expect the script to run with no issues.
Versions / Dependencies
Python 3.11
python -m pip install \
    ipython==8.18.1 \
    scipy==1.11.4 \
    botocore==1.33.2 \
    urllib3==2.0.7 \
    boto3==1.33.2 \
    s3fs==2023.12.2 \
    pandas==2.1.3 \
    matplotlib==3.8.2 \
    prompt_toolkit==3.0.41 \
    protobuf==3.20.3 \
    elasticsearch==8.11.0 \
    "pydantic<2" \
    "ray[default]==2.7.0" \
    dataclasses-json==0.6.3 \
    cloudpickle==3.0.0 \
    ipyparallel==8.6.1 \
    orjson==3.9.10 \
    numba==0.58.1 \
    peewee==3.17.0 \
    PyMySQL==1.1.0 \
    async-timeout==4.0.3
Reproduction script
import time

import ray


@ray.remote
def do_nothing_for_a_while(seconds):
    """A function that does nothing for a specified number of seconds."""
    time.sleep(seconds)


def main():
    # Initialize Ray
    ray.init()

    # Total wait time in seconds (20 minutes)
    total_wait_time = 20 * 60

    # Number of tasks to divide the wait time into
    num_tasks = 10

    # Time each task should wait
    wait_time_per_task = total_wait_time / num_tasks

    # Launch tasks
    tasks = [do_nothing_for_a_while.remote(wait_time_per_task) for _ in range(num_tasks)]

    # Wait for all tasks to complete
    ray.get(tasks)

    # Shutdown Ray
    ray.shutdown()


if __name__ == "__main__":
    main()
Issue Severity
High: It blocks me from completing my task.
Does it happen if you don't use spot instances?
No, it does not happen when I use on-demand instances.
Just want to add that I have all the logs ready; let me know which ones are required and I can provide them. Thanks for helping.
Hmm, isn't that just the actor dying because the spot instance was interrupted, then?
Yes, that's true, but it happens almost every time I launch my cluster. I believe launching a spot instance cluster should be supported? Is there some configuration I can change to allow Ray to work with a spot instance cluster? Also, I used to be able to use a spot cluster with Python 3.8.
@shiyuc6688,
If you are using spot instances, then you should expect that they may be reclaimed and that actors running on them will fail as well; you need to handle this case: https://docs.ray.io/en/latest/ray-core/fault-tolerance.html
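For example, here is a minimal sketch of the task-retry and actor-restart options that guide describes, reusing the dummy task from the reproduction script above; the Worker actor is purely illustrative, not something from the original workload:

import time

import ray

ray.init()

# Tasks: max_retries tells Ray to re-execute the task if the node running it
# dies (e.g. when a spot instance is reclaimed).
@ray.remote(max_retries=3)
def do_nothing_for_a_while(seconds):
    time.sleep(seconds)
    return seconds

# Actors: max_restarts recreates the actor on another node after it crashes,
# and max_task_retries re-runs the method calls that were in flight.
@ray.remote(max_restarts=-1, max_task_retries=-1)
class Worker:
    def work(self, seconds):
        time.sleep(seconds)
        return seconds

# Lost tasks are resubmitted on surviving nodes instead of failing the job.
print(ray.get([do_nothing_for_a_while.remote(5) for _ in range(4)]))

worker = Worker.remote()
print(ray.get(worker.work.remote(5)))

ray.shutdown()

This does not change anything about the cluster itself; it only tells Ray which work is safe to retry when a node disappears.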
Ok, thanks