
[Core] "The actor died unexpectedly before finishing this task." with aws spot instances

Open shiyuc6688 opened this issue 1 year ago • 5 comments

What happened + What you expected to happen

What happened: Previously I was using Python 3.8 with Ray 2.7.0 to run a Python parallel-processing script on a cluster of AWS spot instances, and it was running fine.

Recently I wanted to upgrade to Python 3.11 and also updated some other dependencies not related to Ray. Here is the current list of dependencies I'm using:

python -m pip install \
    ipython==8.18.1 \
    scipy==1.11.4 \
    botocore==1.33.2 \
    urllib3==2.0.7 \
    boto3==1.33.2 \
    s3fs==2023.12.2 \
    pandas==2.1.3 \
    matplotlib==3.8.2 \
    prompt_toolkit==3.0.41 \
    protobuf==3.20.3 \
    elasticsearch==8.11.0 \
    "pydantic<2" \
    "ray[default]==2.7.0" \
    dataclasses-json==0.6.3 \
    cloudpickle==3.0.0 \
    ipyparallel==8.6.1 \
    orjson==3.9.10 \
    numba==0.58.1 \
    peewee==3.17.0 \
    PyMySQL==1.1.0 \
    async-timeout==4.0.3

With these dependencies, I get this error around 30% of the time when I execute a Python script on a Ray cluster of AWS spot instances (I tried on-demand instances and there is no issue at all with them). Another thing I noticed is that the issue happens more frequently when I leave the cluster running for a longer period of time, so I think it's due to spot instance loss.

[ec2-user@ip-10-130-5-241 ~]$ ray job submit --working-dir ~/ray_test -- python ray_debug.py
Job submission server address: http://10.130.5.241:8265
2024-02-12 17:01:39,892	INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_21afd162505ebe90.zip.
2024-02-12 17:01:39,892	INFO packaging.py:518 -- Creating a file package for local directory '/home/ec2-user/ray_test'.

-------------------------------------------------------
Job 'raysubmit_TCpJyPxi2yppmKnL' submitted successfully
-------------------------------------------------------

Next steps
  Query the logs of the job:
    ray job logs raysubmit_TCpJyPxi2yppmKnL
  Query the status of the job:
    ray job status raysubmit_TCpJyPxi2yppmKnL
  Request the job to be stopped:
    ray job stop raysubmit_TCpJyPxi2yppmKnL

Tailing logs until the job exits (disable with --no-wait):

---------------------------------------
Job 'raysubmit_TCpJyPxi2yppmKnL' failed
---------------------------------------

Status message: Unexpected error occurred: The actor died unexpectedly before finishing this task.
	class_name: JobSupervisor
	actor_id: 64ef60b8270d87cae4b61a7501000000
	name: _ray_internal_job_actor_raysubmit_TCpJyPxi2yppmKnL
	namespace: SUPERVISOR_ACTOR_RAY_NAMESPACE
The actor is dead because its node has died. Node Id: 2f01240d7514bb758df373378ba943452a8af7a7071d4b9863f3d4dc
The actor never ran - it was cancelled before it started running.

The actual script I used is confidential, but I think this issue occurs regardless of which Python script I'm running, so I reproduced it with a dummy script below.

I also tried to configure the cluster manually with --system-config='{"num_heartbeats_timeout":3000, "heartbeat_timeout_milliseconds":10000}', but I think those settings are not supported anymore.
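(Aside: the sketch below is an assumption rather than verified configuration. In recent Ray releases the heartbeat settings appear to have been replaced by GCS health-check settings such as health_check_period_ms, health_check_timeout_ms, and health_check_failure_threshold, and ray.init's _system_config only takes effect when starting a new Ray instance, not when connecting to an existing cluster; on a running cluster these would instead be supplied when the head node is started.)

import ray

# Assumption: node-failure detection in newer Ray releases is driven by GCS
# health checks rather than the old heartbeat counters. The health_check_* keys
# below come from Ray's internal system config and should be verified against
# the Ray version in use before relying on them.
ray.init(
    _system_config={
        "health_check_initial_delay_ms": 5000,
        "health_check_period_ms": 3000,
        "health_check_timeout_ms": 10000,
        "health_check_failure_threshold": 10,
    }
)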

What I expected to happen: the script runs to completion with no issues.

Versions / Dependencies

Python 3.11

python -m pip install \
    ipython==8.18.1 \
    scipy==1.11.4 \
    botocore==1.33.2 \
    urllib3==2.0.7 \
    boto3==1.33.2 \
    s3fs==2023.12.2 \
    pandas==2.1.3 \
    matplotlib==3.8.2 \
    prompt_toolkit==3.0.41 \
    protobuf==3.20.3 \
    elasticsearch==8.11.0 \
    "pydantic<2" \
    "ray[default]==2.7.0" \
    dataclasses-json==0.6.3 \
    cloudpickle==3.0.0 \
    ipyparallel==8.6.1 \
    orjson==3.9.10 \
    numba==0.58.1 \
    peewee==3.17.0 \
    PyMySQL==1.1.0 \
    async-timeout==4.0.3

Reproduction script

import ray
import time

@ray.remote
def do_nothing_for_a_while(seconds):
    """A function that does nothing for a specified number of seconds."""
    time.sleep(seconds)

def main():
    # Initialize Ray
    ray.init()

    # Total wait time in seconds (20 minutes)
    total_wait_time = 20 * 60

    # Number of tasks to divide the wait time into
    num_tasks = 10

    # Time each task should wait
    wait_time_per_task = total_wait_time / num_tasks

    # Launch tasks
    tasks = [do_nothing_for_a_while.remote(wait_time_per_task) for _ in range(num_tasks)]

    # Wait for all tasks to complete
    ray.get(tasks)

    # Shutdown Ray
    ray.shutdown()

if __name__ == "__main__":
    main()
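
For what it's worth, here is a minimal variant of the same dummy script that opts the tasks into Ray's task retries (max_retries is the standard task option described in the fault-tolerance docs). This is only a sketch and assumes the failures come from preempted worker nodes killing in-flight tasks, not from the node hosting the job supervisor being reclaimed:

import ray
import time

# Sketch: same dummy workload, but tasks interrupted by a lost worker node are
# resubmitted instead of failing the whole job. max_retries=-1 retries forever.
@ray.remote(max_retries=-1)
def do_nothing_for_a_while(seconds):
    """A function that does nothing for a specified number of seconds."""
    time.sleep(seconds)

def main():
    ray.init()
    wait_time_per_task = (20 * 60) / 10  # 10 tasks covering 20 minutes total
    tasks = [do_nothing_for_a_while.remote(wait_time_per_task) for _ in range(10)]
    ray.get(tasks)
    ray.shutdown()

if __name__ == "__main__":
    main()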

Issue Severity

High: It blocks me from completing my task.

shiyuc6688 avatar Feb 12 '24 17:02 shiyuc6688

Does it happen if you don't use spot instances?

rkooo567 avatar Feb 12 '24 23:02 rkooo567

No, it does not happen when I use on-demand instances.

shiyuc6688 avatar Feb 13 '24 00:02 shiyuc6688

Just want to add that I have all the logs ready; let me know which ones are required and I can provide them. Thanks for helping.

shiyuc6688 avatar Feb 16 '24 16:02 shiyuc6688

Hmm, isn't that just the actor dying because the spot instance was interrupted, then?

rkooo567 avatar Feb 19 '24 09:02 rkooo567

Yes, that's true, but it happens almost every time I launch my cluster. I believe launching a cluster of spot instances should be supported? Are there some configurations I can change to allow Ray to work with a spot instance cluster? Also, I used to be able to use a spot cluster with Python 3.8.

shiyuc6688 avatar Feb 20 '24 15:02 shiyuc6688

@shiyuc6688,

If you are using spot instances, then you should expect that they may fail, and actors on them will fail as well; you need to handle this case: https://docs.ray.io/en/latest/ray-core/fault-tolerance.html

jjyao avatar Mar 04 '24 22:03 jjyao
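
For reference, a minimal sketch of what handling this can look like for actors, following the fault-tolerance guide linked above: max_restarts lets Ray restart an actor on a surviving node after its original node is lost, and max_task_retries resubmits method calls that were in flight. The SpotTolerantWorker class is purely illustrative.

import ray

# Sketch: an actor Ray may restart on another node if the spot instance hosting
# it is reclaimed. max_restarts=-1 allows unlimited restarts; max_task_retries=-1
# resubmits interrupted method calls. Note that in-memory state (self.processed)
# is reset on restart unless it is checkpointed externally.
@ray.remote(max_restarts=-1, max_task_retries=-1)
class SpotTolerantWorker:
    def __init__(self):
        self.processed = 0

    def process(self, item):
        self.processed += 1
        return item * 2  # placeholder for real work

if __name__ == "__main__":
    ray.init()
    worker = SpotTolerantWorker.remote()
    print(ray.get([worker.process.remote(i) for i in range(10)]))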

Ok, thanks

shiyuc6688 avatar Mar 04 '24 23:03 shiyuc6688