[Job] Failed to schedule supervisor actor leads to job failure
What happened + What you expected to happen
- Submitted a job, but it failed almost immediately with no logs.
- Expected it to run as usual.
- The same job ran successfully ~30 minutes before and ~30 minutes after this failure, and we noticed that head-node CPU usage spiked to 100% during the time window shown in the logs.
Logs from running grep <job ID> in the ray/session_latest/logs directory:
gcs_server.out:[2023-05-16 00:00:21,097 W 18 18] (gcs_server) gcs_actor_manager.cc:417: Actor with name '_ray_internal_job_actor_RTFUOWXMKFADKFFG7RLS' was not found.
gcs_server.out:[2023-05-16 00:00:21,139 I 18 18] (gcs_server) gcs_actor_manager.cc:683: Actor name _ray_internal_job_actor_RTFUOWXMKFADKFFG7RLS is cleand up.
events/event_JOBS.log:{"event_id": "7bE0eD27DE95dCd48CfDd7bf2fF46EE1dbEb", "source_type": "JOBS", "source_hostname": "prod-head-8f42x", "source_pid": 173, "message": "Started a ray job RTFUOWXMKFADKFFG7RLS.", "timestamp": "1684195221", "custom_fields": {"submission_id": "RTFUOWXMKFADKFFG7RLS"}, "severity": "INFO", "label": ""}
events/event_JOBS.log:{"event_id": "19f08d61Cb9e7A4F2fEC79FB01e49aE84727", "source_type": "JOBS", "source_hostname": "prod-head-8f42x", "source_pid": 173, "message": "Completed a ray job RTFUOWXMKFADKFFG7RLS with a status PENDING.", "timestamp": "1684195221", "custom_fields": {"submission_id": "RTFUOWXMKFADKFFG7RLS"}, "severity": "INFO", "label": ""}
dashboard.log:2023-05-16 00:00:22,017 INFO web_log.py:206 -- 30.30.155.110 [16/May/2023:00:00:21 +0000] 'GET /api/jobs/RTFUOWXMKFADKFFG7RLS HTTP/1.1' 200 743 bytes 749388 us '-' 'python-requests/2.25.1'
dashboard_agent.log:2023-05-16 00:00:20,991 INFO job_manager.py:891 -- Starting job with submission_id: RTFUOWXMKFADKFFG7RLS
dashboard_agent.log:2023-05-16 00:00:21,142 INFO job_manager.py:684 -- Failed to schedule job RTFUOWXMKFADKFFG7RLS because the supervisor actor could not be scheduled: The actor is not schedulable: The node specified via NodeAffinitySchedulingStrategy doesn't exist any more or is infeasible, and soft=False was specified.
python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_173.log:[2023-05-16 00:00:21,098 W 173 173] actor_manager.cc:112: Failed to look up actor with name '_ray_internal_job_actor_RTFUOWXMKFADKFFG7RLS'. This could because 1. You are trying to look up a named actor you didn't create. 2. The named actor died. 3. You did not use a namespace matching the namespace of the actor.
python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_173.log:[2023-05-16 00:00:21,140 I 173 248] task_manager.cc:535: Task failed: SchedulingCancelled: Actor creation cancelled.: Type=ACTOR_CREATION_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=ray.dashboard.modules.job.job_manager, class_name=JobSupervisor, function_name=__init__, function_hash=a8fbf10378f54c2db4819596a2370709}, task_id=ffffffffffffffff89f677682ee293763251637d01000000, task_name=_ray_internal_job_actor_RTFUOWXMKFADKFFG7RLS:JobSupervisor.__init__, job_id=01000000, num_args=8, num_returns=1, depth=1, attempt_number=0, actor_creation_task_spec={actor_id=89f677682ee293763251637d01000000, max_restarts=0, max_retries=0, max_concurrency=1000, is_asyncio_actor=1, is_detached=1}, runtime_env_hash=-1400121655, eager_install=1, setup_timeout_seconds=600
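For context, the cancelled task in the last log line is the JobSupervisor actor creation, and the error text matches what Ray reports when a hard (soft=False) node-affinity target is missing or infeasible. The snippet below is a minimal sketch, not the job manager's own code: the Dummy actor and the placeholder node ID are made up for illustration, and it simply triggers the same class of error by pinning an actor to a node ID that is not in the cluster.

import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

ray.init()

@ray.remote
class Dummy:
    def ping(self):
        return "pong"

# Placeholder node ID (56 hex chars) that is not part of the cluster.
bogus_node_id = "ff" * 28

# soft=False makes the constraint hard: if the target node is gone or
# infeasible, the actor creation is cancelled instead of falling back
# to another node.
actor = Dummy.options(
    scheduling_strategy=NodeAffinitySchedulingStrategy(
        node_id=bogus_node_id, soft=False
    )
).remote()

try:
    ray.get(actor.ping.remote())
except ray.exceptions.RayError as e:
    # Expect a message similar to: "The node specified via
    # NodeAffinitySchedulingStrategy doesn't exist any more or is
    # infeasible, and soft=False was specified."
    print(type(e).__name__, e)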
Versions / Dependencies
Ray 2.4, Python 3.9.10, Ubuntu 20.04
Reproduction script
Occurs randomly from our perspective, so we have no reliable reproduction. We are mainly looking for more insight into the error and guidance on what additional debugging information to collect.
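When it does happen, this is roughly what we can gather after the fact via the Job Submission SDK; a sketch assuming the dashboard is reachable at its default address (the address below is a placeholder, the submission ID is the one from the logs above).

from ray.job_submission import JobSubmissionClient

# Placeholder dashboard address; submission ID taken from the logs above.
client = JobSubmissionClient("http://127.0.0.1:8265")
submission_id = "RTFUOWXMKFADKFFG7RLS"

# Status, error message, and driver metadata recorded by the job manager.
info = client.get_job_info(submission_id)
print(info.status, info.message)

# Any driver logs captured before the failure (empty in this case, since
# the supervisor actor was never scheduled).
print(client.get_job_logs(submission_id))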
Issue Severity
Medium: It is a significant difficulty but I can work around it.
This P2 issue has seen no activity in the past 2 years. It will be closed in 2 weeks as part of ongoing cleanup efforts.
Please comment and remove the pending-cleanup label if you believe this issue should remain open.
Thanks for contributing to Ray!