
[Job] Failure to schedule the supervisor actor leads to job failure

Open · spolcyn opened this issue 2 years ago · 1 comment

What happened + What you expected to happen

  1. Submitted a job, but it failed almost immediately with no logs.
  2. Expected it to run as usual
  3. The same job ran successfully ~30min before and ~30min after, and we noticed head CPU usage spiked to 100% during the time period specified in the logs.

Logs from running `grep <job ID>` in the `ray/session_latest/logs` directory:

gcs_server.out:[2023-05-16 00:00:21,097 W 18 18] (gcs_server) gcs_actor_manager.cc:417: Actor with name '_ray_internal_job_actor_RTFUOWXMKFADKFFG7RLS' was not found.
gcs_server.out:[2023-05-16 00:00:21,139 I 18 18] (gcs_server) gcs_actor_manager.cc:683: Actor name _ray_internal_job_actor_RTFUOWXMKFADKFFG7RLS is cleand up.
events/event_JOBS.log:{"event_id": "7bE0eD27DE95dCd48CfDd7bf2fF46EE1dbEb", "source_type": "JOBS", "source_hostname": "prod-head-8f42x", "source_pid": 173, "message": "Started a ray job RTFUOWXMKFADKFFG7RLS.", "timestamp": "1684195221", "custom_fields": {"submission_id": "RTFUOWXMKFADKFFG7RLS"}, "severity": "INFO", "label": ""}
events/event_JOBS.log:{"event_id": "19f08d61Cb9e7A4F2fEC79FB01e49aE84727", "source_type": "JOBS", "source_hostname": "prod-head-8f42x", "source_pid": 173, "message": "Completed a ray job RTFUOWXMKFADKFFG7RLS with a status PENDING.", "timestamp": "1684195221", "custom_fields": {"submission_id": "RTFUOWXMKFADKFFG7RLS"}, "severity": "INFO", "label": ""}
dashboard.log:2023-05-16 00:00:22,017   INFO web_log.py:206 -- 30.30.155.110 [16/May/2023:00:00:21 +0000] 'GET /api/jobs/RTFUOWXMKFADKFFG7RLS HTTP/1.1' 200 743 bytes 749388 us '-' 'python-requests/2.25.1'
dashboard_agent.log:2023-05-16 00:00:20,991     INFO job_manager.py:891 -- Starting job with submission_id: RTFUOWXMKFADKFFG7RLS
dashboard_agent.log:2023-05-16 00:00:21,142     INFO job_manager.py:684 -- Failed to schedule job RTFUOWXMKFADKFFG7RLS because the supervisor actor could not be scheduled: The actor is not schedulable: The node specified via NodeAffinitySchedulingStrategy doesn't exist any more or is infeasible, and soft=False was specified.
python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_173.log:[2023-05-16 00:00:21,098 W 173 173] actor_manager.cc:112: Failed to look up actor with name '_ray_internal_job_actor_RTFUOWXMKFADKFFG7RLS'. This could because 1. You are trying to look up a named actor you didn't create. 2. The named actor died. 3. You did not use a namespace matching the namespace of the actor.
python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_173.log:[2023-05-16 00:00:21,140 I 173 248] task_manager.cc:535: Task failed: SchedulingCancelled: Actor creation cancelled.: Type=ACTOR_CREATION_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=ray.dashboard.modules.job.job_manager, class_name=JobSupervisor, function_name=__init__, function_hash=a8fbf10378f54c2db4819596a2370709}, task_id=ffffffffffffffff89f677682ee293763251637d01000000, task_name=_ray_internal_job_actor_RTFUOWXMKFADKFFG7RLS:JobSupervisor.__init__, job_id=01000000, num_args=8, num_returns=1, depth=1, attempt_number=0, actor_creation_task_spec={actor_id=89f677682ee293763251637d01000000, max_restarts=0, max_retries=0, max_concurrency=1000, is_asyncio_actor=1, is_detached=1}, runtime_env_hash=-1400121655, eager_install=1, setup_timeout_seconds=600

Versions / Dependencies

Ray 2.4, Python 3.9.10, Ubuntu 20.04

Reproduction script

Occurs randomly from our perspective; we are mainly looking for more insight into the error, or guidance on what additional debugging information to collect.
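Since the same job succeeded ~30 minutes before and after the failure, the problem looks transient. Until the root cause is found, one possible mitigation (a sketch, not something from this report) is to retry submission when it fails this way. Here `submit_fn` is a hypothetical stand-in for whatever wrapper you have around `JobSubmissionClient.submit_job`; the attempt count and delay are assumptions you would tune:

```python
import time


def submit_with_retry(submit_fn, attempts=3, delay_s=5.0):
    """Retry a transiently failing job submission.

    `submit_fn` is any zero-argument callable that performs the
    submission (e.g. a thin wrapper around
    ray.job_submission.JobSubmissionClient.submit_job) and raises
    an exception on failure. Returns whatever `submit_fn` returns
    on the first successful attempt.
    """
    last_exc = None
    for attempt in range(1, attempts + 1):
        try:
            return submit_fn()
        except RuntimeError as exc:
            # Assumption: the transient scheduling failure surfaces
            # as a RuntimeError from the submission wrapper.
            last_exc = exc
            if attempt < attempts:
                time.sleep(delay_s)
    raise last_exc
```

This does not address why the head node's supervisor-actor placement became infeasible, but it sidesteps the brief window where scheduling fails.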

Issue Severity

Medium: It is a significant difficulty but I can work around it.

spolcyn · May 16, 2023

This P2 issue has seen no activity in the past 2 years. It will be closed in 2 weeks as part of ongoing cleanup efforts.

Please comment and remove the pending-cleanup label if you believe this issue should remain open.

Thanks for contributing to Ray!

cszhu · Jun 17, 2025