Too many `ray job` related commands block newly arrived `ray job` commands
Although CPU and memory usage are low, the sky-spot-controller can still fail to accept new `ray job` commands when many `ray job` commands are already running and the jobs are stuck in INIT state. That causes `sky spot launch` to hang after setup completes.
One possible reason is that our skylet tries to update the status of the INIT jobs by querying ray with `ray job`, which opens a lot of connections to the ray dashboard at http://127.0.0.1:8265.
We can try to query redis directly. It's a hack, as it assumes `ray job` internals (redis key formats, etc.).
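A minimal sketch of what that hack could look like, assuming the cluster's GCS is still backed by redis and reachable on port 6379 (the port shown in the dashboard's `--gcs-address` flag below); the key pattern is purely a placeholder, not a documented Ray key format:

```python
# Hypothetical sketch only: the redis port and the '*Job*' key pattern are
# assumptions about Ray internals and may differ between Ray versions.
import redis

client = redis.Redis(host='127.0.0.1', port=6379)

# List keys that look job-related instead of going through the dashboard.
for key in client.scan_iter(match=b'*Job*'):
    print(key, client.type(key))
```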
Is this also related to the check interval being too short here?:
https://github.com/skypilot-org/skypilot/blob/78ce9adb5354b6a14d0ebe734747ead5839d0ad7/sky/skylet/subprocess_daemon.py#L55
> We can try to query redis directly. It's a hack, as it assumes `ray job` internals (redis key formats, etc.).
Good point! I am thinking that we can use `ray job list`, offered in ray==1.13, to query all the job statuses in one call, instead of our previous parallel per-job status fetching.
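A rough sketch of what a single-call query could look like, assuming the installed Ray exposes the job submission SDK as `ray.job_submission.JobSubmissionClient` with a `list_jobs()` method (the import path and the method are assumptions based on the documented SDK in later Ray versions; older releases may only provide the CLI):

```python
# Sketch only: assumes JobSubmissionClient.list_jobs() exists in the installed
# Ray version; the import path and return type may differ across versions.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient('http://127.0.0.1:8265')

# One HTTP call to the dashboard returns every job's status, instead of
# issuing a separate `ray job status` query per job.
for job in client.list_jobs():
    print(job)
```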
> Is this also related to the check interval being too short here?
That line only seems to be used in the on-prem setting, i.e. the `ray job`-related commands won't be called in the spot case.
I asked them to remove the JobUpdateEvent from the skylet and restart the controller, and it has now been running for about 2 hours and is still going (previously it would hang after half an hour).
https://github.com/skypilot-org/skypilot/blob/619655258e7b1479e5144d5a364257ac5dba5e4d/sky/skylet/skylet.py#L12
~~Therefore, the problem should be caused by the job status update.~~
The problem occurred again after 3 hours of running, so there seems to be another cause as well.
It seems `ray job` hangs even for a simple command:
```console
$ ray job list --address http://127.0.0.1:8265
Job submission server address: http://127.0.0.1:8265
```
After more testing on the sky-spot-controller, I found that the `_get_sdk_client` function hangs when creating the submission client.
https://github.com/ray-project/ray/blob/e4ce38d001dbbe09cd21c497fedd03d692b2be3e/dashboard/modules/job/cli.py#L16-L29
The hang is caused by this line: `r = self._do_request("GET", "/api/version")`
https://github.com/ray-project/ray/blob/e4ce38d001dbbe09cd21c497fedd03d692b2be3e/dashboard/modules/dashboard_sdk.py#L212
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.7/site-packages/ray/dashboard/modules/job/cli.py", line 32, in _get_sdk_client
    return JobSubmissionClient(address, create_cluster_if_needed)
  File "/opt/conda/lib/python3.7/site-packages/ray/dashboard/modules/job/sdk.py", line 72, in __init__
    version_error_message="Jobs API is not supported on the Ray "
  File "/opt/conda/lib/python3.7/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 212, in _check_connection_and_version
    r = self._do_request("GET", "/api/version")
  File "/opt/conda/lib/python3.7/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 249, in _do_request
    headers=self._headers,
  File "/opt/conda/lib/python3.7/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/opt/conda/lib/python3.7/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/requests/adapters.py", line 499, in send
    timeout=timeout,
  File "/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py", line 710, in urlopen
    chunked=chunked,
  File "/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/opt/conda/lib/python3.7/http/client.py", line 1373, in getresponse
    response.begin()
  File "/opt/conda/lib/python3.7/http/client.py", line 319, in begin
    version, status, reason = self._read_status()
  File "/opt/conda/lib/python3.7/http/client.py", line 280, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/opt/conda/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
KeyboardInterrupt
```
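One quick way to tell whether the dashboard's HTTP server itself is wedged (rather than the job SDK) is to hit the same `/api/version` endpoint directly with an explicit timeout. This is just a diagnostic sketch using `requests`, not skypilot or Ray code:

```python
# Diagnostic sketch: probe the endpoint that _check_connection_and_version
# queries, with a timeout so the check cannot hang indefinitely.
import requests

try:
    r = requests.get('http://127.0.0.1:8265/api/version', timeout=5)
    print(r.status_code, r.text)
except requests.exceptions.RequestException as e:
    # A timeout or connection error here means the dashboard is unresponsive.
    print('dashboard unresponsive:', e)
```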
After I kill the dashboard process manually and restart it with the same command, all the `ray job`-related commands work again:
```
/opt/conda/bin/python3.7 -u /opt/conda/lib/python3.7/site-packages/ray/dashboard/dashboard.py --host=localhost --port=8265 --port-retries=0 --temp-dir=/tmp/ray --log-dir=/tmp/ray/session_2022-08-18_04-09-07_685484_1420/logs --session-dir=/tmp/ray/session_2022-08-18_04-09-07_685484_1420 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=10.128.0.44:6379
```
It seems that ray's job_manager is buggy. Even though `ray.actor.exit_actor()` is called in `JobSupervisor.run` and the actor can no longer be found via `ray.get_actor` (which raises ValueError), an actor handle obtained earlier through `ray.get_actor` still appears to be usable for calling remote functions, i.e. `await job_supervisor.ping.remote()` keeps succeeding and the `_monitor_job` function ends up in an infinite loop.
https://github.com/ray-project/ray/blob/e4ce38d001dbbe09cd21c497fedd03d692b2be3e/dashboard/modules/job/job_manager.py#L332-L335
Here is code to reproduce it. Program 1:
```python
import os

import ray

ray.init('auto', namespace='test-actor')


class MyActor:
    def __init__(self):
        self.a = 1

    def ping(self):
        return os.getpid()

    def exit(self):
        ray.actor.exit_actor()


my_actor_cls = ray.remote(MyActor)
my_actor = my_actor_cls.options(
    lifetime='detached',
    name='my_actor',
    num_cpus=0,
).remote()
```
Program 2:
```python
import ray

ray.init('auto', namespace='test-actor')
actor = ray.get_actor('my_actor')
```
Program 3:
```python
import ray

ray.init('auto', namespace='test-actor')
actor = ray.get_actor('my_actor')
actor.exit.remote()
```
Then, in program 2, `actor.ping.remote()` is still callable, even though `ray.get_actor('my_actor')` can no longer find the actor.
After adding `ray.get()` around `actor.ping.remote()`, the error does appear, i.e. the `await` in the function will actually raise the exception. False alarm....
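For completeness, this is the check that exposed the false alarm, continuing program 2 after program 3 has exited the actor (I'd expect a RayActorError here, but the sketch catches the RayError base class to stay version-agnostic):

```python
# Continuing program 2 after program 3 has called actor.exit.remote():
# the bare remote call still hands back an ObjectRef, but resolving it with
# ray.get() surfaces the actor-death error.
import ray

ref = actor.ping.remote()   # Looks fine: an ObjectRef is returned.
try:
    ray.get(ref)            # Resolving the ref raises because the actor is dead.
except ray.exceptions.RayError as e:
    print('actor is gone:', type(e).__name__, e)
```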
I just found an easier way to reproduce the error:
```bash
for i in {1..1000}; do
  ray job submit --job-id $i-gcpuser-2 --address http://127.0.0.1:8265 --no-wait 'echo hi; sleep 800; echo bye'
  sleep 1
done
```
After several hundred jobs, `ray job list --address=http://127.0.0.1:8265` fails to connect to the dashboard.
According to the dashboard_agent.log, the raylet is dead:
```
2022-08-19 08:58:35,043 ERROR agent.py:150 -- Raylet is dead, exiting.
```
raylet.err shows:
```
terminate called after throwing an instance of 'boost::wrapexcept<boost::system::system_error>'
  what():  epoll: Too many open files
*** SIGABRT received at time=1660899119 on cpu 6 ***
[address_is_readable.cc : 96] RAW: Failed to create pipe, errno=24
[failure_signal_handler.cc : 331] RAW: Signal 6 raised at PC=0x7fd2602bd7bb while already in AbslFailureSignalHandler()
*** SIGABRT received at time=1660899119 on cpu 6 ***
[address_is_readable.cc : 96] RAW: Failed to create pipe, errno=24
[failure_signal_handler.cc : 331] RAW: Signal 6 raised at PC=0x7fd2602bd7bb while already in AbslFailureSignalHandler()
*** SIGABRT received at time=1660899119 on cpu 6 ***
[address_is_readable.cc : 96] RAW: Failed to create pipe, errno=24
```
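The `Too many open files` error suggests the raylet leaks file descriptors as jobs accumulate. Below is a small Linux-only diagnostic sketch to watch this while the submission loop above runs; it only reads `/proc` and takes the raylet pid as an argument (e.g. from `pgrep -f raylet`), so nothing in it is Ray or skypilot API:

```python
# Diagnostic sketch (Linux-only): count the open file descriptors of a process
# and show its "Max open files" limit by reading /proc/<pid>/fd and
# /proc/<pid>/limits.
import os
import sys

pid = int(sys.argv[1])  # e.g. the raylet pid, found via `pgrep -f raylet`

open_fds = len(os.listdir(f'/proc/{pid}/fd'))
with open(f'/proc/{pid}/limits') as f:
    limit_line = next(line for line in f if line.startswith('Max open files'))

print(f'pid {pid}: {open_fds} open fds; {limit_line.strip()}')
```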