
Too many `ray job` related commands block newly arrived `ray job`

Michaelvll opened this issue on Aug 17, 2022

Although not much CPU or memory is used, the sky-spot-controller can still fail to accept new ray job commands when many ray job commands are already running and the jobs are stuck in the INIT state. This causes sky spot launch to hang after setup completes.

One possible reason is that our skylet tries to update the status of the INIT jobs by querying ray with ray job, opening many connections to the ray dashboard at http://127.0.0.1:8265.
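For illustration only (this is not the actual skylet code), the per-job polling pattern looks roughly like this; every INIT job triggers its own ray job status call and therefore its own connection to the dashboard:

import subprocess
from concurrent.futures import ThreadPoolExecutor

DASHBOARD = 'http://127.0.0.1:8265'

def query_job_status(job_id: str) -> str:
    # Each call spawns a separate `ray job` CLI process, which opens its own
    # HTTP connection to the dashboard.
    proc = subprocess.run(
        ['ray', 'job', 'status', '--address', DASHBOARD, job_id],
        capture_output=True, text=True)
    return proc.stdout

def refresh_init_jobs(init_job_ids):
    # Polling many INIT jobs in parallel multiplies the dashboard connections.
    with ThreadPoolExecutor(max_workers=16) as pool:
        return list(pool.map(query_job_status, init_job_ids))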

Michaelvll avatar Aug 17 '22 07:08 Michaelvll

We can try to directly query against redis. It's a hack as it assumes ray job internals (redis key formats, etc.).

Is this also related to the check interval being too short here?:

https://github.com/skypilot-org/skypilot/blob/78ce9adb5354b6a14d0ebe734747ead5839d0ad7/sky/skylet/subprocess_daemon.py#L55

concretevitamin avatar Aug 17 '22 14:08 concretevitamin

We can try to directly query against redis. It's a hack as it assumes ray job internals (redis key formats, etc.).

Good point! I am thinking we can use ray job list (available in ray==1.13) to query all job statuses in one call, instead of our previous parallel per-job status fetching.
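A minimal sketch of that batched query, assuming we simply shell out to the CLI (output parsing is omitted since the format differs across Ray versions):

import subprocess

DASHBOARD = 'http://127.0.0.1:8265'

def list_all_job_statuses() -> str:
    # One call and one dashboard connection for all jobs, instead of one per job.
    proc = subprocess.run(
        ['ray', 'job', 'list', '--address', DASHBOARD],
        capture_output=True, text=True, timeout=60)
    return proc.stdout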

Is this also related to the check interval being too short here?

That line seems to be used only in the on-prem case, i.e. those ray job related commands won't be called in the spot case.

Michaelvll avatar Aug 17 '22 17:08 Michaelvll

I asked them to remove the JobUpdateEvent from the skylet and restart the controller; it has now been running for about 2 hours and is still going (previously it would hang after half an hour). https://github.com/skypilot-org/skypilot/blob/619655258e7b1479e5144d5a364257ac5dba5e4d/sky/skylet/skylet.py#L12

~~Therefore, the problem should be caused by the job status update.~~

The problem occurred again after 3 hours of running, so there seems to be another cause as well.

It seems ray job hangs even for a simple command:

ray job list --address http://127.0.0.1:8265
Job submission server address: http://127.0.0.1:8265

After more testing on the sky-spot-controller, I found that the _get_sdk_client function hangs when getting the submission client. https://github.com/ray-project/ray/blob/e4ce38d001dbbe09cd21c497fedd03d692b2be3e/dashboard/modules/job/cli.py#L16-L29

The hang comes from this line: r = self._do_request("GET", "/api/version") https://github.com/ray-project/ray/blob/e4ce38d001dbbe09cd21c497fedd03d692b2be3e/dashboard/modules/dashboard_sdk.py#L212

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.7/site-packages/ray/dashboard/modules/job/cli.py", line 32, in _get_sdk_client
    return JobSubmissionClient(address, create_cluster_if_needed)
  File "/opt/conda/lib/python3.7/site-packages/ray/dashboard/modules/job/sdk.py", line 72, in __init__
    version_error_message="Jobs API is not supported on the Ray "
  File "/opt/conda/lib/python3.7/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 212, in _check_connection_and_version
    r = self._do_request("GET", "/api/version")
  File "/opt/conda/lib/python3.7/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 249, in _do_request
    headers=self._headers,
  File "/opt/conda/lib/python3.7/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/opt/conda/lib/python3.7/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/requests/adapters.py", line 499, in send
    timeout=timeout,
  File "/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py", line 710, in urlopen
    chunked=chunked,
  File "/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/opt/conda/lib/python3.7/http/client.py", line 1373, in getresponse
    response.begin()
  File "/opt/conda/lib/python3.7/http/client.py", line 319, in begin
    version, status, reason = self._read_status()
  File "/opt/conda/lib/python3.7/http/client.py", line 280, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/opt/conda/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
KeyboardInterrupt
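Since the SDK request above blocks indefinitely once the dashboard is wedged, a simple way to probe the dashboard without hanging is to issue the same GET /api/version request with an explicit timeout (a diagnostic sketch, not something SkyPilot does today):

import requests

def dashboard_is_responsive(address: str = 'http://127.0.0.1:8265',
                            timeout: float = 5.0) -> bool:
    # Same endpoint the Jobs SDK hits in _check_connection_and_version, but
    # bounded by a timeout so a hung dashboard returns False instead of blocking.
    try:
        return requests.get(f'{address}/api/version', timeout=timeout).ok
    except requests.RequestException:
        return False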

After I killed the dashboard process manually and restarted it with the same command, all the ray job related commands worked again:

/opt/conda/bin/python3.7 -u /opt/conda/lib/python3.7/site-packages/ray/dashboard/dashboard.py --host=localhost --port=8265 --port-retries=0 --temp-dir=/tmp/ray --log-dir=/tmp/ray/session_2022-08-18_04-09-07_685484_1420/logs --session-dir=/tmp/ray/session_2022-08-18_04-09-07_685484_1420 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=10.128.0.44:6379
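For reference, a rough sketch of that manual workaround (assuming psutil is available; neither Ray nor SkyPilot does this automatically):

import subprocess
import psutil

def restart_dashboard() -> bool:
    for proc in psutil.process_iter(['pid', 'cmdline']):
        cmdline = proc.info['cmdline'] or []
        if any('ray/dashboard/dashboard.py' in arg for arg in cmdline):
            # Kill the hung dashboard process...
            proc.kill()
            proc.wait(timeout=10)
            # ...and relaunch it with exactly the same command line.
            subprocess.Popen(cmdline)
            return True
    return False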

Michaelvll avatar Aug 18 '22 06:08 Michaelvll

It seems that ray's job_manager is buggy. Even though ray.actor.exit_actor() is called in JobSupervisor.run and the actor can no longer be found with ray.get_actor (it raises ValueError), an actor handle already obtained from ray.get_actor can still be used to call remote functions, i.e. await job_supervisor.ping.remote() always appears to succeed and the _monitor_job function loops forever. https://github.com/ray-project/ray/blob/e4ce38d001dbbe09cd21c497fedd03d692b2be3e/dashboard/modules/job/job_manager.py#L332-L335

Here is code to reproduce it. Program 1:

import os
import ray

ray.init('auto', namespace='test-actor')

class MyActor:
    def __init__(self):
        self.a = 1

    def ping(self):
        return os.getpid()

    def exit(self):
        ray.actor.exit_actor()

my_actor_cls = ray.remote(MyActor)
my_actor = my_actor_cls.options(
    lifetime='detached',
    name='my_actor',
    num_cpus=0,
).remote()

Program 2:

import ray
ray.init('auto', namespace='test-actor')
actor = ray.get_actor('my_actor')

Program 3:

import ray
ray.init('auto', namespace='test-actor')
actor = ray.get_actor('my_actor')
actor.exit.remote()

Then in program 2, actor.ping.remote() still appears to work, even though ray.get_actor('my_actor') can no longer find the actor.

After adding ray.get() around actor.ping.remote(), the error does appear, i.e. the await in the function actually raises the exception. False alarm....
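In code, what the two comments above describe (run in program 2 after program 3 has made the actor exit):

ref = actor.ping.remote()  # Returns an ObjectRef with no error, which looked like a bug.
ray.get(ref)               # But this raises because the actor has exited, so the
                           # awaited ping in _monitor_job does fail after all.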

Michaelvll avatar Aug 19 '22 01:08 Michaelvll

Just found another easier way to reproduce the error:

for i in {1..1000}; do ray job submit --job-id $i-gcpuser-2 --address http://127.0.0.1:8265 --no-wait 'echo hi; sleep 800; echo bye'; sleep 1; done

After several hundred jobs, ray job list --address=http://127.0.0.1:8265 fails to connect to the dashboard.

According to the dashboard_agent.log, the raylet is dead.

2022-08-19 08:58:35,043 ERROR agent.py:150 -- Raylet is dead, exiting.

raylet.err shows:

terminate called after throwing an instance of 'boost::wrapexcept<boost::system::system_error>'
  what():  epoll: Too many open files
*** SIGABRT received at time=1660899119 on cpu 6 ***
[address_is_readable.cc : 96] RAW: Failed to create pipe, errno=24
[failure_signal_handler.cc : 331] RAW: Signal 6 raised at PC=0x7fd2602bd7bb while already in AbslFailureSignalHandler()
*** SIGABRT received at time=1660899119 on cpu 6 ***
[address_is_readable.cc : 96] RAW: Failed to create pipe, errno=24
[failure_signal_handler.cc : 331] RAW: Signal 6 raised at PC=0x7fd2602bd7bb while already in AbslFailureSignalHandler()
*** SIGABRT received at time=1660899119 on cpu 6 ***
[address_is_readable.cc : 96] RAW: Failed to create pipe, errno=24
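Since the crash points to leaked file descriptors, a small helper like the following (assuming psutil and a Linux host) can watch the open-fd counts of the raylet and dashboard processes while the submission loop above runs:

import psutil

def open_fd_counts(name_fragments=('raylet', 'dashboard.py')):
    counts = {}
    for proc in psutil.process_iter(['pid', 'name', 'cmdline']):
        cmd = ' '.join(proc.info['cmdline'] or [proc.info['name'] or ''])
        if any(frag in cmd for frag in name_fragments):
            try:
                counts[(proc.info['pid'], proc.info['name'])] = proc.num_fds()
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                pass
    return counts

print(open_fd_counts())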

Michaelvll avatar Aug 19 '22 08:08 Michaelvll