[Ray Client] - Client server failed with runtime_env container
What happened + What you expected to happen
Hi,
Even though runtime_env containers are still experimental, I've had success using them at the job level in Ray applications launched inside the cluster via job submission, i.e. the script that runs on the cluster calls ray.init(runtime_env={'container': ...}). That being said, I don't think there's anything wrong with the podman setup on my custom cluster images, which inherit from rayproject/ray:2.0.0-py38.
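For reference, a minimal sketch of the job-level pattern that works for me (the image and run_options mirror the reproduction script below; my real job scripts are more involved):

import ray

# Called from a script submitted with `ray job submit`, i.e. running inside the cluster.
# Workers for this job start inside the podman container as expected.
ray.init(runtime_env={
    'container': {
        'image': 'docker.io/rayproject/ray:2.0.0-py38',
        'run_options': ['--cgroups=enabled'],
    },
})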
However, using runtime_env containers with Ray Client for interactive development leads to the following error during initialization of the Ray Client server.
---------------------------------------------------------------------------
ConnectionAbortedError Traceback (most recent call last)
Cell In [2], line 3
1 import ray
----> 3 ray.init('ray://localhost:10001', runtime_env={
4 'container': {
5 'image': 'docker.io/rayproject/ray:2.0.0-py38',
6 'run_options': ['--cgroups=enabled'],
7 },
8 })
File /opt/miniconda3/envs/harmon/lib/python3.8/site-packages/ray/_private/client_mode_hook.py:105, in client_mode_hook.<locals>.wrapper(*args, **kwargs)
103 if func.__name__ != "init" or is_client_mode_enabled_by_default:
104 return getattr(ray, func.__name__)(*args, **kwargs)
--> 105 return func(*args, **kwargs)
File /opt/miniconda3/envs/harmon/lib/python3.8/site-packages/ray/_private/worker.py:1248, in init(address, num_cpus, num_gpus, resources, object_store_memory, local_mode, ignore_reinit_error, include_dashboard, dashboard_host, dashboard_port, job_config, configure_logging, logging_level, logging_format, log_to_driver, namespace, runtime_env, storage, **kwargs)
1246 passed_kwargs.update(kwargs)
1247 builder._init_args(**passed_kwargs)
-> 1248 ctx = builder.connect()
1249 from ray._private.usage import usage_lib
1251 if passed_kwargs.get("allow_multiple") is True:
File /opt/miniconda3/envs/harmon/lib/python3.8/site-packages/ray/client_builder.py:178, in ClientBuilder.connect(self)
175 if self._allow_multiple_connections:
176 old_ray_cxt = ray.util.client.ray.set_context(None)
--> 178 client_info_dict = ray.util.client_connect.connect(
179 self.address,
180 job_config=self._job_config,
181 _credentials=self._credentials,
182 ray_init_kwargs=self._remote_init_kwargs,
183 metadata=self._metadata,
184 )
185 get_dashboard_url = ray.remote(ray._private.worker.get_dashboard_url)
186 dashboard_url = ray.get(get_dashboard_url.options(num_cpus=0).remote())
File /opt/miniconda3/envs/harmon/lib/python3.8/site-packages/ray/util/client_connect.py:47, in connect(conn_str, secure, metadata, connection_retries, job_config, namespace, ignore_version, _credentials, ray_init_kwargs)
42 _explicitly_enable_client_mode()
44 # TODO(barakmich): https://github.com/ray-project/ray/issues/13274
45 # for supporting things like cert_path, ca_path, etc and creating
46 # the correct metadata
---> 47 conn = ray.connect(
48 conn_str,
49 job_config=job_config,
50 secure=secure,
51 metadata=metadata,
52 connection_retries=connection_retries,
53 namespace=namespace,
54 ignore_version=ignore_version,
55 _credentials=_credentials,
56 ray_init_kwargs=ray_init_kwargs,
57 )
58 return conn
File /opt/miniconda3/envs/harmon/lib/python3.8/site-packages/ray/util/client/__init__.py:252, in RayAPIStub.connect(self, *args, **kw_args)
250 def connect(self, *args, **kw_args):
251 self.get_context()._inside_client_test = self._inside_client_test
--> 252 conn = self.get_context().connect(*args, **kw_args)
253 global _lock, _all_contexts
254 with _lock:
File /opt/miniconda3/envs/harmon/lib/python3.8/site-packages/ray/util/client/__init__.py:102, in _ClientContext.connect(self, conn_str, job_config, secure, metadata, connection_retries, namespace, ignore_version, _credentials, ray_init_kwargs)
94 self.client_worker = Worker(
95 conn_str,
96 secure=secure,
(...)
99 connection_retries=connection_retries,
100 )
101 self.api.worker = self.client_worker
--> 102 self.client_worker._server_init(job_config, ray_init_kwargs)
103 conn_info = self.client_worker.connection_info()
104 self._check_versions(conn_info, ignore_version)
File /opt/miniconda3/envs/harmon/lib/python3.8/site-packages/ray/util/client/worker.py:838, in Worker._server_init(self, job_config, ray_init_kwargs)
830 response = self.data_client.Init(
831 ray_client_pb2.InitRequest(
832 job_config=serialized_job_config,
(...)
835 )
836 )
837 if not response.ok:
--> 838 raise ConnectionAbortedError(
839 f"Initialization failure from server:\n{response.msg}"
840 )
842 except grpc.RpcError as e:
843 raise decode_exception(e)
ConnectionAbortedError: Initialization failure from server:
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/proxier.py", line 685, in Datapath
raise RuntimeError(
RuntimeError: Starting Ray client server failed. See ray_client_server_23000.err for detailed logs.
The file ray_client_server_23000.err contains:
Trying to pull docker.io/rayproject/ray:2.0.0-py38...
Getting image source signatures
Copying blob sha256:d8135c8d3f0ebe84b529d185558505d5dd4b524e282c17b6152aba56b02ed31e
Copying blob sha256:f0d19e69127971cff8b7bfbbe024890de117604b5861e2b106da8cfd3fb81d53
Copying blob sha256:cde2dbf8dc867dda82c869f13f50d1d88a854128ab07916e9df3d45086b1aca3
Copying blob sha256:3b65ec22a9e96affe680712973e88355927506aa3f792ff03330f3a3eb601a98
Copying blob sha256:87f7a5ff197c9418519c096f1f7aa5afceac82f8ada0df33a21a384d55acde5f
Copying blob sha256:8a0031b53b4d14665f9c7ab891ece272998721af9b0d969924d88fc9408ed57c
Copying blob sha256:3b65ec22a9e96affe680712973e88355927506aa3f792ff03330f3a3eb601a98
Copying blob sha256:87f7a5ff197c9418519c096f1f7aa5afceac82f8ada0df33a21a384d55acde5f
Copying blob sha256:8a0031b53b4d14665f9c7ab891ece272998721af9b0d969924d88fc9408ed57c
Copying blob sha256:cde2dbf8dc867dda82c869f13f50d1d88a854128ab07916e9df3d45086b1aca3
Copying blob sha256:d8135c8d3f0ebe84b529d185558505d5dd4b524e282c17b6152aba56b02ed31e
Copying blob sha256:f0d19e69127971cff8b7bfbbe024890de117604b5861e2b106da8cfd3fb81d53
Copying blob sha256:57c67e634ccf3c72945b4da73023e28c0efaae0fa95c8c1644180bd9df46be68
Copying blob sha256:57c67e634ccf3c72945b4da73023e28c0efaae0fa95c8c1644180bd9df46be68
Copying blob sha256:aea4f35623b6f74ffaaf14a60cf010fa0c69942480aeeb34853366ad58fd4c00
Copying blob sha256:aea4f35623b6f74ffaaf14a60cf010fa0c69942480aeeb34853366ad58fd4c00
Copying blob sha256:78f7682f5042b61bad31612b833dde54498ffcebcd18057bcff8255687020ba7
Copying blob sha256:78f7682f5042b61bad31612b833dde54498ffcebcd18057bcff8255687020ba7
Copying config sha256:c3b4447db3d173fcc94d5736ee633a6223ef07efc15a2ba1c69a34f673f6c299
Writing manifest to image destination
Storing signatures
2022-10-31 05:37:33,217 INFO server.py:875 -- Starting Ray Client server on 0.0.0.0:23000
2022-10-31 05:37:38,239 INFO server.py:922 -- 25 idle checks before shutdown.
2022-10-31 05:37:43,249 INFO server.py:922 -- 20 idle checks before shutdown.
2022-10-31 05:37:48,260 INFO server.py:922 -- 15 idle checks before shutdown.
2022-10-31 05:37:53,272 INFO server.py:922 -- 10 idle checks before shutdown.
2022-10-31 05:37:58,282 INFO server.py:922 -- 5 idle checks before shutdown.
I can find more info in ray_client_server.err:
2022-10-31 05:36:33,435 INFO server.py:875 -- Starting Ray Client server on 0.0.0.0:10001
2022-10-31 05:36:48,552 INFO proxier.py:670 -- New data connection from client 71aa1ee5efa1441b937aecb493ed977f:
2022-10-31 05:36:48,566 INFO proxier.py:229 -- Increasing runtime env reference for ray_client_server_23000.Serialized runtime env is {"container": {"image": "docker.io/rayproject/ray:2.0.0-py38", "run_options": ["--cgroups=enabled"]}}.
2022-10-31 05:38:03,708 ERROR proxier.py:332 -- SpecificServer startup failed for client: 71aa1ee5efa1441b937aecb493ed977f
2022-10-31 05:38:03,708 INFO proxier.py:340 -- SpecificServer started on port: 23000 with PID: 229 for client: 71aa1ee5efa1441b937aecb493ed977f
2022-10-31 05:38:03,708 ERROR proxier.py:681 -- Server startup failed for client: 71aa1ee5efa1441b937aecb493ed977f, using JobConfig: <ray.job_config.JobConfig object at 0x7f85ec1ee460>!
2022-10-31 05:38:03,709 INFO proxier.py:390 -- Specific server 71aa1ee5efa1441b937aecb493ed977f is no longer running, freeing its port 23000
2022-10-31 05:38:33,710 ERROR proxier.py:379 -- Timeout waiting for channel for 71aa1ee5efa1441b937aecb493ed977f
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/proxier.py", line 374, in get_channel
grpc.channel_ready_future(server.channel).result(
File "/home/ray/anaconda3/lib/python3.8/site-packages/grpc/_utilities.py", line 139, in result
self._block(timeout)
File "/home/ray/anaconda3/lib/python3.8/site-packages/grpc/_utilities.py", line 85, in _block
raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError
2022-10-31 05:38:33,711 WARNING proxier.py:777 -- Retrying Logstream connection. 1 attempts failed.
2022-10-31 05:38:33,712 INFO proxier.py:742 -- 71aa1ee5efa1441b937aecb493ed977f last started stream at 1667219808.5511196. Current stream started at 1667219808.5511196.
2022-10-31 05:38:35,713 ERROR proxier.py:350 -- Unable to find channel for client: 71aa1ee5efa1441b937aecb493ed977f
2022-10-31 05:38:35,713 WARNING proxier.py:777 -- Retrying Logstream connection. 2 attempts failed.
2022-10-31 05:38:37,715 ERROR proxier.py:350 -- Unable to find channel for client: 71aa1ee5efa1441b937aecb493ed977f
2022-10-31 05:38:37,715 WARNING proxier.py:777 -- Retrying Logstream connection. 3 attempts failed.
2022-10-31 05:38:39,717 ERROR proxier.py:350 -- Unable to find channel for client: 71aa1ee5efa1441b937aecb493ed977f
2022-10-31 05:38:39,717 WARNING proxier.py:777 -- Retrying Logstream connection. 4 attempts failed.
2022-10-31 05:38:41,719 ERROR proxier.py:350 -- Unable to find channel for client: 71aa1ee5efa1441b937aecb493ed977f
2022-10-31 05:38:41,719 WARNING proxier.py:777 -- Retrying Logstream connection. 5 attempts failed
Also, in runtime_env_setup-ray_client_server_23000.log I could find:
2022-10-31 05:36:48,569 INFO container.py:47 -- start worker in container with prefix: podman run -v /tmp/ray:/tmp/ray --cgroup-manager=cgroupfs --network=host --pid=host --ipc=host --env-host --env RAY_RAYLET_PID=154 --cgroups=enabled --entrypoint python docker.io/rayproject/ray:2.0.0-py38
I think this issue is related to the connection between the client proxy and the client server, which seems to run inside the container; however, as shown in the logs above, the container is created with the --network=host flag. I wonder if someone from the Ray team could point me towards a workaround, or some documentation regarding the setup of the client servers, as I am willing to contribute.
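For what it's worth, a minimal check (hypothetical, not part of Ray) that one could run on the head node to see whether anything is listening on the SpecificServer port reported in the logs (23000 here):

import socket

# Try to open a TCP connection to the port the SpecificServer was assigned.
# If this fails while the podman container is running, the gRPC channel_ready_future
# timeout seen in the proxier logs above is expected.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.settimeout(5)
    result = s.connect_ex(('127.0.0.1', 23000))
    print('port 23000 open' if result == 0 else f'connect failed (errno {result})')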
Regarding issue severity, I'll leave it at Medium since my only alternatives are:
- Pack everything in the cluster image, which is a bit limiting for my setup
- Use conda and wait up to 10 minutes for dependency install (see the sketch after this list)
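A minimal sketch of that conda alternative, assuming an illustrative package list (the actual dependencies in my setup differ):

import ray

# Hypothetical dependency list for illustration only; conda resolves and installs
# this environment on the cluster, which is what takes up to ~10 minutes.
ray.init('ray://localhost:10001', runtime_env={
    'conda': {
        'dependencies': ['pip', {'pip': ['torch', 'pandas']}],
    },
})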
Thanks.
Versions / Dependencies
About Ray:
ray[default]==2.0.0
kuberay-operator: kuberay/operator:v0.3.0
Podman installed on the cluster base image:
(base) ray@lany-cluster-head-bvkg6:~$ podman info
host:
arch: amd64
buildahVersion: 1.23.1
cgroupControllers: []
cgroupManager: cgroupfs
cgroupVersion: v1
conmon:
package: 'conmon: /usr/libexec/podman/conmon'
path: /usr/libexec/podman/conmon
version: 'conmon version 2.1.2, commit: '
cpus: 8
distribution:
codename: focal
distribution: ubuntu
version: "20.04"
eventLogger: file
hostname: lany-cluster-head-bvkg6
idMappings:
gidmap:
- container_id: 0
host_id: 100
size: 1
- container_id: 1
host_id: 100000
size: 65536
uidmap:
- container_id: 0
host_id: 1000
size: 1
- container_id: 1
host_id: 100000
size: 65536
kernel: 5.10.133+
linkmode: dynamic
logDriver: k8s-file
memFree: 27025526784
memTotal: 33671999488
ociRuntime:
name: crun
package: 'crun: /usr/bin/crun'
path: /usr/bin/crun
version: |-
crun version UNKNOWN
commit: ea1fe3938eefa14eb707f1d22adff4db670645d6
spec: 1.0.0
+SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
os: linux
remoteSocket:
path: /tmp/podman-run-1000/podman/podman.sock
security:
apparmorEnabled: false
capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
rootless: true
seccompEnabled: true
seccompProfilePath: /usr/share/containers/seccomp.json
selinuxEnabled: false
serviceIsRemote: false
slirp4netns:
executable: /usr/bin/slirp4netns
package: 'slirp4netns: /usr/bin/slirp4netns'
version: |-
slirp4netns version 1.1.8
commit: unknown
libslirp: 4.3.1-git
SLIRP_CONFIG_VERSION_MAX: 3
libseccomp: 2.4.3
swapFree: 0
swapTotal: 0
uptime: 283h 18m 10.55s (Approximately 11.79 days)
plugins:
log:
- k8s-file
- none
- journald
network:
- bridge
- macvlan
volume:
- local
registries:
search:
- docker.io
- quay.io
store:
configFile: /home/ray/.config/containers/storage.conf
containerStore:
number: 1
paused: 0
running: 0
stopped: 1
graphDriverName: overlay
graphOptions:
overlay.mount_program:
Executable: /usr/bin/fuse-overlayfs
Package: 'fuse-overlayfs: /usr/bin/fuse-overlayfs'
Version: |-
fusermount3 version: 3.9.0
fuse-overlayfs: version 1.5
FUSE library version 3.9.0
using FUSE kernel interface version 7.31
graphRoot: /home/ray/.local/share/containers/storage
graphStatus:
Backing Filesystem: overlayfs
Native Overlay Diff: "false"
Supports d_type: "true"
Using metacopy: "false"
imageStore:
number: 1
runRoot: /tmp/podman-run-1000/containers
volumePath: /home/ray/.local/share/containers/storage/volumes
version:
APIVersion: 3.4.2
Built: 0
BuiltTime: Wed Dec 31 16:00:00 1969
GitCommit: ""
GoVersion: go1.15.2
OsArch: linux/amd64
Version: 3.4.2
Reproduction script
import ray
ray.init('ray://localhost:10001', runtime_env={
'container': {
'image': 'docker.io/rayproject/ray:2.0.0-py38',
'run_options': ['--cgroups=enabled'],
},
})
Issue Severity
Medium: It is a significant difficulty but I can work around it.