
[Ray Client] - Client server failed with runtime_env container

igorgad opened this issue on Oct 31, 2022 · 9 comments

What happened + What you expected to happen

Hi,

Even though runtime_env containers are still experimental, I've had success using them at the job level in Ray applications launched inside the cluster via job submission, i.e. the script that runs on the cluster calls ray.init(runtime_env={'container': ...}) (roughly the pattern sketched below). Given that, I don't think there's anything wrong with the podman setup on my custom cluster images, which inherit from rayproject/ray:2.0.0-py38.
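For reference, this is a minimal sketch of the job-level pattern that works for me when the script is submitted to run on the cluster; the image and run options mirror the reproduction script below, and the remote task is only an illustrative sanity check.

# Minimal job-level sketch: this script runs on the cluster via job submission.
# Image and run options are the same as in the reproduction script below;
# the remote task is just an illustrative check.
import ray

ray.init(runtime_env={
    'container': {
        'image': 'docker.io/rayproject/ray:2.0.0-py38',
        'run_options': ['--cgroups=enabled'],
    },
})

@ray.remote
def python_version():
    import platform
    return platform.python_version()

print(ray.get(python_version.remote()))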

However, using runtime_env containers with the Ray client for interactive development leads to the following error during initialization of the Ray client server:

---------------------------------------------------------------------------
ConnectionAbortedError                    Traceback (most recent call last)
Cell In [2], line 3
      1 import ray
----> 3 ray.init('ray://localhost:10001', runtime_env={
      4     'container': {
      5             'image': 'docker.io/rayproject/ray:2.0.0-py38',
      6             'run_options': ['--cgroups=enabled'],
      7         },
      8 })

File /opt/miniconda3/envs/harmon/lib/python3.8/site-packages/ray/_private/client_mode_hook.py:105, in client_mode_hook.<locals>.wrapper(*args, **kwargs)
    103     if func.__name__ != "init" or is_client_mode_enabled_by_default:
    104         return getattr(ray, func.__name__)(*args, **kwargs)
--> 105 return func(*args, **kwargs)

File /opt/miniconda3/envs/harmon/lib/python3.8/site-packages/ray/_private/worker.py:1248, in init(address, num_cpus, num_gpus, resources, object_store_memory, local_mode, ignore_reinit_error, include_dashboard, dashboard_host, dashboard_port, job_config, configure_logging, logging_level, logging_format, log_to_driver, namespace, runtime_env, storage, **kwargs)
   1246 passed_kwargs.update(kwargs)
   1247 builder._init_args(**passed_kwargs)
-> 1248 ctx = builder.connect()
   1249 from ray._private.usage import usage_lib
   1251 if passed_kwargs.get("allow_multiple") is True:

File /opt/miniconda3/envs/harmon/lib/python3.8/site-packages/ray/client_builder.py:178, in ClientBuilder.connect(self)
    175 if self._allow_multiple_connections:
    176     old_ray_cxt = ray.util.client.ray.set_context(None)
--> 178 client_info_dict = ray.util.client_connect.connect(
    179     self.address,
    180     job_config=self._job_config,
    181     _credentials=self._credentials,
    182     ray_init_kwargs=self._remote_init_kwargs,
    183     metadata=self._metadata,
    184 )
    185 get_dashboard_url = ray.remote(ray._private.worker.get_dashboard_url)
    186 dashboard_url = ray.get(get_dashboard_url.options(num_cpus=0).remote())

File /opt/miniconda3/envs/harmon/lib/python3.8/site-packages/ray/util/client_connect.py:47, in connect(conn_str, secure, metadata, connection_retries, job_config, namespace, ignore_version, _credentials, ray_init_kwargs)
     42 _explicitly_enable_client_mode()
     44 # TODO(barakmich): https://github.com/ray-project/ray/issues/13274
     45 # for supporting things like cert_path, ca_path, etc and creating
     46 # the correct metadata
---> 47 conn = ray.connect(
     48     conn_str,
     49     job_config=job_config,
     50     secure=secure,
     51     metadata=metadata,
     52     connection_retries=connection_retries,
     53     namespace=namespace,
     54     ignore_version=ignore_version,
     55     _credentials=_credentials,
     56     ray_init_kwargs=ray_init_kwargs,
     57 )
     58 return conn

File /opt/miniconda3/envs/harmon/lib/python3.8/site-packages/ray/util/client/__init__.py:252, in RayAPIStub.connect(self, *args, **kw_args)
    250 def connect(self, *args, **kw_args):
    251     self.get_context()._inside_client_test = self._inside_client_test
--> 252     conn = self.get_context().connect(*args, **kw_args)
    253     global _lock, _all_contexts
    254     with _lock:

File /opt/miniconda3/envs/harmon/lib/python3.8/site-packages/ray/util/client/__init__.py:102, in _ClientContext.connect(self, conn_str, job_config, secure, metadata, connection_retries, namespace, ignore_version, _credentials, ray_init_kwargs)
     94 self.client_worker = Worker(
     95     conn_str,
     96     secure=secure,
   (...)
     99     connection_retries=connection_retries,
    100 )
    101 self.api.worker = self.client_worker
--> 102 self.client_worker._server_init(job_config, ray_init_kwargs)
    103 conn_info = self.client_worker.connection_info()
    104 self._check_versions(conn_info, ignore_version)

File /opt/miniconda3/envs/harmon/lib/python3.8/site-packages/ray/util/client/worker.py:838, in Worker._server_init(self, job_config, ray_init_kwargs)
    830     response = self.data_client.Init(
    831         ray_client_pb2.InitRequest(
    832             job_config=serialized_job_config,
   (...)
    835         )
    836     )
    837     if not response.ok:
--> 838         raise ConnectionAbortedError(
    839             f"Initialization failure from server:\n{response.msg}"
    840         )
    842 except grpc.RpcError as e:
    843     raise decode_exception(e)

ConnectionAbortedError: Initialization failure from server:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/proxier.py", line 685, in Datapath
    raise RuntimeError(
RuntimeError: Starting Ray client server failed. See ray_client_server_23000.err for detailed logs.

The file ray_client_server_23000.err contains:

Trying to pull docker.io/rayproject/ray:2.0.0-py38...
Getting image source signatures
Copying blob sha256:d8135c8d3f0ebe84b529d185558505d5dd4b524e282c17b6152aba56b02ed31e
Copying blob sha256:f0d19e69127971cff8b7bfbbe024890de117604b5861e2b106da8cfd3fb81d53
Copying blob sha256:cde2dbf8dc867dda82c869f13f50d1d88a854128ab07916e9df3d45086b1aca3
Copying blob sha256:3b65ec22a9e96affe680712973e88355927506aa3f792ff03330f3a3eb601a98
Copying blob sha256:87f7a5ff197c9418519c096f1f7aa5afceac82f8ada0df33a21a384d55acde5f
Copying blob sha256:8a0031b53b4d14665f9c7ab891ece272998721af9b0d969924d88fc9408ed57c
Copying blob sha256:3b65ec22a9e96affe680712973e88355927506aa3f792ff03330f3a3eb601a98
Copying blob sha256:87f7a5ff197c9418519c096f1f7aa5afceac82f8ada0df33a21a384d55acde5f
Copying blob sha256:8a0031b53b4d14665f9c7ab891ece272998721af9b0d969924d88fc9408ed57c
Copying blob sha256:cde2dbf8dc867dda82c869f13f50d1d88a854128ab07916e9df3d45086b1aca3
Copying blob sha256:d8135c8d3f0ebe84b529d185558505d5dd4b524e282c17b6152aba56b02ed31e
Copying blob sha256:f0d19e69127971cff8b7bfbbe024890de117604b5861e2b106da8cfd3fb81d53
Copying blob sha256:57c67e634ccf3c72945b4da73023e28c0efaae0fa95c8c1644180bd9df46be68
Copying blob sha256:57c67e634ccf3c72945b4da73023e28c0efaae0fa95c8c1644180bd9df46be68
Copying blob sha256:aea4f35623b6f74ffaaf14a60cf010fa0c69942480aeeb34853366ad58fd4c00
Copying blob sha256:aea4f35623b6f74ffaaf14a60cf010fa0c69942480aeeb34853366ad58fd4c00
Copying blob sha256:78f7682f5042b61bad31612b833dde54498ffcebcd18057bcff8255687020ba7
Copying blob sha256:78f7682f5042b61bad31612b833dde54498ffcebcd18057bcff8255687020ba7
Copying config sha256:c3b4447db3d173fcc94d5736ee633a6223ef07efc15a2ba1c69a34f673f6c299
Writing manifest to image destination
Storing signatures
2022-10-31 05:37:33,217	INFO server.py:875 -- Starting Ray Client server on 0.0.0.0:23000
2022-10-31 05:37:38,239	INFO server.py:922 -- 25 idle checks before shutdown.
2022-10-31 05:37:43,249	INFO server.py:922 -- 20 idle checks before shutdown.
2022-10-31 05:37:48,260	INFO server.py:922 -- 15 idle checks before shutdown.
2022-10-31 05:37:53,272	INFO server.py:922 -- 10 idle checks before shutdown.
2022-10-31 05:37:58,282	INFO server.py:922 -- 5 idle checks before shutdown.

There is more info in ray_client_server.err:

2022-10-31 05:36:33,435	INFO server.py:875 -- Starting Ray Client server on 0.0.0.0:10001
2022-10-31 05:36:48,552	INFO proxier.py:670 -- New data connection from client 71aa1ee5efa1441b937aecb493ed977f: 
2022-10-31 05:36:48,566	INFO proxier.py:229 -- Increasing runtime env reference for ray_client_server_23000.Serialized runtime env is {"container": {"image": "docker.io/rayproject/ray:2.0.0-py38", "run_options": ["--cgroups=enabled"]}}.
2022-10-31 05:38:03,708	ERROR proxier.py:332 -- SpecificServer startup failed for client: 71aa1ee5efa1441b937aecb493ed977f
2022-10-31 05:38:03,708	INFO proxier.py:340 -- SpecificServer started on port: 23000 with PID: 229 for client: 71aa1ee5efa1441b937aecb493ed977f
2022-10-31 05:38:03,708	ERROR proxier.py:681 -- Server startup failed for client: 71aa1ee5efa1441b937aecb493ed977f, using JobConfig: <ray.job_config.JobConfig object at 0x7f85ec1ee460>!
2022-10-31 05:38:03,709	INFO proxier.py:390 -- Specific server 71aa1ee5efa1441b937aecb493ed977f is no longer running, freeing its port 23000
2022-10-31 05:38:33,710	ERROR proxier.py:379 -- Timeout waiting for channel for 71aa1ee5efa1441b937aecb493ed977f
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/proxier.py", line 374, in get_channel
    grpc.channel_ready_future(server.channel).result(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/grpc/_utilities.py", line 139, in result
    self._block(timeout)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/grpc/_utilities.py", line 85, in _block
    raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError
2022-10-31 05:38:33,711	WARNING proxier.py:777 -- Retrying Logstream connection. 1 attempts failed.
2022-10-31 05:38:33,712	INFO proxier.py:742 -- 71aa1ee5efa1441b937aecb493ed977f last started stream at 1667219808.5511196. Current stream started at 1667219808.5511196.
2022-10-31 05:38:35,713	ERROR proxier.py:350 -- Unable to find channel for client: 71aa1ee5efa1441b937aecb493ed977f
2022-10-31 05:38:35,713	WARNING proxier.py:777 -- Retrying Logstream connection. 2 attempts failed.
2022-10-31 05:38:37,715	ERROR proxier.py:350 -- Unable to find channel for client: 71aa1ee5efa1441b937aecb493ed977f
2022-10-31 05:38:37,715	WARNING proxier.py:777 -- Retrying Logstream connection. 3 attempts failed.
2022-10-31 05:38:39,717	ERROR proxier.py:350 -- Unable to find channel for client: 71aa1ee5efa1441b937aecb493ed977f
2022-10-31 05:38:39,717	WARNING proxier.py:777 -- Retrying Logstream connection. 4 attempts failed.
2022-10-31 05:38:41,719	ERROR proxier.py:350 -- Unable to find channel for client: 71aa1ee5efa1441b937aecb493ed977f
2022-10-31 05:38:41,719	WARNING proxier.py:777 -- Retrying Logstream connection. 5 attempts failed

Also, in runtime_env_setup-ray_client_server_23000.log I found:

2022-10-31 05:36:48,569	INFO container.py:47 -- start worker in container with prefix: podman run -v /tmp/ray:/tmp/ray --cgroup-manager=cgroupfs --network=host --pid=host --ipc=host --env-host --env RAY_RAYLET_PID=154 --cgroups=enabled --entrypoint python docker.io/rayproject/ray:2.0.0-py38

I think this issue is related to the connection between the client proxy and the client server, which appears to run inside the container; however, as shown in the logs, the container is created with the --network=host flag. A connectivity check I can run from the head node is sketched below. I wonder if someone from the Ray team could point me towards a workaround, or to some documentation on how the client servers are set up, as I am willing to contribute.
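To narrow this down, here is a minimal sketch, assuming the failing SpecificServer listens on port 23000 as in my logs, of the same readiness check that proxier.py performs; if this also times out from the head node, the containerized client server is most likely unreachable despite --network=host.

# Sketch of the readiness check proxier.get_channel performs, run manually
# from the head node. Port 23000 is taken from my logs; adjust as needed.
import grpc

channel = grpc.insecure_channel('localhost:23000')
try:
    grpc.channel_ready_future(channel).result(timeout=30)
    print('client server channel is ready')
except grpc.FutureTimeoutError:
    # Same failure mode reported in ray_client_server.err
    print('timed out waiting for the client server channel')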

Regarding issue severity, I'll leave it at Medium, since my only alternatives are:

  • Pack everything into the cluster image, which is a bit limiting for my setup
  • Use a conda runtime_env and wait up to 10 minutes for the dependency install (roughly the pattern sketched after this list)
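For completeness, this is a minimal sketch of the conda fallback I mean; the dependency list is only a placeholder for my actual packages.

# The conda-based fallback; the dependencies below are placeholders
# for my real environment.
import ray

ray.init('ray://localhost:10001', runtime_env={
    'conda': {
        'dependencies': ['pip', {'pip': ['torch', 'torchvision']}],
    },
})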

Thanks!

Versions / Dependencies

Ray and KubeRay versions:

ray[default]==2.0.0
kuberay-operator: kuberay/operator:v0.3.0

Podman is installed on the cluster base image:

(base) ray@lany-cluster-head-bvkg6:~$ podman info
host:
  arch: amd64
  buildahVersion: 1.23.1
  cgroupControllers: []
  cgroupManager: cgroupfs
  cgroupVersion: v1
  conmon:
    package: 'conmon: /usr/libexec/podman/conmon'
    path: /usr/libexec/podman/conmon
    version: 'conmon version 2.1.2, commit: '
  cpus: 8
  distribution:
    codename: focal
    distribution: ubuntu
    version: "20.04"
  eventLogger: file
  hostname: lany-cluster-head-bvkg6
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 100
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 5.10.133+
  linkmode: dynamic
  logDriver: k8s-file
  memFree: 27025526784
  memTotal: 33671999488
  ociRuntime:
    name: crun
    package: 'crun: /usr/bin/crun'
    path: /usr/bin/crun
    version: |-
      crun version UNKNOWN
      commit: ea1fe3938eefa14eb707f1d22adff4db670645d6
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    path: /tmp/podman-run-1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: 'slirp4netns: /usr/bin/slirp4netns'
    version: |-
      slirp4netns version 1.1.8
      commit: unknown
      libslirp: 4.3.1-git
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.4.3
  swapFree: 0
  swapTotal: 0
  uptime: 283h 18m 10.55s (Approximately 11.79 days)
plugins:
  log:
  - k8s-file
  - none
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries:
  search:
  - docker.io
  - quay.io
store:
  configFile: /home/ray/.config/containers/storage.conf
  containerStore:
    number: 1
    paused: 0
    running: 0
    stopped: 1
  graphDriverName: overlay
  graphOptions:
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: 'fuse-overlayfs: /usr/bin/fuse-overlayfs'
      Version: |-
        fusermount3 version: 3.9.0
        fuse-overlayfs: version 1.5
        FUSE library version 3.9.0
        using FUSE kernel interface version 7.31
  graphRoot: /home/ray/.local/share/containers/storage
  graphStatus:
    Backing Filesystem: overlayfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 1
  runRoot: /tmp/podman-run-1000/containers
  volumePath: /home/ray/.local/share/containers/storage/volumes
version:
  APIVersion: 3.4.2
  Built: 0
  BuiltTime: Wed Dec 31 16:00:00 1969
  GitCommit: ""
  GoVersion: go1.15.2
  OsArch: linux/amd64
  Version: 3.4.2

Reproduction script

import ray

ray.init('ray://localhost:10001', runtime_env={
    'container': {
            'image': 'docker.io/rayproject/ray:2.0.0-py38',
            'run_options': ['--cgroups=enabled'],
        },
})

Issue Severity

Medium: It is a significant difficulty but I can work around it.
