
[Bug] Uncommon connection error `JobConfig has no attribute '_parsed_runtime_env'` when connecting to a Ray cluster created by KubeRay

Open scarlet25151 opened this issue 3 years ago • 5 comments

Search before asking

  • [X] I searched the issues and found no similar issues.

KubeRay Component

Others

What happened + What you expected to happen

  1. When I connect to a Ray cluster created from the sample kuberay/ray-operator/config/samples/ray-cluster.mini.yaml using ray.init("ray://127.0.0.1:10001"), a connection error is raised showing that an attribute is missing:
---------------------------------------------------------------------------
ConnectionAbortedError                    Traceback (most recent call last)
/var/folders/nr/8n943fms20v157b066tvxrbw0000gp/T/ipykernel_46851/3704460965.py in <module>
----> 1 ray.init("ray://127.0.0.1:10001")

/opt/miniconda3/lib/python3.8/site-packages/ray/_private/client_mode_hook.py in wrapper(*args, **kwargs)
    103             if func.__name__ != "init" or is_client_mode_enabled_by_default:
    104                 return getattr(ray, func.__name__)(*args, **kwargs)
--> 105         return func(*args, **kwargs)
    106 
    107     return wrapper

/opt/miniconda3/lib/python3.8/site-packages/ray/worker.py in init(address, num_cpus, num_gpus, resources, object_store_memory, local_mode, ignore_reinit_error, include_dashboard, dashboard_host, dashboard_port, job_config, configure_logging, logging_level, logging_format, log_to_driver, namespace, runtime_env, _enable_object_reconstruction, _redis_max_memory, _plasma_directory, _node_ip_address, _driver_object_store_memory, _memory, _redis_password, _temp_dir, _metrics_export_port, _system_config, _tracing_startup_hook, **kwargs)
    773         passed_kwargs.update(kwargs)
    774         builder._init_args(**passed_kwargs)
--> 775         return builder.connect()
    776 
    777     if kwargs:

/opt/miniconda3/lib/python3.8/site-packages/ray/client_builder.py in connect(self)
    149             old_ray_cxt = ray.util.client.ray.set_context(None)
    150 
--> 151         client_info_dict = ray.util.client_connect.connect(
    152             self.address,
    153             job_config=self._job_config,

/opt/miniconda3/lib/python3.8/site-packages/ray/util/client_connect.py in connect(conn_str, secure, metadata, connection_retries, job_config, namespace, ignore_version, _credentials, ray_init_kwargs)
     31     # for supporting things like cert_path, ca_path, etc and creating
     32     # the correct metadata
---> 33     conn = ray.connect(
     34         conn_str,
     35         job_config=job_config,

/opt/miniconda3/lib/python3.8/site-packages/ray/util/client/__init__.py in connect(self, *args, **kw_args)
    226     def connect(self, *args, **kw_args):
    227         self.get_context()._inside_client_test = self._inside_client_test
--> 228         conn = self.get_context().connect(*args, **kw_args)
    229         global _lock, _all_contexts
    230         with _lock:

/opt/miniconda3/lib/python3.8/site-packages/ray/util/client/__init__.py in connect(self, conn_str, job_config, secure, metadata, connection_retries, namespace, ignore_version, _credentials, ray_init_kwargs)
     86                 connection_retries=connection_retries)
     87             self.api.worker = self.client_worker
---> 88             self.client_worker._server_init(job_config, ray_init_kwargs)
     89             conn_info = self.client_worker.connection_info()
     90             self._check_versions(conn_info, ignore_version)

/opt/miniconda3/lib/python3.8/site-packages/ray/util/client/worker.py in _server_init(self, job_config, ray_init_kwargs)
    695                     reconnect_grace_period=self._reconnect_grace_period))
    696             if not response.ok:
--> 697                 raise ConnectionAbortedError(
    698                     f"Initialization failure from server:\n{response.msg}")
    699 

ConnectionAbortedError: Initialization failure from server:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/server/proxier.py", line 622, in Datapath
    client_id, job_config):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/server/proxier.py", line 272, in start_specific_server
    serialized_runtime_env = job_config.get_serialized_runtime_env()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/job_config.py", line 99, in get_serialized_runtime_env
    return self._parsed_runtime_env.serialize()
AttributeError: 'JobConfig' object has no attribute '_parsed_runtime_env'
  1. The expected result is connecting normally.
  2. The general service information:
(base) ➜  samples git:(master) ✗ kubectl get po -owide
NAME                         READY   STATUS    RESTARTS   AGE   IP           NODE              NOMINATED NODE   READINESS GATES
raycluster-mini-head-mls5k   1/1     Running   0          15m   10.244.2.4   ray-test-worker   <none>           <none>
(base) ➜  samples git:(master) ✗ kubectl get svc
NAME                       TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                       AGE
kubernetes                 ClusterIP   10.96.0.1       <none>        443/TCP                       17m
raycluster-mini-head-svc   ClusterIP   10.96.112.210   <none>        6379/TCP,8265/TCP,10001/TCP   15m
(base) ➜  samples git:(master) ✗ kubectl get endpoints
NAME                       ENDPOINTS                                          AGE
kubernetes                 172.18.0.4:6443                                    17m
raycluster-mini-head-svc   10.244.2.4:8265,10.244.2.4:6379,10.244.2.4:10001   15m
(base) ➜  samples git:(master) ✗ kubectl logs raycluster-mini-head-mls5k
2022-02-13 23:38:19,962	INFO services.py:1272 -- View the Ray dashboard at http://10.244.2.4:8265
(base) ➜  samples git:(master) ✗ kubectl port-forward service/raycluster-mini-head-svc 10001:10001
Forwarding from 127.0.0.1:10001 -> 10001
Forwarding from [::1]:10001 -> 10001
Handling connection for 10001
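The server-side `AttributeError` in the traceback can be reproduced in isolation: a `JobConfig` from an older Ray release never sets `_parsed_runtime_env`, while the code path exercised by a newer client expects it. A minimal sketch with a hypothetical stand-in class (not Ray's actual `JobConfig`):

```python
# Hypothetical stand-in for an older JobConfig: the attribute is never set,
# so the newer serialization path fails exactly as in the traceback above.
class OldJobConfig:
    def get_serialized_runtime_env(self):
        return self._parsed_runtime_env.serialize()

try:
    OldJobConfig().get_serialized_runtime_env()
except AttributeError as err:
    # e.g. 'OldJobConfig' object has no attribute '_parsed_runtime_env'
    print(err)
```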


Reproduction script

kind cluster config:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-labels: "ingress-ready=true"
  extraPortMappings:
  - containerPort: 80
    hostPort: 80
    protocol: TCP
  - containerPort: 443
    hostPort: 443
    protocol: TCP
  - containerPort: 10001
    hostPort: 10001
    protocol: TCP
- role: worker
- role: worker

kuberay installation

kubectl apply -k "github.com/ray-project/kuberay/manifests/cluster-scope-resources"
kubectl apply -k "github.com/ray-project/kuberay/manifests/base"

ray-cluster configuration:

git clone https://github.com/ray-project/kuberay.git
kubectl apply -f "kuberay/ray-operator/config/samples/ray-cluster.mini.yaml"

I also used port-forward to redirect the service to localhost:10001:

 kubectl port-forward service/raycluster-mini-head-svc 10001:10001

Anything else

No response

Are you willing to submit a PR?

  • [X] Yes I am willing to submit a PR!

scarlet25151 avatar Feb 14 '22 08:02 scarlet25151

It seems your local ray version is not compatible with the cluster.

chenk008 avatar Feb 14 '22 08:02 chenk008

It seems your local ray version is not compatible with the cluster.

Yeah, it should be a version issue. I hit the same problem using Ray 1.9.2 (client) and Ray 1.8.0 (server).

After I changed to 1.9.2, the reported error was gone. However, I then hit a `ConnectionError: ray client connection timeout` issue.
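Ray Client compares client and server versions during the connection handshake (the `_check_versions` call visible in the traceback above), so mixing 1.9.2 and 1.8.0 is expected to fail. A rough sketch of that parity check — the helper name is hypothetical, and real Ray also compares the Python version and honors an `ignore_version` escape hatch:

```python
# Hypothetical sketch of the client/server version parity Ray Client enforces;
# not Ray's actual API, just an illustration of why 1.9.2 vs 1.8.0 fails.
def versions_compatible(client_version: str, server_version: str) -> bool:
    return client_version == server_version

print(versions_compatible("1.9.2", "1.8.0"))  # the mismatch in this thread -> False
print(versions_compatible("1.9.2", "1.9.2"))  # matching versions -> True
```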

docker run -it --network host rayproject/ray:1.9.2 bash

NodePort works for me, but port-forward doesn't seem to work well in my environment. @scarlet25151 could you help test it with a consistent Ray version?
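For reference, a NodePort Service exposing the Ray Client port could look like the sketch below. The Service name, selector label, and nodePort value are all illustrative assumptions, not taken from the sample manifest; check your head pod's actual labels before applying anything like this:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: raycluster-mini-head-nodeport   # illustrative name
spec:
  type: NodePort
  selector:
    ray.io/node-type: head              # assumption: must match your head pod's labels
  ports:
  - name: client
    port: 10001
    targetPort: 10001
    nodePort: 30001                     # illustrative; NodePorts must fall in 30000-32767
```

With such a Service in place, the client would connect with `ray.init("ray://<node-ip>:30001")` instead of going through a port-forward.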

Jeffwan avatar Feb 14 '22 10:02 Jeffwan


Yep, after trying image version 1.9.2 with a client of the same version, everything works normally with the ClusterIP service and port-forward. I think there is some issue related to a version-compatibility restriction in the upstream Ray repo; this issue can be closed.

scarlet25151 avatar Feb 14 '22 23:02 scarlet25151

@scarlet25151 Can you help add some best practices here https://github.com/ray-project/kuberay/tree/master/docs/best-practice covering Ray version compatibility? This would be a great example of a runbook users can use to self-diagnose.

Jeffwan avatar Feb 15 '22 00:02 Jeffwan

@scarlet25151 Can you help add some best practices here https://github.com/ray-project/kuberay/tree/master/docs/best-practice covering Ray version compatibility? This would be a great example of a runbook users can use to self-diagnose.

Yes, sure. I think it may fit better as a troubleshooting check, so I opened a new folder and put it in #154.

scarlet25151 avatar Feb 19 '22 00:02 scarlet25151

The issue is fixed. Closing it.

kevin85421 avatar Apr 27 '23 00:04 kevin85421