[Bug] Uncommon connection error `JobConfig has no attribute '_parsed_runtime_env'` when connecting to a Ray cluster generated by KubeRay
Search before asking
- [X] I searched the issues and found no similar issues.
KubeRay Component
Others
What happened + What you expected to happen
- When I connect to a Ray cluster created from the sample
kuberay/ray-operator/config/samples/ray-cluster.mini.yaml
using ray.init("ray://127.0.0.1:10001"),
a connection error is raised showing that an attribute is missing.
---------------------------------------------------------------------------
ConnectionAbortedError Traceback (most recent call last)
/var/folders/nr/8n943fms20v157b066tvxrbw0000gp/T/ipykernel_46851/3704460965.py in <module>
----> 1 ray.init("ray://127.0.0.1:10001")
/opt/miniconda3/lib/python3.8/site-packages/ray/_private/client_mode_hook.py in wrapper(*args, **kwargs)
103 if func.__name__ != "init" or is_client_mode_enabled_by_default:
104 return getattr(ray, func.__name__)(*args, **kwargs)
--> 105 return func(*args, **kwargs)
106
107 return wrapper
/opt/miniconda3/lib/python3.8/site-packages/ray/worker.py in init(address, num_cpus, num_gpus, resources, object_store_memory, local_mode, ignore_reinit_error, include_dashboard, dashboard_host, dashboard_port, job_config, configure_logging, logging_level, logging_format, log_to_driver, namespace, runtime_env, _enable_object_reconstruction, _redis_max_memory, _plasma_directory, _node_ip_address, _driver_object_store_memory, _memory, _redis_password, _temp_dir, _metrics_export_port, _system_config, _tracing_startup_hook, **kwargs)
773 passed_kwargs.update(kwargs)
774 builder._init_args(**passed_kwargs)
--> 775 return builder.connect()
776
777 if kwargs:
/opt/miniconda3/lib/python3.8/site-packages/ray/client_builder.py in connect(self)
149 old_ray_cxt = ray.util.client.ray.set_context(None)
150
--> 151 client_info_dict = ray.util.client_connect.connect(
152 self.address,
153 job_config=self._job_config,
/opt/miniconda3/lib/python3.8/site-packages/ray/util/client_connect.py in connect(conn_str, secure, metadata, connection_retries, job_config, namespace, ignore_version, _credentials, ray_init_kwargs)
31 # for supporting things like cert_path, ca_path, etc and creating
32 # the correct metadata
---> 33 conn = ray.connect(
34 conn_str,
35 job_config=job_config,
/opt/miniconda3/lib/python3.8/site-packages/ray/util/client/__init__.py in connect(self, *args, **kw_args)
226 def connect(self, *args, **kw_args):
227 self.get_context()._inside_client_test = self._inside_client_test
--> 228 conn = self.get_context().connect(*args, **kw_args)
229 global _lock, _all_contexts
230 with _lock:
/opt/miniconda3/lib/python3.8/site-packages/ray/util/client/__init__.py in connect(self, conn_str, job_config, secure, metadata, connection_retries, namespace, ignore_version, _credentials, ray_init_kwargs)
86 connection_retries=connection_retries)
87 self.api.worker = self.client_worker
---> 88 self.client_worker._server_init(job_config, ray_init_kwargs)
89 conn_info = self.client_worker.connection_info()
90 self._check_versions(conn_info, ignore_version)
/opt/miniconda3/lib/python3.8/site-packages/ray/util/client/worker.py in _server_init(self, job_config, ray_init_kwargs)
695 reconnect_grace_period=self._reconnect_grace_period))
696 if not response.ok:
--> 697 raise ConnectionAbortedError(
698 f"Initialization failure from server:\n{response.msg}")
699
ConnectionAbortedError: Initialization failure from server:
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/server/proxier.py", line 622, in Datapath
client_id, job_config):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/server/proxier.py", line 272, in start_specific_server
serialized_runtime_env = job_config.get_serialized_runtime_env()
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/job_config.py", line 99, in get_serialized_runtime_env
return self._parsed_runtime_env.serialize()
AttributeError: 'JobConfig' object has no attribute '_parsed_runtime_env'
- Expected result: connecting normally.
- The general service information:
(base) ➜ samples git:(master) ✗ kubectl get po -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
raycluster-mini-head-mls5k 1/1 Running 0 15m 10.244.2.4 ray-test-worker <none> <none>
(base) ➜ samples git:(master) ✗ kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 17m
raycluster-mini-head-svc ClusterIP 10.96.112.210 <none> 6379/TCP,8265/TCP,10001/TCP 15m
(base) ➜ samples git:(master) ✗ kubectl get endpoints
NAME ENDPOINTS AGE
kubernetes 172.18.0.4:6443 17m
raycluster-mini-head-svc 10.244.2.4:8265,10.244.2.4:6379,10.244.2.4:10001 15m
(base) ➜ samples git:(master) ✗ kubectl logs raycluster-mini-head-mls5k
2022-02-13 23:38:19,962 INFO services.py:1272 -- View the Ray dashboard at http://10.244.2.4:8265
(base) ➜ samples git:(master) ✗ kubectl port-forward service/raycluster-mini-head-svc 10001:10001
Forwarding from 127.0.0.1:10001 -> 10001
Forwarding from [::1]:10001 -> 10001
Handling connection for 10001
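The traceback above fails inside `_server_init`, before the client's own `_check_versions` call ever runs, so a client/server version mismatch surfaces as this obscure `AttributeError` instead of a clean version warning. A minimal sketch of the kind of pre-flight check that would catch it (the helper name and the exact-match rule are assumptions, not Ray API):

```python
def versions_compatible(client_version: str, server_version: str) -> bool:
    """Ray Client generally expects the client and server to run the
    same Ray release; treat any difference as incompatible."""
    return client_version == server_version

# The combination reported later in this thread:
print(versions_compatible("1.9.2", "1.8.0"))  # mismatch -> False
print(versions_compatible("1.9.2", "1.9.2"))  # matched versions -> True
```

In practice the client version comes from `ray.__version__` and the server version from the `rayproject/ray` image tag used in the RayCluster manifest.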
Reproduction script
kind cluster config:
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
kubeadmConfigPatches:
- |
kind: InitConfiguration
nodeRegistration:
kubeletExtraArgs:
node-labels: "ingress-ready=true"
extraPortMappings:
- containerPort: 80
hostPort: 80
protocol: TCP
- containerPort: 443
hostPort: 443
protocol: TCP
- containerPort: 10001
hostPort: 10001
protocol: TCP
- role: worker
- role: worker
KubeRay installation:
kubectl apply -k "github.com/ray-project/kuberay/manifests/cluster-scope-resources"
kubectl apply -k "github.com/ray-project/kuberay/manifests/base"
ray-cluster configuration:
git clone https://github.com/ray-project/kuberay.git
kubectl apply -f "kuberay/ray-operator/config/samples/ray-cluster.mini.yaml"
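Since `ray-cluster.mini.yaml` pins a specific `rayproject/ray` image, the locally pip-installed client should match that image's tag. A hypothetical helper to compare the two (the image reference and versions below are illustrative, not read from the real manifest):

```python
def ray_image_tag(image_ref: str) -> str:
    """Extract the version tag from a container image reference,
    e.g. 'rayproject/ray:1.8.0' -> '1.8.0'; default to 'latest'."""
    name, _, tag = image_ref.rpartition(":")
    if name and "/" not in tag:
        return tag
    return "latest"

# Illustrative values: read the real image from the RayCluster spec
# (kubectl get raycluster -o yaml) and the client side from ray.__version__.
print(ray_image_tag("rayproject/ray:1.8.0"))  # 1.8.0
print(ray_image_tag("rayproject/ray"))        # latest
```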
I also used a port-forward to redirect the service to localhost:10001:
kubectl port-forward service/raycluster-mini-head-svc 10001:10001
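Before calling ray.init("ray://127.0.0.1:10001"), it can help to confirm the forwarded port is actually accepting TCP connections; a small sketch using only the standard library (the helper is hypothetical, not part of Ray or kubectl):

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# With the port-forward above running, this should print True.
print(port_open("127.0.0.1", 10001))
```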
Anything else
No response
Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!
It seems your local ray version is not compatible with the cluster.
Yeah, it should be a version issue. I met the same problem using Ray 1.9.2 (client) and Ray 1.8.0 (server).
After changing to 1.9.2 the reported error is gone. However, I then hit a ConnectionError: ray client connection timeout
issue.
docker run -it --network host rayproject/ray:1.9.2 bash
NodePort works for me but port-forward doesn't seem work well with my env. @scarlet25151 Could you help test it with consistent ray version?
Yep, after trying image version 1.9.2 with a matching client version, everything works normally with the ClusterIP service and port-forward. I think there is some restriction-related issue in the upstream Ray repo; this issue can be closed.
@scarlet25151 Can you help add some best practices here https://github.com/ray-project/kuberay/tree/master/docs/best-practice covering Ray version compatibility? This would be a great example of a runbook users can use to self-diagnose.
Yes, sure. I think that may fit better as a troubleshooting check, so I opened a new folder and put it in #154.
The issue is fixed. Closing it.