ipex-llm
RayOnSpark on K8s can not enable dashboard
I cannot enable the Ray dashboard when using RayOnSpark on K8s in cluster mode.
Environment
- python=3.6.10
- ray=1.9.2
- protobuf=3.19.4
- prometheus-client=0.14.1
code:
sc = init_orca_context(cluster_mode="spark-submit", init_ray_on_spark=True, include_webui=True)
# I have also tried setting dashboard-host to 0.0.0.0, but the same error still occurs
extra_params={"dashboard-host": "0.0.0.0"}
sc = init_orca_context(cluster_mode="spark-submit", init_ray_on_spark=True, include_webui=True, extra_params=extra_params)
Error log in dashboard_agent.log
on worker:
2022-07-06 13:32:16,935 ERROR agent.py:415 -- [Errno -2] Name or service not known
Traceback (most recent call last):
File "/opt/spark/work-dir/python_env/lib/python3.6/site-packages/ray/dashboard/agent.py", line 376, in <module>
loop.run_until_complete(agent.run())
File "/opt/spark/work-dir/python_env/lib/python3.6/asyncio/base_events.py", line 488, in run_until_complete
return future.result()
File "/opt/spark/work-dir/python_env/lib/python3.6/site-packages/ray/dashboard/agent.py", line 164, in run
modules = self._load_modules()
File "/opt/spark/work-dir/python_env/lib/python3.6/site-packages/ray/dashboard/agent.py", line 111, in _load_modules
c = cls(self)
File "/opt/spark/work-dir/python_env/lib/python3.6/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 155, in __init__
dashboard_agent.metrics_export_port)
File "/opt/spark/work-dir/python_env/lib/python3.6/site-packages/ray/_private/metrics_agent.py", line 79, in __init__
address=metrics_export_address)))
File "/opt/spark/work-dir/python_env/lib/python3.6/site-packages/ray/_private/prometheus_exporter.py", line 333, in new_stats_exporter
options=option, gatherer=option.registry, collector=collector)
File "/opt/spark/work-dir/python_env/lib/python3.6/site-packages/ray/_private/prometheus_exporter.py", line 265, in __init__
self.serve_http()
File "/opt/spark/work-dir/python_env/lib/python3.6/site-packages/ray/_private/prometheus_exporter.py", line 320, in serve_http
port=self.options.port, addr=str(self.options.address))
File "/opt/spark/work-dir/python_env/lib/python3.6/site-packages/prometheus_client/exposition.py", line 168, in start_wsgi_server
TmpServer.address_family, addr = _get_best_family(addr, port)
File "/opt/spark/work-dir/python_env/lib/python3.6/site-packages/prometheus_client/exposition.py", line 157, in _get_best_family
infos = socket.getaddrinfo(address, port)
File "/opt/spark/work-dir/python_env/lib/python3.6/socket.py", line 745, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
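The traceback bottoms out in `socket.getaddrinfo`: the Prometheus exporter tries to resolve the node's hostname and fails. A minimal sketch reproducing the same `gaierror` (the hostname below is hypothetical; `.invalid` is a reserved TLD that never resolves, similar to a pod hostname missing from DNS):

```python
import socket

try:
    # Resolving a hostname that DNS does not know about fails the same way
    # the Prometheus exporter does when the pod hostname is unresolvable.
    socket.getaddrinfo("ray-worker.invalid", 8080)
except socket.gaierror as err:
    print(err)  # e.g. [Errno -2] Name or service not known
```

This suggests the agent crashes before it can even bind the metrics port, which is why restarting it 5 times does not help.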
Error log printed on driver:
2022-07-06 06:14:06,062 WARNING worker.py:1245 -- (ip=10.244.7.11) The agent on node bert-c3e38e81d223777b-exec-1 failed to be restarted 5 times. There are 3 possible problems if you see this error.
1. The dashboard might not display correct information on this node.
2. Metrics on this node won't be reported.
3. runtime_env APIs won't work.
Check out the `dashboard_agent.log` to see the detailed failure messages.
I had the same problem when deploying on a YARN cluster; the difference is that I didn't specify the init_ray_on_spark, include_webui, or dashboard-host params. This error log was printed during model training.
Environment
- bigdl-dllib=2.0.0
- bigdl-friesian=2.0.0
- ray=1.9.2
- tensorflow=2.9.1
init code
sc = init_orca_context("yarn", cores=36,
num_nodes=3, memory="100g",
driver_cores=12, driver_memory="36g",
conf=conf, object_store_memory="80g",
env={"KMP_BLOCKTIME": "1",
"KMP_AFFINITY": "granularity=fine,compact,1,0",
"OMP_NUM_THREADS": "28"})
Error log in dashboard_agent.log
2022-07-12 15:32:56,916 ERROR gcs_utils.py:137 -- Failed to send request to gcs, reconnecting. Error <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1657611176.915955922","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3128,"referenced_errors":[{"created":"@1657611176.915954278","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
>
2022-07-12 15:32:57,919 ERROR reporter_agent.py:545 -- Error publishing node physical stats.
Traceback (most recent call last):
File "/root/anaconda3/envs/recsys-demo-bigdl/lib/python3.7/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 535, in _perform_iteration
DEBUG_AUTOSCALING_STATUS)
File "/root/anaconda3/envs/recsys-demo-bigdl/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/root/anaconda3/envs/recsys-demo-bigdl/lib/python3.7/site-packages/ray/experimental/internal_kv.py", line 43, in _internal_kv_get
return global_gcs_client.internal_kv_get(key, namespace)
File "/root/anaconda3/envs/recsys-demo-bigdl/lib/python3.7/site-packages/ray/_private/gcs_utils.py", line 130, in wrapper
return f(self, *args, **kwargs)
File "/root/anaconda3/envs/recsys-demo-bigdl/lib/python3.7/site-packages/ray/_private/gcs_utils.py", line 233, in internal_kv_get
reply = self._kv_stub.InternalKVGet(req)
File "/root/anaconda3/envs/recsys-demo-bigdl/lib/python3.7/site-packages/grpc/_channel.py", line 946, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/root/anaconda3/envs/recsys-demo-bigdl/lib/python3.7/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1657611177.919050357","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3128,"referenced_errors":[{"created":"@1657611177.919048749","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
Strange that we are getting this error even though include_webui defaults to False...
Actually it defaults to True in RayOnSparkContext.
Pending discussion on whether to change the default to False.
The same problem also occurs when running on a YARN cluster, which may need further verification.
@yushan111 If you have time, can you help verify if you can still use dashboard on yarn?
Sure. I tested yesterday and the dashboard on yarn works fine.
Seems colab local mode also has this issue...
The same issue exists when I run on k8s local mode. Maybe it is related to the virtual IP. Does pure Ray have the same issue, or is it only introduced by RayOnSpark?
It is due to the version of prometheus-client. The error occurs when using 0.14.1, while 0.11.0 works.
We have tested on both YARN and Kubernetes, with Python 3.6 and 3.7.
Versions other than 0.11.0 haven't been tested yet; other versions below 0.14.1 may also work. We could conduct more tests on different versions.
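Per the first traceback above, prometheus-client 0.14.1's `start_wsgi_server` goes through `_get_best_family`, which resolves the listen address with `socket.getaddrinfo`; older versions did not perform that resolution step, which would explain why pinning helps. A minimal sketch of the workaround as a requirements-file pin (version chosen per the tests above; untested versions between 0.11.0 and 0.14.1 might also work):

```
prometheus-client==0.11.0
```

Pin this in the environment that is packaged for the Spark executors, since that is where the dashboard agent runs.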