
RayOnSpark on K8s cannot enable dashboard

Open shanyu-sys opened this issue 2 years ago • 10 comments

I cannot enable the Ray dashboard when using RayOnSpark on K8s in cluster mode.

Environment

  • python=3.6.10
  • ray=1.9.2
  • protobuf=3.19.4
  • prometheus-client = 0.14.1

Code:

from bigdl.orca import init_orca_context

sc = init_orca_context(cluster_mode="spark-submit", init_ray_on_spark=True, include_webui=True)

# I have also tried setting dashboard-host to 0.0.0.0, but the same error still occurs
extra_params = {"dashboard-host": "0.0.0.0"}
sc = init_orca_context(cluster_mode="spark-submit", init_ray_on_spark=True, include_webui=True, extra_params=extra_params)

Error log in dashboard_agent.log on worker:

2022-07-06 13:32:16,935 ERROR agent.py:415 -- [Errno -2] Name or service not known
Traceback (most recent call last):
  File "/opt/spark/work-dir/python_env/lib/python3.6/site-packages/ray/dashboard/agent.py", line 376, in <module>
    loop.run_until_complete(agent.run())
  File "/opt/spark/work-dir/python_env/lib/python3.6/asyncio/base_events.py", line 488, in run_until_complete
    return future.result()
  File "/opt/spark/work-dir/python_env/lib/python3.6/site-packages/ray/dashboard/agent.py", line 164, in run
    modules = self._load_modules()
  File "/opt/spark/work-dir/python_env/lib/python3.6/site-packages/ray/dashboard/agent.py", line 111, in _load_modules
    c = cls(self)
  File "/opt/spark/work-dir/python_env/lib/python3.6/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 155, in __init__
    dashboard_agent.metrics_export_port)
  File "/opt/spark/work-dir/python_env/lib/python3.6/site-packages/ray/_private/metrics_agent.py", line 79, in __init__
    address=metrics_export_address)))
  File "/opt/sark/work-dir/python_env/lib/python3.6/site-packages/ray/_private/prometheus_exporter.py", line 333, in new_stats_exporter
    options=option, gatherer=option.registry, collector=collector)
  File "/opt/spark/work-dir/python_env/lib/python3.6/site-packages/ray/_private/prometheus_exporter.py", line 265, in __init__
    self.serve_http()
  File "/opt/spark/work-dir/python_env/lib/python3.6/site-packages/ray/_private/prometheus_exporter.py", line 320, in serve_http
    port=self.options.port, addr=str(self.options.address))
  File "/opt/spark/work-dir/python_env/lib/python3.6/site-packages/prometheus_client/exposition.py", line 168, in start_wsgi_server
    TmpServer.address_family, addr = _get_best_family(addr, port)
  File "/opt/spark/work-dir/python_env/lib/python3.6/site-packages/prometheus_client/exposition.py", line 157, in _get_best_family
    infos = socket.getaddrinfo(address, port)
  File "/opt/spark/work-dir/python_env/lib/python3.6/socket.py", line 745, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
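
For reference, the traceback bottoms out in prometheus-client's `_get_best_family`, which calls `socket.getaddrinfo` on the bind address before starting the metrics server. A minimal sketch of that resolution step (outside of Ray, and using the pod hostname only as an assumed example of an address that may not resolve) reproduces the same gaierror:

import socket

# Sketch of the resolution step prometheus-client 0.14.x performs before
# binding its WSGI server; any address the pod cannot resolve raises Errno -2.
address = socket.gethostname()  # substitute the address the dashboard agent binds to
try:
    socket.getaddrinfo(address, 8080)
    print(f"{address} resolves; the metrics exporter should be able to bind")
except socket.gaierror as err:
    print(f"{address} does not resolve: {err}")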

Error log printed on driver:

2022-07-06 06:14:06,062 WARNING worker.py:1245 -- (ip=10.244.7.11) The agent on node bert-c3e38e81d223777b-exec-1 failed to be restarted 5 times. There are 3 possible problems if you see this error.
  1. The dashboard might not display correct information on this node.
  2. Metrics on this node won't be reported.
  3. runtime_env APIs won't work.
Check out the `dashboard_agent.log` to see the detailed failure messages.

shanyu-sys avatar Jul 07 '22 02:07 shanyu-sys

I had the same problem when deploying on a YARN cluster; the difference is that I didn't specify the init_ray_on_spark, include_webui, or dashboard-host params. This error log was printed during model training.

Environment

  • bigdl-dllib=2.0.0
  • bigdl-friesian=2.0.0
  • ray=1.9.2
  • tensorflow=2.9.1

Init code:

from bigdl.orca import init_orca_context

sc = init_orca_context("yarn", cores=36,
                       num_nodes=3, memory="100g",
                       driver_cores=12, driver_memory="36g",
                       conf=conf, object_store_memory="80g",
                       env={"KMP_BLOCKTIME": "1",
                            "KMP_AFFINITY": "granularity=fine,compact,1,0",
                            "OMP_NUM_THREADS": "28"})

Error log in dashboard_agent.log

2022-07-12 15:32:56,916 ERROR gcs_utils.py:137 -- Failed to send request to gcs, reconnecting. Error <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"created":"@1657611176.915955922","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3128,"referenced_errors":[{"created":"@1657611176.915954278","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
>
2022-07-12 15:32:57,919 ERROR reporter_agent.py:545 -- Error publishing node physical stats.
Traceback (most recent call last):
  File "/root/anaconda3/envs/recsys-demo-bigdl/lib/python3.7/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 535, in _perform_iteration
    DEBUG_AUTOSCALING_STATUS)
  File "/root/anaconda3/envs/recsys-demo-bigdl/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/recsys-demo-bigdl/lib/python3.7/site-packages/ray/experimental/internal_kv.py", line 43, in _internal_kv_get
    return global_gcs_client.internal_kv_get(key, namespace)
  File "/root/anaconda3/envs/recsys-demo-bigdl/lib/python3.7/site-packages/ray/_private/gcs_utils.py", line 130, in wrapper
    return f(self, *args, **kwargs)
  File "/root/anaconda3/envs/recsys-demo-bigdl/lib/python3.7/site-packages/ray/_private/gcs_utils.py", line 233, in internal_kv_get
    reply = self._kv_stub.InternalKVGet(req)
  File "/root/anaconda3/envs/recsys-demo-bigdl/lib/python3.7/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/root/anaconda3/envs/recsys-demo-bigdl/lib/python3.7/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"created":"@1657611177.919050357","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3128,"referenced_errors":[{"created":"@1657611177.919048749","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"

hoshibara avatar Jul 12 '22 10:07 hoshibara

Strange that we are getting this error even though include_webui defaults to False...

hkvision avatar Jul 13 '22 02:07 hkvision

Strange that we are getting this error even though include_webui defaults to False...

Actually, it defaults to True in RayOnSparkContext.

shanyu-sys avatar Jul 13 '22 02:07 shanyu-sys

Pending discussion on whether to change the default to False?
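
Until that is settled, a possible user-side mitigation (a sketch only, reusing the arguments shown at the top of this issue; whether it also prevents the per-node dashboard agent from starting would still need verification) is to turn the web UI off explicitly when it isn't needed:

from bigdl.orca import init_orca_context

# Explicitly disable the Ray dashboard instead of relying on the current default
# (sketch; arguments mirror the ones used earlier in this thread).
sc = init_orca_context(cluster_mode="spark-submit",
                       init_ray_on_spark=True,
                       include_webui=False)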

hkvision avatar Jul 13 '22 05:07 hkvision

The same problem also occurs when running on a YARN cluster; this may need further verification.

lalalapotter avatar Jul 23 '22 03:07 lalalapotter

@yushan111 If you have time, can you help verify whether the dashboard still works on YARN?

hkvision avatar Jul 23 '22 03:07 hkvision

@yushan111 If you have time, can you help verify whether the dashboard still works on YARN?

Sure. I tested yesterday and the dashboard on yarn works fine.

shanyu-sys avatar Jul 26 '22 01:07 shanyu-sys

Seems Colab local mode also has this issue...

hkvision avatar Jul 26 '22 02:07 hkvision

(screenshot attached)

hkvision avatar Jul 26 '22 02:07 hkvision

Seems Colab local mode also has this issue...

The same issue exists when I run in K8s local mode. Maybe it is related to the virtual IP. Does pure Ray have the same issue, or is it only introduced by RayOnSpark?

shanyu-sys avatar Jul 26 '22 03:07 shanyu-sys

It is caused by the version of prometheus-client: the error occurs with 0.14.1, while 0.11.0 works.

We have tested on both YARN and Kubernetes, with Python 3.6 and 3.7.

Versions other than 0.11.0 haven't been tested yet; other versions below 0.14.1 may also work. We could run more tests on different versions.
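
For anyone hitting this in the meantime, pinning prometheus-client to the known-good version (0.11.0, per the tests above) is the straightforward workaround. A small pre-flight check along these lines could also flag the bad version before launching RayOnSpark (a sketch only; versions between 0.11.0 and 0.14.1 are untested):

import prometheus_client

# Known-good / known-bad versions from the tests reported in this issue;
# anything in between has not been verified.
KNOWN_GOOD = "0.11.0"
KNOWN_BAD = "0.14.1"

installed = prometheus_client.__version__
if installed == KNOWN_BAD:
    print(f"prometheus-client {installed} is known to break the Ray dashboard agent; "
          f"consider pinning prometheus-client=={KNOWN_GOOD}")
else:
    print(f"prometheus-client {installed} installed")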

shanyu-sys avatar Aug 26 '22 04:08 shanyu-sys