Deployment configuration error - request reached a non-leader H2O node

KunfuPanda24 opened this issue 2 years ago (status: open, 6 comments)

I am facing the following issue when submitting multiple requests to the Sparkling Water external backend on Kubernetes (k8s).

22/05/21 12:54:01 INFO H2OContext: Trying to lock H2O cluster h2o-service.sparkling-water.svc.cluster.local:54321 - root.
22/05/21 12:54:01 WARN H2OContext: Locking of the H2O cluster h2o-service.sparkling-water.svc.cluster.local:54321 - root failed.
ai.h2o.sparkling.backend.exceptions.RestApiNotReachableException: H2O node http://h2o-service.sparkling-water.svc.cluster.local:54321 is not reachable.
Please verify that you are passing ip and port of existing cluster node and the cluster
is running with web enabled.
	at ai.h2o.sparkling.backend.utils.RestCommunication.throwRestApiNotReachableException(RestCommunication.scala:433)
	at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent(RestCommunication.scala:390)
	at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent$(RestCommunication.scala:370)
	at ai.h2o.sparkling.backend.utils.RestApiUtils$.readURLContent(RestApiUtils.scala:96)
	at ai.h2o.sparkling.backend.utils.RestCommunication.request(RestCommunication.scala:182)
	at ai.h2o.sparkling.backend.utils.RestCommunication.request$(RestCommunication.scala:172)
	at ai.h2o.sparkling.backend.utils.RestApiUtils$.request(RestApiUtils.scala:96)
	at ai.h2o.sparkling.backend.utils.RestCommunication.update(RestCommunication.scala:88)
	at ai.h2o.sparkling.backend.utils.RestCommunication.update$(RestCommunication.scala:81)
	at ai.h2o.sparkling.backend.utils.RestApiUtils$.update(RestApiUtils.scala:96)
	at ai.h2o.sparkling.backend.utils.H2OContextExtensions.lockCloud(H2OContextExtensions.scala:235)
	at ai.h2o.sparkling.backend.utils.H2OContextExtensions.tryToLockCloud(H2OContextExtensions.scala:122)
	at ai.h2o.sparkling.backend.utils.H2OContextExtensions.getAndVerifyWorkerNodes(H2OContextExtensions.scala:137)
	at ai.h2o.sparkling.backend.utils.H2OContextExtensions.getAndVerifyWorkerNodes$(H2OContextExtensions.scala:131)
	at ai.h2o.sparkling.H2OContext.getAndVerifyWorkerNodes(H2OContext.scala:65)
	at ai.h2o.sparkling.H2OContext.<init>(H2OContext.scala:85)
	at ai.h2o.sparkling.H2OContext$.getOrCreate(H2OContext.scala:470)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.net.ConnectException: Connection refused (Connection refused)
	at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.base/java.net.AbstractPlainSocketImpl.doConnect(Unknown Source)
	at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(Unknown Source)
	at java.base/java.net.AbstractPlainSocketImpl.connect(Unknown Source)
	at java.base/java.net.Socket.connect(Unknown Source)
	at java.base/java.net.Socket.connect(Unknown Source)
	at java.base/sun.net.NetworkClient.doConnect(Unknown Source)
	at java.base/sun.net.www.http.HttpClient.openServer(Unknown Source)
	at java.base/sun.net.www.http.HttpClient.openServer(Unknown Source)
	at java.base/sun.net.www.http.HttpClient.<init>(Unknown Source)
	at java.base/sun.net.www.http.HttpClient.New(Unknown Source)
	at java.base/sun.net.www.http.HttpClient.New(Unknown Source)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(Unknown Source)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect0(Unknown Source)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect(Unknown Source)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.connect(Unknown Source)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getOutputStream0(Unknown Source)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getOutputStream(Unknown Source)
	at ai.h2o.sparkling.backend.utils.RestCommunication.setHeaders(RestCommunication.scala:347)
	at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent(RestCommunication.scala:385)
	... 26 more
22/05/21 12:54:11 INFO H2OContext: Trying to lock H2O cluster h2o-service.sparkling-water.svc.cluster.local:54321 - root.
22/05/21 12:54:11 WARN H2OContext: Locking of the H2O cluster h2o-service.sparkling-water.svc.cluster.local:54321 - root failed.
ai.h2o.sparkling.backend.exceptions.RestApiNotReachableException: H2O node http://h2o-service.sparkling-water.svc.cluster.local:54321 is not reachable.
	... (stack trace identical to the 12:54:01 attempt, caused by java.net.ConnectException: Connection refused)
22/05/21 12:54:21 INFO H2OContext: Trying to lock H2O cluster h2o-service.sparkling-water.svc.cluster.local:54321 - root.
22/05/21 12:54:21 WARN H2OContext: Locking of the H2O cluster h2o-service.sparkling-water.svc.cluster.local:54321 - root failed.
ai.h2o.sparkling.backend.exceptions.RestApiNotReachableException: H2O node http://h2o-service.sparkling-water.svc.cluster.local:54321 is not reachable.
	... (stack trace identical to the 12:54:01 attempt, caused by java.net.ConnectException: Connection refused)
22/05/21 12:54:31 INFO H2OContext: Trying to lock H2O cluster h2o-service.sparkling-water.svc.cluster.local:54321 - root.
22/05/21 12:54:31 WARN H2OContext: Locking of the H2O cluster h2o-service.sparkling-water.svc.cluster.local:54321 - root failed.
ai.h2o.sparkling.backend.exceptions.RestApiNotReachableException: H2O node http://h2o-service.sparkling-water.svc.cluster.local:54321 is not reachable.
	... (stack trace identical to the 12:54:01 attempt, caused by java.net.ConnectException: Connection refused)
22/05/21 12:54:41 INFO H2OContext: Trying to lock H2O cluster h2o-service.sparkling-water.svc.cluster.local:54321 - root.
22/05/21 12:54:41 WARN H2OContext: Locking of the H2O cluster h2o-service.sparkling-water.svc.cluster.local:54321 - root failed.
ai.h2o.sparkling.backend.exceptions.RestApiCommunicationException: H2O node http://h2o-service.sparkling-water.svc.cluster.local:54321 responded with
Status code: 403 : Deployment configuration error - request reached a non-leader H2O node.
Server error: <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 403 Deployment configuration error - request reached a non-leader H2O node.</title>
</head>
<body><h2>HTTP ERROR 403</h2>
<p>Problem accessing /3/CloudLock. Reason:
<pre>    Deployment configuration error - request reached a non-leader H2O node.</pre></p>
</body>
</html>

	at ai.h2o.sparkling.backend.utils.RestCommunication.checkResponseCode(RestCommunication.scala:414)
	at ai.h2o.sparkling.backend.utils.RestCommunication.checkResponseCode$(RestCommunication.scala:394)
	at ai.h2o.sparkling.backend.utils.RestApiUtils$.checkResponseCode(RestApiUtils.scala:96)
	at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent(RestCommunication.scala:386)
	at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent$(RestCommunication.scala:370)
	at ai.h2o.sparkling.backend.utils.RestApiUtils$.readURLContent(RestApiUtils.scala:96)
	at ai.h2o.sparkling.backend.utils.RestCommunication.request(RestCommunication.scala:182)
	at ai.h2o.sparkling.backend.utils.RestCommunication.request$(RestCommunication.scala:172)
	at ai.h2o.sparkling.backend.utils.RestApiUtils$.request(RestApiUtils.scala:96)
	at ai.h2o.sparkling.backend.utils.RestCommunication.update(RestCommunication.scala:88)
	at ai.h2o.sparkling.backend.utils.RestCommunication.update$(RestCommunication.scala:81)
	at ai.h2o.sparkling.backend.utils.RestApiUtils$.update(RestApiUtils.scala:96)
	at ai.h2o.sparkling.backend.utils.H2OContextExtensions.lockCloud(H2OContextExtensions.scala:235)
	at ai.h2o.sparkling.backend.utils.H2OContextExtensions.tryToLockCloud(H2OContextExtensions.scala:122)
	at ai.h2o.sparkling.backend.utils.H2OContextExtensions.getAndVerifyWorkerNodes(H2OContextExtensions.scala:137)
	at ai.h2o.sparkling.backend.utils.H2OContextExtensions.getAndVerifyWorkerNodes$(H2OContextExtensions.scala:131)
	at ai.h2o.sparkling.H2OContext.getAndVerifyWorkerNodes(H2OContext.scala:65)
	at ai.h2o.sparkling.H2OContext.<init>(H2OContext.scala:85)
	at ai.h2o.sparkling.H2OContext$.getOrCreate(H2OContext.scala:470)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.base/java.lang.Thread.run(Unknown Source)
22/05/21 12:54:51 INFO H2OContext: Trying to lock H2O cluster h2o-service.sparkling-water.svc.cluster.local:54321 - root.
Traceback (most recent call last):
  File "/opt/spark/work-dir/main.py", line 157, in <module>
    start(proxy_json_object)
  File "/opt/spark/work-dir/main.py", line 99, in start
    mlExecution=MlExecution(sparkSession)
  File "/opt/spark/work-dir/src/mlExecution/mlExecution.py", line 44, in __init__
    hc = H2OContext.getOrCreate()
  File "/usr/local/lib/python3.9/dist-packages/ai/h2o/sparkling/H2OContext.py", line 89, in getOrCreate
    jhc = module.getOrCreate(selected_conf._jconf)
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o102.getOrCreate.
: ai.h2o.sparkling.backend.exceptions.H2OClusterNotReachableException: H2O cluster h2o-service.sparkling-water.svc.cluster.local:54321 - root is not reachable.
H2OContext has not been created.
	at ai.h2o.sparkling.backend.utils.H2OContextExtensions.getAndVerifyWorkerNodes(H2OContextExtensions.scala:160)
	at ai.h2o.sparkling.backend.utils.H2OContextExtensions.getAndVerifyWorkerNodes$(H2OContextExtensions.scala:131)
	at ai.h2o.sparkling.H2OContext.getAndVerifyWorkerNodes(H2OContext.scala:65)
	at ai.h2o.sparkling.H2OContext.<init>(H2OContext.scala:85)
	at ai.h2o.sparkling.H2OContext$.getOrCreate(H2OContext.scala:470)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: ai.h2o.sparkling.backend.exceptions.RestApiCommunicationException: H2O node http://h2o-service.sparkling-water.svc.cluster.local:54321 responded with
Status code: 403 : Deployment configuration error - request reached a non-leader H2O node.
Server error: <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 403 Deployment configuration error - request reached a non-leader H2O node.</title>
</head>
<body><h2>HTTP ERROR 403</h2>
<p>Problem accessing /3/CloudLock. Reason:
<pre>    Deployment configuration error - request reached a non-leader H2O node.</pre></p>
</body>
</html>

	at ai.h2o.sparkling.backend.utils.RestCommunication.checkResponseCode(RestCommunication.scala:414)
	at ai.h2o.sparkling.backend.utils.RestCommunication.checkResponseCode$(RestCommunication.scala:394)
	at ai.h2o.sparkling.backend.utils.RestApiUtils$.checkResponseCode(RestApiUtils.scala:96)
	at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent(RestCommunication.scala:386)
	at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent$(RestCommunication.scala:370)
	at ai.h2o.sparkling.backend.utils.RestApiUtils$.readURLContent(RestApiUtils.scala:96)
	at ai.h2o.sparkling.backend.utils.RestCommunication.request(RestCommunication.scala:182)
	at ai.h2o.sparkling.backend.utils.RestCommunication.request$(RestCommunication.scala:172)
	at ai.h2o.sparkling.backend.utils.RestApiUtils$.request(RestApiUtils.scala:96)
	at ai.h2o.sparkling.backend.utils.RestCommunication.update(RestCommunication.scala:88)
	at ai.h2o.sparkling.backend.utils.RestCommunication.update$(RestCommunication.scala:81)
	at ai.h2o.sparkling.backend.utils.RestApiUtils$.update(RestApiUtils.scala:96)
	at ai.h2o.sparkling.backend.utils.H2OContextExtensions.lockCloud(H2OContextExtensions.scala:235)
	at ai.h2o.sparkling.backend.utils.H2OContextExtensions.tryToLockCloud(H2OContextExtensions.scala:122)
	at ai.h2o.sparkling.backend.utils.H2OContextExtensions.getAndVerifyWorkerNodes(H2OContextExtensions.scala:137)
	... 15 more

22/05/21 12:54:51 INFO SparkUI: Stopped Spark web UI at http://data-py-1cd60e80e6ab425c-driver-svc.test.svc:4040
22/05/21 12:54:51 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
22/05/21 12:54:51 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
22/05/21 12:54:51 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Disabling executor 2.
22/05/21 12:54:51 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Disabling executor 1.
22/05/21 12:54:51 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
22/05/21 12:54:52 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/05/21 12:54:52 INFO MemoryStore: MemoryStore cleared
22/05/21 12:54:52 INFO BlockManager: BlockManager stopped
22/05/21 12:54:52 INFO BlockManagerMaster: BlockManagerMaster stopped
22/05/21 12:54:52 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/05/21 12:54:52 INFO SparkContext: Successfully stopped SparkContext
22/05/21 12:54:52 INFO ShutdownHookManager: Shutdown hook called
22/05/21 12:54:52 INFO ShutdownHookManager: Deleting directory /var/data/spark-5c7c04c2-dc9c-4bc2-92bc-add84c73bddc/spark-98f8740b-aa36-4405-bbdb-d2fff04b9917/pyspark-5b398e5b-54ed-4e2b-b021-4d34cb9a6796
22/05/21 12:54:52 INFO ShutdownHookManager: Deleting directory /var/data/spark-5c7c04c2-dc9c-4bc2-92bc-add84c73bddc/spark-98f8740b-aa36-4405-bbdb-d2fff04b9917
22/05/21 12:54:52 INFO ShutdownHookManager: Deleting directory /tmp/spark-9df6320f-4e5f-400d-95b6-4d9d083853b4

(KunfuPanda24, May 21 '22 13:05)
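The log in the first post shows two different failures during the cluster-lock step: first `Connection refused` (most likely no H2O pod was accepting connections behind h2o-service at that moment), then HTTP 403 because the request to /3/CloudLock was routed to a non-leader H2O node. Below is a minimal pre-flight sketch in Python, the language of the driver script, that polls the H2O endpoint until it answers 200 before the context is created. The endpoint path `/3/Cloud`, the attempt count, and the helper names are assumptions for illustration, not Sparkling Water APIs; a 403 is treated as "not ready yet" on the assumption that a later request may be routed to the leader.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

# Hypothetical endpoint: any H2O REST URL that should return 200 from the leader node.
H2O_URL = "http://h2o-service.sparkling-water.svc.cluster.local:54321/3/Cloud"


def http_status(url: str, timeout: float) -> int:
    """Return the HTTP status code of a GET request to `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The node answered, but with an error code (e.g. 403 from a non-leader node).
        return err.code


def wait_for_h2o(url: str = H2O_URL, attempts: int = 6, interval_s: float = 10.0,
                 fetch: Callable[[str, float], int] = http_status,
                 sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `url` until it returns 200; give up after `attempts` tries."""
    for attempt in range(attempts):
        try:
            if fetch(url, interval_s) == 200:
                return True
        except OSError:
            # Connection refused, timeouts and DNS failures (URLError is an OSError) land here.
            pass
        if attempt < attempts - 1:
            sleep(interval_s)
    return False
```

In main.py this check would run just before `hc = H2OContext.getOrCreate()`. Note that retrying only papers over the routing problem: if the Service keeps balancing requests across all H2O pods, a request can still land on a non-leader node.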

Hey @gurumoorthy208524, can you please share some details about your environment setup?

(krasinski, May 22 '22 19:05)

@krasinski We are using the manual mode of the external backend on Kubernetes (k8s), with some changes to the StatefulSet as mentioned in this post. Note: we are using preemptive nodes for both Spark and H2O. I have attached the H2O cluster info below. Let me know if you need any other info. Thanks!

H2O_cluster_uptime:         4 hours 9 mins
H2O_cluster_timezone:       Etc/GMT
H2O_data_parsing_timezone:  UTC
H2O_cluster_version:        3.36.0.3
H2O_cluster_version_age:    3 months and 4 days
H2O_cluster_name:           root
H2O_cluster_total_nodes:    5
H2O_cluster_free_memory:    29.48 Gb
H2O_cluster_total_cores:    5
H2O_cluster_allowed_cores:  5
H2O_cluster_status:         locked, healthy
H2O_connection_url:         http://h2o-service.sparkling-water.svc.cluster.local:54321
H2O_connection_proxy:       null
H2O_internal_security:      False
Python_version:             3.9.2 final

(KunfuPanda24, May 22 '22 20:05)
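The manual-mode external backend described in the comment above is driven by `spark.ext.h2o.*` properties. Below is a minimal sketch that renders them as `spark-submit --conf` arguments. The cloud name `root` and the service address are taken from the cluster info and logs in this thread; the helper name is hypothetical, and the property keys should be checked against the Sparkling Water documentation for your exact version.

```python
import shlex

# Values from this thread; property names follow the Sparkling Water
# external-backend documentation (verify them for your SW version).
H2O_CONF = {
    "spark.ext.h2o.backend.cluster.mode": "external",
    "spark.ext.h2o.external.start.mode": "manual",
    "spark.ext.h2o.cloud.name": "root",
    "spark.ext.h2o.cloud.representative": "h2o-service.sparkling-water.svc.cluster.local:54321",
}


def spark_submit_conf_args(conf: dict) -> list:
    """Turn a property dict into a flat list of `--conf key=value` arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Print a shell-safe command line; main.py is the driver script from the traceback.
    print(shlex.join(["spark-submit", *spark_submit_conf_args(H2O_CONF), "main.py"]))
```

Keeping the properties in one dict makes it easy to compare what the driver actually receives with what the H2O cluster reports (for example, the cloud name must match `H2O_cluster_name`).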

Hi @gurumoorthy208524, from the logs you provided, you are trying to connect to the k8s service h2o-service.sparkling-water.svc.cluster.local. Isn't your service named differently (data-py-9a30b280e693f596-driver-svc.test.svc)? If so, shouldn't you set the SW property spark.ext.h2o.cloud.representative to data-py-9a30b280e693f596-driver-svc.test.svc:54321?

(mn-mikke, May 23 '22 14:05)
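Kubernetes gives every Service an in-cluster DNS name of the form `<service>.<namespace>.svc.<cluster-domain>`, where the domain is usually `cluster.local`. Below is a small sketch that builds the `spark.ext.h2o.cloud.representative` value from those parts and splits it back into host and port, which makes a wrong service or namespace easy to spot. The function names are hypothetical; 54321 is H2O's default REST port.

```python
def cloud_representative(service: str, namespace: str, port: int = 54321,
                         cluster_domain: str = "cluster.local") -> str:
    """Build host:port for spark.ext.h2o.cloud.representative from a k8s Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}:{port}"


def split_representative(value: str) -> tuple:
    """Split 'host:port' into (host, port); raise ValueError if the port is missing."""
    host, sep, port = value.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {value!r}")
    return host, int(port)
```

For the service in this thread, `cloud_representative("h2o-service", "sparkling-water")` yields `h2o-service.sparkling-water.svc.cluster.local:54321`, which matches the address in the logs.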

@mn-mikke Sorry, my bad. The value is h2o-service.sparkling-water.svc.cluster.local, and I have set the SW property spark.ext.h2o.cloud.representative="h2o-service.sparkling-water.svc.cluster.local".

(KunfuPanda24, May 23 '22 14:05)

Note: we are using preemptive nodes for both spark & h2o.

@gurumoorthy208524 This is a big source of problems. H2O-3 is not fault-tolerant by design, for performance reasons (all data is kept compressed in memory). An H2O cluster requires a static and stable environment. If one of the H2O nodes dies or is evicted from its K8s node, the whole cluster gets into a corrupted state; such a cluster needs to be restarted, and you will have to start from scratch.

(mn-mikke, May 26 '22 11:05)
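Because an H2O cluster that loses a node cannot recover, as explained in the comment above, it can help to check that the cluster still has all of its members before reusing it. Below is a minimal heuristic sketch, assuming you know how many H2O pods your StatefulSet runs. It parses a `key: value` cluster-info printout like the one posted earlier in this thread and compares `H2O_cluster_total_nodes` and `H2O_cluster_status` against expectations. The helper names and the expected node count are assumptions.

```python
def parse_cluster_info(text: str) -> dict:
    """Parse 'KEY:   value' lines of an H2O cluster-info printout into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so URLs like http://host:54321 stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            info[key.strip()] = value.strip()
    return info


def cluster_looks_intact(info: dict, expected_nodes: int) -> bool:
    """True if the node count matches and the status flags include 'healthy'."""
    try:
        nodes = int(info.get("H2O_cluster_total_nodes", ""))
    except ValueError:
        return False
    # Compare whole flags so that 'unhealthy' is not mistaken for 'healthy'.
    flags = {flag.strip() for flag in info.get("H2O_cluster_status", "").split(",")}
    return nodes == expected_nodes and "healthy" in flags
```

With the printout from this thread and `expected_nodes=5`, the check passes. If it fails, the advice above applies: restart the H2O cluster rather than trying to reuse it.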

@mn-mikke I tried on non-preemptive nodes and am now facing the following issue, which is supposed to be fixed according to this Jira ticket.
Current Spark version: 3.1.2
H2O version: 3.36.0.3-1-3.1

22/05/25 16:23:10 INFO TaskSchedulerImpl: Removed TaskSet 7.0, whose tasks have all completed, from pool 
22/05/25 16:23:10 INFO DAGScheduler: ResultStage 7 (runJob at Writer.scala:99) finished in 551.532 s
22/05/25 16:23:10 INFO DAGScheduler: Job 5 is finished. Cancelling potential speculative or zombie tasks for this job
22/05/25 16:23:10 INFO TaskSchedulerImpl: Killing all running tasks in stage 7: Stage finished
22/05/25 16:23:10 INFO DAGScheduler: Job 5 finished: runJob at Writer.scala:99, took 551.562250 s
22/05/25 16:27:44 INFO ContextHandler: Stopped a.h.o.e.j.s.ServletContextHandler@52346010{/,null,UNAVAILABLE}
22/05/25 16:27:44 INFO AbstractConnector: Stopped ServerConnector@3844adb6{HTTP/1.1,[http/1.1]}{0.0.0.0:54321}
Exception in thread "Thread-30" ai.h2o.sparkling.backend.exceptions.H2OClusterNotReachableException: H2O cluster h2o-service-dummy.sparkling-water-dummy.svc.cluster.local:54321 - root is not reachable,
H2OContext has been closed! Please create a new H2OContext to a healthy and reachable (web enabled)
H2O cluster.
	at ai.h2o.sparkling.H2OContext$$anon$2.run(H2OContext.scala:373)
Caused by: ai.h2o.sparkling.backend.exceptions.RestApiNotReachableException: H2O node http://h2o-service-dummy.sparkling-water-dummy.svc.cluster.local:54321/ is not reachable.
Please verify that you are passing ip and port of existing cluster node and the cluster
is running with web enabled.
	at ai.h2o.sparkling.backend.utils.RestCommunication.throwRestApiNotReachableException(RestCommunication.scala:433)
	at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent(RestCommunication.scala:390)
	at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent$(RestCommunication.scala:370)
	at ai.h2o.sparkling.backend.utils.RestApiUtils$.readURLContent(RestApiUtils.scala:96)
	at ai.h2o.sparkling.backend.utils.RestCommunication.request(RestCommunication.scala:182)
	at ai.h2o.sparkling.backend.utils.RestCommunication.request$(RestCommunication.scala:172)
	at ai.h2o.sparkling.backend.utils.RestApiUtils$.request(RestApiUtils.scala:96)
	at ai.h2o.sparkling.backend.utils.RestCommunication.query(RestCommunication.scala:67)
	at ai.h2o.sparkling.backend.utils.RestCommunication.query$(RestCommunication.scala:59)
	at ai.h2o.sparkling.backend.utils.RestApiUtils$.query(RestApiUtils.scala:96)
	at ai.h2o.sparkling.backend.utils.RestApiUtils.getPingInfo(RestApiUtils.scala:32)
	at ai.h2o.sparkling.backend.utils.RestApiUtils.getPingInfo$(RestApiUtils.scala:30)
	at ai.h2o.sparkling.backend.utils.RestApiUtils$.getPingInfo(RestApiUtils.scala:96)
	at ai.h2o.sparkling.H2OContext.ai$h2o$sparkling$H2OContext$$getSparklingWaterHeartbeatEvent(H2OContext.scala:335)
	at ai.h2o.sparkling.H2OContext$$anon$2.run(H2OContext.scala:347)
Caused by: java.net.UnknownHostException: h2o-service-dummy.sparkling-water-dummy.svc.cluster.local
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
	at java.base/java.lang.reflect.Constructor.newInstance(Unknown Source)
	at java.base/sun.net.www.protocol.http.HttpURLConnection$10.run(Unknown Source)
	at java.base/sun.net.www.protocol.http.HttpURLConnection$10.run(Unknown Source)
	at java.base/java.security.AccessController.doPrivileged(Native Method)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getChainedException(Unknown Source)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown Source)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
	at java.base/java.net.HttpURLConnection.getResponseCode(Unknown Source)
	at ai.h2o.sparkling.backend.utils.RestCommunication.$anonfun$checkResponseCode$1(RestCommunication.scala:398)
	at scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23)
	at scala.util.Try$.apply(Try.scala:213)
	at ai.h2o.sparkling.backend.utils.RestCommunication.retry(RestCommunication.scala:439)
	at ai.h2o.sparkling.backend.utils.RestCommunication.checkResponseCode(RestCommunication.scala:398)
	at ai.h2o.sparkling.backend.utils.RestCommunication.checkResponseCode$(RestCommunication.scala:394)
	at ai.h2o.sparkling.backend.utils.RestApiUtils$.checkResponseCode(RestApiUtils.scala:96)
	at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent(RestCommunication.scala:386)
	... 13 more
Caused by: java.net.UnknownHostException: h2o-service-dummy.sparkling-water-dummy.svc.cluster.local
	at java.base/java.net.AbstractPlainSocketImpl.connect(Unknown Source)
	at java.base/java.net.Socket.connect(Unknown Source)
	at java.base/java.net.Socket.connect(Unknown Source)
	at java.base/sun.net.NetworkClient.doConnect(Unknown Source)
	at java.base/sun.net.www.http.HttpClient.openServer(Unknown Source)
	at java.base/sun.net.www.http.HttpClient.openServer(Unknown Source)
	at java.base/sun.net.www.http.HttpClient.<init>(Unknown Source)
	at java.base/sun.net.www.http.HttpClient.New(Unknown Source)
	at java.base/sun.net.www.http.HttpClient.New(Unknown Source)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(Unknown Source)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect0(Unknown Source)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect(Unknown Source)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.connect(Unknown Source)
	... 24 more
22/05/25 16:29:21 INFO H2OFrame: H2O node http://h2o-service-dummy.sparkling-water-dummy.svc.cluster.local:54321/3/FinalizeFrame successfully responded for the POST.
2022-05-25 16:29:21,043 : ERROR : src.mlExecution.mlExecution : train : An error occurred while calling o106.fit.
: java.lang.RuntimeException: H2OContext has to be running.
	at ai.h2o.sparkling.H2OContext$.$anonfun$ensure$1(H2OContext.scala:416)
	at scala.Option.getOrElse(Option.scala:189)
	at ai.h2o.sparkling.H2OContext$.ensure(H2OContext.scala:416)
	at ai.h2o.sparkling.H2OFrame$.apply(H2OFrame.scala:287)
	at ai.h2o.sparkling.backend.Writer$.convert(Writer.scala:104)
	at ai.h2o.sparkling.backend.converters.SparkDataFrameConverter$.toH2OFrame(SparkDataFrameConverter.scala:60)
	at ai.h2o.sparkling.H2OContext.$anonfun$asH2OFrame$2(H2OContext.scala:167)
	at ai.h2o.sparkling.backend.utils.H2OContextExtensions.withConversionDebugPrints(H2OContextExtensions.scala:86)
	at ai.h2o.sparkling.backend.utils.H2OContextExtensions.withConversionDebugPrints$(H2OContextExtensions.scala:74)
	at ai.h2o.sparkling.H2OContext.withConversionDebugPrints(H2OContext.scala:65)
	at ai.h2o.sparkling.H2OContext.asH2OFrame(H2OContext.scala:167)
	at ai.h2o.sparkling.H2OContext.asH2OFrame(H2OContext.scala:162)
	at ai.h2o.sparkling.ml.algos.H2OAlgoCommonUtils.prepareDatasetForFitting(H2OAlgoCommonUtils.scala:88)
	at ai.h2o.sparkling.ml.algos.H2OAlgoCommonUtils.prepareDatasetForFitting$(H2OAlgoCommonUtils.scala:60)
	at ai.h2o.sparkling.ml.algos.H2OAutoML.prepareDatasetForFitting(H2OAutoML.scala:42)
	at ai.h2o.sparkling.ml.algos.H2OAutoML.fit(H2OAutoML.scala:85)
	at ai.h2o.sparkling.ml.algos.H2OAutoML.fit(H2OAutoML.scala:42)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.base/java.lang.Thread.run(Unknown Source)

2022-05-25 16:29:21,043 : INFO : __main__ : start : AUTO ML STATUS:False
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/usr/local/lib/python3.9/dist-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/usr/local/lib/python3.9/dist-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.9/dist-packages/urllib3/connectionpool.py", line 398, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/local/lib/python3.9/dist-packages/urllib3/connection.py", line 239, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/usr/lib/python3.9/http/client.py", line 1255, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.9/http/client.py", line 1301, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.9/http/client.py", line 1250, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.9/http/client.py", line 1010, in _send_output
    self.send(msg)
  File "/usr/lib/python3.9/http/client.py", line 950, in send
    self.connect()
  File "/usr/local/lib/python3.9/dist-packages/urllib3/connection.py", line 205, in connect
    conn = self._new_conn()
  File "/usr/local/lib/python3.9/dist-packages/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f7d48e87ee0>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/requests/adapters.py", line 440, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.9/dist-packages/urllib3/connectionpool.py", line 785, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.9/dist-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='main-py-5c0b6180fbf30bff-driver-svc.spark.svc', port=54321): Max retries exceeded with url: /4/sessions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7d48e87ee0>: Failed to establish a new connection: [Errno 111] Connection refused'))
```
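The traces above fail in two distinct ways: the JVM side cannot resolve `h2o-service-dummy.sparkling-water-dummy.svc.cluster.local` (`java.net.UnknownHostException`), and the Python side gets `[Errno 111] Connection refused` from the driver service on port 54321. The following is a minimal sketch, using only the Python standard library, that separates those two cases for a given host and port. It assumes it is run from a pod inside the same Kubernetes cluster (for example, the Spark driver pod). The function name `check_endpoint` and the `TARGETS` list are hypothetical helpers for illustration, not part of Sparkling Water.

```python
import socket

# Hypothetical helper list: host/port pairs copied from the traces above.
TARGETS = [
    ("h2o-service-dummy.sparkling-water-dummy.svc.cluster.local", 54321),
    ("main-py-5c0b6180fbf30bff-driver-svc.spark.svc", 54321),
]


def check_endpoint(host, port, timeout=3.0):
    """Return a short status string: DNS failure, TCP failure, or reachable."""
    try:
        # A failure here corresponds to the UnknownHostException in the JVM trace.
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        return f"DNS lookup failed: {exc}"
    address = infos[0][4]  # sockaddr tuple; element 0 is the resolved IP
    try:
        # A failure here corresponds to "[Errno 111] Connection refused".
        with socket.create_connection((host, port), timeout=timeout):
            return f"reachable at {address[0]}:{port}"
    except OSError as exc:
        return f"TCP connect failed: {exc}"


def main():
    # Print one status line per target; intended to be called manually.
    for host, port in TARGETS:
        print(f"{host}:{port} -> {check_endpoint(host, port)}")
```

Calling `main()` from inside the cluster would show whether each name resolves at all and, if it does, whether anything is listening on port 54321. A DNS failure typically points at a Service name or namespace mismatch, while a refused connection points at a process that is not (or no longer) listening.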

KunfuPanda24, May 26 '22 12:05