
H2O node/pod becomes unhealthy

Open RajatSablok opened this issue 2 years ago • 7 comments

Hi team,

Every so often, one of my two H2O pods becomes unhealthy and starts printing logs like the following:

05-20 11:11:05.353 xx.yyy.zzz.ww:54321   1      9.28:54321 ERROR water.default: Got IO error when sending a batch of bytes: 
java.io.IOException: Connection timed out
	at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[?:?]
	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[?:?]
	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:113) ~[?:?]
	at sun.nio.ch.IOUtil.write(IOUtil.java:58) ~[?:?]
	at sun.nio.ch.IOUtil.write(IOUtil.java:50) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:459) ~[?:?]
	at water.H2ONode$SmallMessagesSendThread.sendBuffer(H2ONode.java:609) [h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.run(H2ONode.java:592) [h2o.jar:?]
05-20 11:13:20.521 xx.yyy.zzz.ww:54321   1      9.28:54321 ERROR water.default: Got IO error when sending a batch of bytes: 
java.net.ConnectException: Connection timed out
	at sun.nio.ch.Net.connect0(Native Method) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:483) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:472) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:692) ~[?:?]
	at water.H2ONode.openChan(H2ONode.java:496) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.openChan(H2ONode.java:634) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.sendBuffer(H2ONode.java:608) [h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.run(H2ONode.java:592) [h2o.jar:?]
05-20 11:15:35.689 xx.yyy.zzz.ww:54321   1      9.28:54321 ERROR water.default: Got IO error when sending a batch of bytes: 

Because of this, the next set of jobs connects to only one pod:

22/05/20 12:09:02 INFO H2OContext: Sparkling Water 3.36.0.3-1-3.1 started, status of context: 
Sparkling Water Context:
 * Sparkling Water Version: 3.36.0.3-1-3.1
 * H2O name: 185
 * cluster size: 1
 * list of used nodes:
  (executorId, host, port)
  ------------------------
  (0,xx.yyy.www.aa,54321)
  ------------------------

  Open H2O Flow in browser: http://main-py-50429080e15e8842-driver-svc.default.svc:54321 (CMD + click in Mac OSX)

     
Connecting to H2O server at http://main-py-50429080e15e8842-driver-svc.default.svc:54321 ... successful.
--------------------------  ---------------------------------------------------------
H2O_cluster_uptime:         1 hour 10 mins
H2O_cluster_timezone:       Etc/GMT
H2O_data_parsing_timezone:  UTC
H2O_cluster_version:        3.36.0.3
H2O_cluster_version_age:    3 months and 3 days
H2O_cluster_name:           root
H2O_cluster_total_nodes:    1
H2O_cluster_free_memory:    6.306 Gb
H2O_cluster_total_cores:    1
H2O_cluster_allowed_cores:  1
H2O_cluster_status:         locked, healthy
H2O_connection_url:         http://main-py-50429080e15e8842-driver-svc.default.svc:54321
H2O_connection_proxy:       null
H2O_internal_security:      False
Python_version:             3.9.2 final
--------------------------  ---------------------------------------------------------
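A job could detect this condition before training starts by comparing the reported node count with the expected one. The following is only a minimal sketch, assuming the status table above has been captured as a string; the function names and the `expected_nodes` parameter are hypothetical and not part of H2O or Sparkling Water.

```python
import re


def total_nodes_from_status(status_text: str) -> int:
    """Extract the H2O_cluster_total_nodes value from a printed status table."""
    match = re.search(
        r"^H2O_cluster_total_nodes:\s+(\d+)\s*$", status_text, re.MULTILINE
    )
    if match is None:
        raise ValueError("H2O_cluster_total_nodes not found in status text")
    return int(match.group(1))


def cluster_has_expected_size(status_text: str, expected_nodes: int) -> bool:
    """Return True only when the reported node count matches the expectation."""
    return total_nodes_from_status(status_text) == expected_nodes


# Abbreviated copy of the table above, used only for illustration.
sample_status = """H2O_cluster_version:        3.36.0.3
H2O_cluster_total_nodes:    1
H2O_cluster_free_memory:    6.306 Gb"""

if not cluster_has_expected_size(sample_status, expected_nodes=2):
    print("Cluster is smaller than expected; skipping training.")
```

With the table above and two expected pods, the check fails, so the job can stop early instead of hitting the REST error later.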

As a result, I get this error on my Spark pods:

2022-05-20 12:22:30,184 : CRITICAL : src.mlExecution.mlExecution : train : Error in mlExecution: An error occurred while calling o119.fit.
: ai.h2o.sparkling.backend.exceptions.RestApiNotReachableException: H2O node http://service-name.default.svc.cluster.local:54321 is not reachable.
Please verify that you are passing ip and port of existing cluster node and the cluster
is running with web enabled.
	at ai.h2o.sparkling.backend.utils.RestCommunication.throwRestApiNotReachableException(RestCommunication.scala:433)
	at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent(RestCommunication.scala:390)
	at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent$(RestCommunication.scala:370)
	at ai.h2o.sparkling.backend.H2OJob$.readURLContent(H2OJob.scala:54)
	at ai.h2o.sparkling.backend.utils.RestCommunication.request(RestCommunication.scala:182)
	at ai.h2o.sparkling.backend.utils.RestCommunication.request$(RestCommunication.scala:172)
	at ai.h2o.sparkling.backend.H2OJob$.request(H2OJob.scala:54)
	at ai.h2o.sparkling.backend.utils.RestCommunication.query(RestCommunication.scala:67)
	at ai.h2o.sparkling.backend.utils.RestCommunication.query$(RestCommunication.scala:59)
	at ai.h2o.sparkling.backend.H2OJob$.query(H2OJob.scala:54)
	at ai.h2o.sparkling.backend.H2OJob$.ai$h2o$sparkling$backend$H2OJob$$verifyAndGetJob(H2OJob.scala:63)
	at ai.h2o.sparkling.backend.H2OJob.waitForFinishAndPrintProgress(H2OJob.scala:32)
	at ai.h2o.sparkling.ml.utils.EstimatorCommonUtils.trainAndGetDestinationKey(EstimatorCommonUtils.scala:44)
	at ai.h2o.sparkling.ml.utils.EstimatorCommonUtils.trainAndGetDestinationKey$(EstimatorCommonUtils.scala:30)
	at ai.h2o.sparkling.ml.algos.H2OAutoML.trainAndGetDestinationKey(H2OAutoML.scala:42)
	at ai.h2o.sparkling.ml.algos.H2OAutoML.fit(H2OAutoML.scala:90)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.net.ConnectException: Connection refused (Connection refused)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
	at java.base/java.lang.reflect.Constructor.newInstance(Unknown Source)
	at java.base/sun.net.www.protocol.http.HttpURLConnection$10.run(Unknown Source)
	at java.base/sun.net.www.protocol.http.HttpURLConnection$10.run(Unknown Source)
	at java.base/java.security.AccessController.doPrivileged(Native Method)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getChainedException(Unknown Source)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown Source)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
	at java.base/java.net.HttpURLConnection.getResponseCode(Unknown Source)
	at ai.h2o.sparkling.backend.utils.RestCommunication.$anonfun$checkResponseCode$1(RestCommunication.scala:398)
	at scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23)
	at scala.util.Try$.apply(Try.scala:213)
	at ai.h2o.sparkling.backend.utils.RestCommunication.retry(RestCommunication.scala:439)
	at ai.h2o.sparkling.backend.utils.RestCommunication.checkResponseCode(RestCommunication.scala:398)
	at ai.h2o.sparkling.backend.utils.RestCommunication.checkResponseCode$(RestCommunication.scala:394)
	at ai.h2o.sparkling.backend.H2OJob$.checkResponseCode(H2OJob.scala:54)
	at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent(RestCommunication.scala:386)
	... 25 more
Caused by: java.net.ConnectException: Connection refused (Connection refused)
	at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.base/java.net.AbstractPlainSocketImpl.doConnect(Unknown Source)
	at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(Unknown Source)
	at java.base/java.net.AbstractPlainSocketImpl.connect(Unknown Source)
	at java.base/java.net.Socket.connect(Unknown Source)
	at java.base/java.net.Socket.connect(Unknown Source)
	at java.base/sun.net.NetworkClient.doConnect(Unknown Source)
	at java.base/sun.net.www.http.HttpClient.openServer(Unknown Source)
	at java.base/sun.net.www.http.HttpClient.openServer(Unknown Source)
	at java.base/sun.net.www.http.HttpClient.parseHTTP(Unknown Source)
	... 36 more

h2o.exceptions.H2OServerError: HTTP 502 Bad Gateway:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 502 Bad Gateway</title>
</head>
<body><h2>HTTP ERROR 502</h2>
<p>Problem accessing /4/sessions. Reason:
<pre>    Bad Gateway</pre></p>
</body>
</html>

Can someone help with how to resolve this issue? It's happening very frequently.

RajatSablok May 20 '22 13:05

Hi @RajatSablok, Can you share complete logs from both H2O nodes (pods)?

mn-mikke May 20 '22 14:05

Hi @mn-mikke,

I lost the logs for the above case, but we were able to reproduce this bug with 5 nodes. One pod got evicted; here are the logs of the remaining ones.

1st:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by ai.h2o.xgboost4j.java.NativeLibLoader (file:/opt/h2oai/h2o-3/h2o.jar) to field java.lang.ClassLoader.usr_paths
WARNING: Please consider reporting this to the maintainers of ai.h2o.xgboost4j.java.NativeLibLoader
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
09:27:36.730 [main] WARN  hex.tree.xgboost.util.NativeLibrary - Cannot load library from path lib/linux_64/libxgboost4j_gpu.so
09:27:36.734 [main] WARN  hex.tree.xgboost.util.NativeLibrary - Failed to load library from both native path and jar!
09:27:36.734 [main] INFO  hex.tree.xgboost.util.NativeLibraryLoaderChain - Cannot load library: xgboost4j_gpu (lib/linux_64/libxgboost4j_gpu.so)
09:27:36.788 [main] INFO  hex.tree.xgboost.util.NativeLibrary - Loaded library from lib/linux_64/libxgboost4j_minimal.so (/tmp/libxgboost4j_minimal6093114958104634135.so)
09:27:37.371 [main] INFO  water.k8s.H2OCluster - Starting Kubernetes-related REST API services
09:27:37.442 [main] INFO  water.k8s.H2OCluster - Kubernetes REST API services successfully started.
09:27:37.442 [main] INFO  water.k8s.H2OCluster - Initializing H2O Kubernetes cluster
09:27:37.443 [main] INFO  water.k8s.H2OCluster - Timeout contraint: 180 seconds.
09:27:37.443 [main] INFO  water.k8s.H2OCluster - Cluster size constraint: 2 nodes.
09:27:37.490 [main] INFO  water.k8s.lookup.KubernetesDnsLookup - Timeout for node discovery is set to 180 seconds.
09:27:37.490 [main] INFO  water.k8s.lookup.KubernetesDnsLookup - Desired cluster size is set to 2 nodes.
09:27:37.516 [main] WARN  water.k8s.lookup.KubernetesDnsLookup - DNS name not found [response code 3]
09:27:38.518 [main] WARN  water.k8s.lookup.KubernetesDnsLookup - DNS name not found [response code 3]
09:27:39.519 [main] WARN  water.k8s.lookup.KubernetesDnsLookup - DNS name not found [response code 3]
09:27:40.520 [main] WARN  water.k8s.lookup.KubernetesDnsLookup - DNS name not found [response code 3]
09:27:41.521 [main] WARN  water.k8s.lookup.KubernetesDnsLookup - DNS name not found [response code 3]
09:27:42.531 [main] INFO  water.k8s.lookup.KubernetesDnsLookup - New H2O pod with DNS record 'h2o-stateful-set-0.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local./xx.yyy.zzz.aa' discovered.
09:28:13.561 [main] INFO  water.k8s.lookup.KubernetesDnsLookup - New H2O pod with DNS record 'h2o-stateful-set-1.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local./xx.yyy.zzz.bb' discovered.
09:28:13.561 [main] INFO  water.k8s.lookup.KubernetesDnsLookup - New H2O pod with DNS record 'h2o-stateful-set-2.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local./xx.yyy.zzz.cc' discovered.
09:28:45.670 [main] INFO  water.k8s.lookup.KubernetesDnsLookup - New H2O pod with DNS record 'h2o-stateful-set-3.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local./xx.yyy.zzz.dd' discovered.
09:28:45.670 [main] INFO  water.k8s.lookup.KubernetesDnsLookup - New H2O pod with DNS record 'h2o-stateful-set-4.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local./xx.yyy.zzz.ee' discovered.
09:29:13.742 [main] ERROR water.k8s.lookup.KubernetesDnsLookup - Unknown host for IP Address: h2o-stateful-set-2.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local.
09:29:45.772 [main] INFO  water.k8s.lookup.KubernetesDnsLookup - New H2O pod with DNS record 'h2o-stateful-set-2.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local./xx.yyy.zzz.ff' discovered.
09:30:37.840 [main] INFO  water.k8s.H2OCluster - Using the following pods to form H2O cluster: [xx.yyy.zzz.bb,xx.yyy.zzz.ee,xx.yyy.zzz.aa,xx.yyy.zzz.dd,xx.yyy.zzz.cc,xx.yyy.zzz.ff]
2022-05-22 09:30:38.291:INFO::main: Logging initialized @185635ms to org.eclipse.jetty.util.log.StdErrLog
05-22 09:30:39.016 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Dynamically loaded 'water.k8s.KubernetesEmbeddedConfigProvider' as AbstractEmbeddedH2OConfigProvider.
05-22 09:30:39.017 xx.yyy.zzz.aa:54321   1            main  INFO water.default: ----- H2O started  -----
05-22 09:30:39.017 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Build git branch: rel-zorn
05-22 09:30:39.018 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Build git hash: 717d8bf831d5d6b0decda9c37a2a20de9a491754
05-22 09:30:39.018 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Build git describe: jenkins-3.36.0.2-53-g717d8bf
05-22 09:30:39.019 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Build project version: 3.36.0.3
05-22 09:30:39.019 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Build age: 3 months and 5 days
05-22 09:30:39.019 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Built by: 'jenkins'
05-22 09:30:39.020 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Built on: '2022-02-16 17:51:32'
05-22 09:30:39.020 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Found H2O Core extensions: [StackTraceCollector, XGBoost, KrbStandalone, Infogram]
05-22 09:30:39.021 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Processed H2O arguments: []
05-22 09:30:39.021 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Java availableProcessors: 1
05-22 09:30:39.021 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Java heap totalMemory: 203.0 MB
05-22 09:30:39.022 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Java heap maxMemory: 6.32 GB
05-22 09:30:39.022 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Java version: Java 11.0.14 (from Red Hat, Inc.)
05-22 09:30:39.022 xx.yyy.zzz.aa:54321   1            main  INFO water.default: JVM launch parameters: [-XX:+UseContainerSupport, -XX:MaxRAMPercentage=50]
05-22 09:30:39.023 xx.yyy.zzz.aa:54321   1            main  INFO water.default: JVM process id: 1@h2o-stateful-set-0
05-22 09:30:39.023 xx.yyy.zzz.aa:54321   1            main  INFO water.default: OS version: Linux 5.4.170+ (amd64)
05-22 09:30:39.023 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Machine physical memory: 13.07 GB
05-22 09:30:39.024 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Machine locale: en_US
05-22 09:30:39.024 xx.yyy.zzz.aa:54321   1            main  INFO water.default: X-h2o-cluster-id: 1653211653917
05-22 09:30:39.024 xx.yyy.zzz.aa:54321   1            main  INFO water.default: User name: 'root'
05-22 09:30:39.025 xx.yyy.zzz.aa:54321   1            main  INFO water.default: IPv6 stack selected: false
05-22 09:30:39.025 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Possible IP Address: eth0 (eth0), xx.yyy.zzz.aa
05-22 09:30:39.025 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Possible IP Address: lo (lo), 127.0.0.1
05-22 09:30:39.026 xx.yyy.zzz.aa:54321   1            main  INFO water.default: H2O node running in unencrypted mode.
05-22 09:30:39.027 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Internal communication uses port: 54322
05-22 09:30:39.028 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Listening for HTTP and REST traffic on http://xx.yyy.zzz.aa:54321/
05-22 09:30:39.029 xx.yyy.zzz.aa:54321   1            main  INFO water.default: H2O cloud name: 'root' on /xx.yyy.zzz.aa:54321, static configuration based on -flatfile null
05-22 09:30:39.029 xx.yyy.zzz.aa:54321   1            main  INFO water.default: If you have trouble connecting, try SSH tunneling from your local machine (e.g., via port 55555):
05-22 09:30:39.029 xx.yyy.zzz.aa:54321   1            main  INFO water.default:   1. Open a terminal and run 'ssh -L 55555:localhost:54321 root@xx.yyy.zzz.aa'
05-22 09:30:39.030 xx.yyy.zzz.aa:54321   1            main  INFO water.default:   2. Point your browser to http://localhost:55555
05-22 09:30:40.342 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Kerberos not configured
05-22 09:30:40.342 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Log dir: '/tmp/h2o-root/h2ologs'
05-22 09:30:40.342 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Cur dir: '/'
05-22 09:30:40.348 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Subsystem for distributed import from HTTP/HTTPS successfully initialized
05-22 09:30:40.348 xx.yyy.zzz.aa:54321   1            main  INFO water.default: HDFS subsystem successfully initialized
05-22 09:30:40.350 xx.yyy.zzz.aa:54321   1            main  INFO water.default: S3 subsystem successfully initialized
05-22 09:30:40.359 xx.yyy.zzz.aa:54321   1            main  INFO water.default: GCS subsystem successfully initialized
05-22 09:30:40.359 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Flow dir: '/root/h2oflows'
05-22 09:30:40.371 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Cloud of size 1 formed [h2o-stateful-set-0.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.aa:54321]
05-22 09:30:40.371 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Created cluster of size 1, leader node IP is 'h2o-stateful-set-0.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.aa'
05-22 09:30:40.378 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Registered parsers: [GUESS, ARFF, XLS, SVMLight, AVRO, PARQUET, CSV]
05-22 09:30:40.379 xx.yyy.zzz.aa:54321   1            main  INFO water.default: StackTraceCollector extension initialized
05-22 09:30:40.379 xx.yyy.zzz.aa:54321   1            main  INFO water.default: XGBoost extension initialized
05-22 09:30:40.379 xx.yyy.zzz.aa:54321   1            main  INFO water.default: KrbStandalone extension initialized
05-22 09:30:40.380 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Infogram extension initialized
05-22 09:30:40.380 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Registered 4 core extensions in: 2442ms
05-22 09:30:40.380 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Registered H2O core extensions: [StackTraceCollector, XGBoost, KrbStandalone, Infogram]
05-22 09:30:40.380 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Registered: 1 auth extensions in: 181036ms
05-22 09:30:40.381 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Registered Auth extensions: [water.server.LeaderNodeRequestFilter]
05-22 09:30:40.507 xx.yyy.zzz.aa:54321   1            main  INFO hex.tree.xgboost.XGBoostExtension: Found XGBoost backend with library: xgboost4j_minimal
05-22 09:30:40.507 xx.yyy.zzz.aa:54321   1            main  WARN hex.tree.xgboost.XGBoostExtension: Your system supports only minimal version of XGBoost (no GPUs, no multithreading)!
05-22 09:30:40.599 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Registered: 275 REST APIs in: 218ms
05-22 09:30:40.599 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Registered REST API extensions: [Amazon S3, XGBoost, Algos, Sparkling Water REST API Extensions, Infogram, AutoML, Core V3, TargetEncoder, Core V4]
05-22 09:30:40.721 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Registered: 330 schemas in 121ms
05-22 09:30:40.721 xx.yyy.zzz.aa:54321   1            main  INFO water.default: H2O started in 186791ms
05-22 09:30:40.721 xx.yyy.zzz.aa:54321   1            main  INFO water.default: 
05-22 09:30:40.722 xx.yyy.zzz.aa:54321   1            main  INFO water.default: Open H2O Flow in your web browser: http://xx.yyy.zzz.aa:54321
05-22 09:30:40.722 xx.yyy.zzz.aa:54321   1            main  INFO water.default: 
05-22 09:30:57.427 xx.yyy.zzz.aa:54321   1        FJ-126-1  INFO water.default: Cloud of size 2 formed [h2o-stateful-set-1.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.bb:54321, h2o-stateful-set-0.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.aa:54321]
05-22 09:30:57.427 xx.yyy.zzz.aa:54321   1        FJ-126-1  INFO water.default: Created cluster of size 2, leader node IP is 'h2o-stateful-set-1.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.bb'
05-22 09:31:17.450 xx.yyy.zzz.aa:54321   1        FJ-126-1  INFO water.default: Cloud of size 3 formed [h2o-stateful-set-1.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.bb:54321, h2o-stateful-set-3.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.dd:54321, h2o-stateful-set-0.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.aa:54321]
05-22 09:31:17.451 xx.yyy.zzz.aa:54321   1        FJ-126-1  INFO water.default: Created cluster of size 3, leader node IP is 'h2o-stateful-set-1.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.bb'
05-22 09:31:30.892 xx.yyy.zzz.aa:54321   1        FJ-126-1  INFO water.default: Cloud of size 4 formed [h2o-stateful-set-1.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.bb:54321, h2o-stateful-set-3.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.dd:54321, xx.yyy.zzz.ee/xx.yyy.zzz.ee:54321, h2o-stateful-set-0.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.aa:54321]
05-22 09:31:30.892 xx.yyy.zzz.aa:54321   1        FJ-126-1  INFO water.default: Created cluster of size 4, leader node IP is 'h2o-stateful-set-1.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.bb'
05-22 09:32:27.236 xx.yyy.zzz.aa:54321   1        FJ-126-1  INFO water.default: Cloud of size 5 formed [h2o-stateful-set-1.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.bb:54321, h2o-stateful-set-3.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.dd:54321, xx.yyy.zzz.ee/xx.yyy.zzz.ee:54321, h2o-stateful-set-0.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.aa:54321, h2o-stateful-set-2.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.ff:54321]
05-22 09:32:27.237 xx.yyy.zzz.aa:54321   1        FJ-126-1  INFO water.default: Created cluster of size 5, leader node IP is 'h2o-stateful-set-1.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.bb'
05-22 09:33:45.613 xx.yyy.zzz.aa:54321   1        FJ-123-1  INFO water.default: Locking cloud to new members, because Class Id=52
05-22 09:33:45.614 xx.yyy.zzz.aa:54321   1        FJ-123-1  WARN water.default: Flatfile entry ignored: Node xx.yyy.zzz.cc:54321 not active in this cloud. Removing it from the list.
05-22 12:19:59.816 xx.yyy.zzz.aa:54321   1      9.11:54321 ERROR water.default: Got IO error when sending a batch of bytes: 
java.io.IOException: Connection timed out
	at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[?:?]
	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[?:?]
	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:113) ~[?:?]
	at sun.nio.ch.IOUtil.write(IOUtil.java:58) ~[?:?]
	at sun.nio.ch.IOUtil.write(IOUtil.java:50) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:459) ~[?:?]
	at water.H2ONode$SmallMessagesSendThread.sendBuffer(H2ONode.java:609) [h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.run(H2ONode.java:592) [h2o.jar:?]
05-22 12:22:14.984 xx.yyy.zzz.aa:54321   1      9.11:54321 ERROR water.default: Got IO error when sending a batch of bytes: 
java.net.ConnectException: Connection timed out
	at sun.nio.ch.Net.connect0(Native Method) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:483) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:472) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:692) ~[?:?]
	at water.H2ONode.openChan(H2ONode.java:496) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.openChan(H2ONode.java:634) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.sendBuffer(H2ONode.java:608) [h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.run(H2ONode.java:592) [h2o.jar:?]
05-22 12:24:30.152 xx.yyy.zzz.aa:54321   1      9.11:54321 ERROR water.default: Got IO error when sending a batch of bytes: 
java.net.ConnectException: Connection timed out
	at sun.nio.ch.Net.connect0(Native Method) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:483) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:472) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:692) ~[?:?]
	at water.H2ONode.openChan(H2ONode.java:496) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.openChan(H2ONode.java:634) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.sendBuffer(H2ONode.java:608) [h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.run(H2ONode.java:592) [h2o.jar:?]
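
To compare how the cloud grew on each node, the "Cloud of size N formed" lines can be pulled out of a log like the one above. This is a rough sketch, assuming the log is available as a list of lines in the format shown; the helper name `cloud_sizes` is hypothetical and the sample lines are abbreviated copies of the log.

```python
import re

# Matches the timestamp prefix and the cloud size in H2O log lines such as
# "05-22 09:30:40.371 ... INFO water.default: Cloud of size 1 formed [...]".
CLOUD_RE = re.compile(
    r"^(\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) .*Cloud of size (\d+) formed"
)


def cloud_sizes(log_lines):
    """Return (timestamp, size) pairs for every 'Cloud of size N formed' line."""
    sizes = []
    for line in log_lines:
        match = CLOUD_RE.match(line)
        if match:
            sizes.append((match.group(1), int(match.group(2))))
    return sizes


# Abbreviated lines from the first pod's log, used only for illustration.
sample_log = [
    "05-22 09:30:40.359 xx.yyy.zzz.aa:54321   1   main  INFO water.default: Flow dir: '/root/h2oflows'",
    "05-22 09:30:40.371 xx.yyy.zzz.aa:54321   1   main  INFO water.default: Cloud of size 1 formed [...]",
    "05-22 09:30:57.427 xx.yyy.zzz.aa:54321   1   FJ-126-1  INFO water.default: Cloud of size 2 formed [...]",
]

for timestamp, size in cloud_sizes(sample_log):
    print(timestamp, size)
```

Running the same helper over each pod's log makes it easier to see when a node joined the cloud and whether all pods agree on the final size.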

2nd:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by ai.h2o.xgboost4j.java.NativeLibLoader (file:/opt/h2oai/h2o-3/h2o.jar) to field java.lang.ClassLoader.usr_paths
WARNING: Please consider reporting this to the maintainers of ai.h2o.xgboost4j.java.NativeLibLoader
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
09:29:22.455 [main] WARN  hex.tree.xgboost.util.NativeLibrary - Cannot load library from path lib/linux_64/libxgboost4j_gpu.so
09:29:22.458 [main] WARN  hex.tree.xgboost.util.NativeLibrary - Failed to load library from both native path and jar!
09:29:22.459 [main] INFO  hex.tree.xgboost.util.NativeLibraryLoaderChain - Cannot load library: xgboost4j_gpu (lib/linux_64/libxgboost4j_gpu.so)
09:29:22.483 [main] INFO  hex.tree.xgboost.util.NativeLibrary - Loaded library from lib/linux_64/libxgboost4j_minimal.so (/tmp/libxgboost4j_minimal14627765795297121317.so)
09:29:22.695 [main] INFO  water.k8s.H2OCluster - Starting Kubernetes-related REST API services
09:29:22.715 [main] INFO  water.k8s.H2OCluster - Kubernetes REST API services successfully started.
09:29:22.715 [main] INFO  water.k8s.H2OCluster - Initializing H2O Kubernetes cluster
09:29:22.716 [main] INFO  water.k8s.H2OCluster - Timeout contraint: 180 seconds.
09:29:22.716 [main] INFO  water.k8s.H2OCluster - Cluster size constraint: 2 nodes.
09:29:22.741 [main] INFO  water.k8s.lookup.KubernetesDnsLookup - Timeout for node discovery is set to 180 seconds.
09:29:22.742 [main] INFO  water.k8s.lookup.KubernetesDnsLookup - Desired cluster size is set to 2 nodes.
09:29:22.765 [main] INFO  water.k8s.lookup.KubernetesDnsLookup - New H2O pod with DNS record 'h2o-stateful-set-1.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local./xx.yyy.zzz.bb' discovered.
09:29:22.766 [main] INFO  water.k8s.lookup.KubernetesDnsLookup - New H2O pod with DNS record 'h2o-stateful-set-0.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local./xx.yyy.zzz.aa' discovered.
09:29:22.767 [main] INFO  water.k8s.lookup.KubernetesDnsLookup - New H2O pod with DNS record 'h2o-stateful-set-4.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local./xx.yyy.zzz.ee' discovered.
09:29:22.767 [main] INFO  water.k8s.lookup.KubernetesDnsLookup - New H2O pod with DNS record 'h2o-stateful-set-3.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local./xx.yyy.zzz.dd' discovered.
09:29:45.791 [main] INFO  water.k8s.lookup.KubernetesDnsLookup - New H2O pod with DNS record 'h2o-stateful-set-2.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local./xx.yyy.zzz.ff' discovered.
09:32:22.971 [main] INFO  water.k8s.H2OCluster - Using the following pods to form H2O cluster: [xx.yyy.zzz.bb,xx.yyy.zzz.ee,xx.yyy.zzz.aa,xx.yyy.zzz.dd,xx.yyy.zzz.ff]
2022-05-22 09:32:23.023:INFO::main: Logging initialized @181999ms to org.eclipse.jetty.util.log.StdErrLog
05-22 09:32:23.180 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Dynamically loaded 'water.k8s.KubernetesEmbeddedConfigProvider' as AbstractEmbeddedH2OConfigProvider.
05-22 09:32:23.181 xx.yyy.zzz.ff:54321   1            main  INFO water.default: ----- H2O started  -----
05-22 09:32:23.181 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Build git branch: rel-zorn
05-22 09:32:23.182 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Build git hash: 717d8bf831d5d6b0decda9c37a2a20de9a491754
05-22 09:32:23.182 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Build git describe: jenkins-3.36.0.2-53-g717d8bf
05-22 09:32:23.183 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Build project version: 3.36.0.3
05-22 09:32:23.183 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Build age: 3 months and 5 days
05-22 09:32:23.183 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Built by: 'jenkins'
05-22 09:32:23.184 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Built on: '2022-02-16 17:51:32'
05-22 09:32:23.184 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Found H2O Core extensions: [StackTraceCollector, XGBoost, KrbStandalone, Infogram]
05-22 09:32:23.184 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Processed H2O arguments: []
05-22 09:32:23.185 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Java availableProcessors: 1
05-22 09:32:23.185 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Java heap totalMemory: 203.0 MB
05-22 09:32:23.185 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Java heap maxMemory: 6.32 GB
05-22 09:32:23.186 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Java version: Java 11.0.14 (from Red Hat, Inc.)
05-22 09:32:23.186 xx.yyy.zzz.ff:54321   1            main  INFO water.default: JVM launch parameters: [-XX:+UseContainerSupport, -XX:MaxRAMPercentage=50]
05-22 09:32:23.186 xx.yyy.zzz.ff:54321   1            main  INFO water.default: JVM process id: 1@h2o-stateful-set-2
05-22 09:32:23.187 xx.yyy.zzz.ff:54321   1            main  INFO water.default: OS version: Linux 5.4.170+ (amd64)
05-22 09:32:23.187 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Machine physical memory: 13.07 GB
05-22 09:32:23.187 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Machine locale: en_US
05-22 09:32:23.188 xx.yyy.zzz.ff:54321   1            main  INFO water.default: X-h2o-cluster-id: 1653211761175
05-22 09:32:23.188 xx.yyy.zzz.ff:54321   1            main  INFO water.default: User name: 'root'
05-22 09:32:23.188 xx.yyy.zzz.ff:54321   1            main  INFO water.default: IPv6 stack selected: false
05-22 09:32:23.189 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Possible IP Address: eth0 (eth0), xx.yyy.zzz.ff
05-22 09:32:23.189 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Possible IP Address: lo (lo), 127.0.0.1
05-22 09:32:23.189 xx.yyy.zzz.ff:54321   1            main  INFO water.default: H2O node running in unencrypted mode.
05-22 09:32:23.191 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Internal communication uses port: 54322
05-22 09:32:23.191 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Listening for HTTP and REST traffic on http://xx.yyy.zzz.ff:54321/
05-22 09:32:23.192 xx.yyy.zzz.ff:54321   1            main  INFO water.default: H2O cloud name: 'root' on /xx.yyy.zzz.ff:54321, static configuration based on -flatfile null
05-22 09:32:23.192 xx.yyy.zzz.ff:54321   1            main  INFO water.default: If you have trouble connecting, try SSH tunneling from your local machine (e.g., via port 55555):
05-22 09:32:23.193 xx.yyy.zzz.ff:54321   1            main  INFO water.default:   1. Open a terminal and run 'ssh -L 55555:localhost:54321 root@xx.yyy.zzz.ff'
05-22 09:32:23.193 xx.yyy.zzz.ff:54321   1            main  INFO water.default:   2. Point your browser to http://localhost:55555
05-22 09:32:23.734 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Kerberos not configured
05-22 09:32:23.734 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Log dir: '/tmp/h2o-root/h2ologs'
05-22 09:32:23.734 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Cur dir: '/'
05-22 09:32:23.739 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Subsystem for distributed import from HTTP/HTTPS successfully initialized
05-22 09:32:23.740 xx.yyy.zzz.ff:54321   1            main  INFO water.default: HDFS subsystem successfully initialized
05-22 09:32:23.742 xx.yyy.zzz.ff:54321   1            main  INFO water.default: S3 subsystem successfully initialized
05-22 09:32:23.755 xx.yyy.zzz.ff:54321   1            main  INFO water.default: GCS subsystem successfully initialized
05-22 09:32:23.755 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Flow dir: '/root/h2oflows'
05-22 09:32:23.765 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Cloud of size 1 formed [h2o-stateful-set-2.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.ff:54321]
05-22 09:32:23.766 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Created cluster of size 1, leader node IP is 'h2o-stateful-set-2.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.ff'
05-22 09:32:23.773 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Registered parsers: [GUESS, ARFF, XLS, SVMLight, AVRO, PARQUET, CSV]
05-22 09:32:23.774 xx.yyy.zzz.ff:54321   1            main  INFO water.default: StackTraceCollector extension initialized
05-22 09:32:23.775 xx.yyy.zzz.ff:54321   1            main  INFO water.default: XGBoost extension initialized
05-22 09:32:23.775 xx.yyy.zzz.ff:54321   1            main  INFO water.default: KrbStandalone extension initialized
05-22 09:32:23.775 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Infogram extension initialized
05-22 09:32:23.776 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Registered 4 core extensions in: 1248ms
05-22 09:32:23.776 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Registered H2O core extensions: [StackTraceCollector, XGBoost, KrbStandalone, Infogram]
05-22 09:32:23.776 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Registered: 1 auth extensions in: 180488ms
05-22 09:32:23.776 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Registered Auth extensions: [water.server.LeaderNodeRequestFilter]
05-22 09:32:23.894 xx.yyy.zzz.ff:54321   1            main  INFO hex.tree.xgboost.XGBoostExtension: Found XGBoost backend with library: xgboost4j_minimal
05-22 09:32:23.894 xx.yyy.zzz.ff:54321   1            main  WARN hex.tree.xgboost.XGBoostExtension: Your system supports only minimal version of XGBoost (no GPUs, no multithreading)!
05-22 09:32:23.975 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Registered: 275 REST APIs in: 199ms
05-22 09:32:23.975 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Registered REST API extensions: [Amazon S3, XGBoost, Algos, Sparkling Water REST API Extensions, Infogram, AutoML, Core V3, TargetEncoder, Core V4]
05-22 09:32:24.073 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Registered: 330 schemas in 97ms
05-22 09:32:24.073 xx.yyy.zzz.ff:54321   1            main  INFO water.default: H2O started in 182892ms
05-22 09:32:24.074 xx.yyy.zzz.ff:54321   1            main  INFO water.default: 
05-22 09:32:24.074 xx.yyy.zzz.ff:54321   1            main  INFO water.default: Open H2O Flow in your web browser: http://xx.yyy.zzz.ff:54321
05-22 09:32:24.074 xx.yyy.zzz.ff:54321   1            main  INFO water.default: 
05-22 09:32:27.236 xx.yyy.zzz.ff:54321   1        FJ-126-1  INFO water.default: Cloud of size 5 formed [h2o-stateful-set-1.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.bb:54321, xx.yyy.zzz.dd/xx.yyy.zzz.dd:54321, xx.yyy.zzz.ee/xx.yyy.zzz.ee:54321, xx.yyy.zzz.aa/xx.yyy.zzz.aa:54321, h2o-stateful-set-2.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.ff:54321]
05-22 09:32:27.237 xx.yyy.zzz.ff:54321   1        FJ-126-1  INFO water.default: Created cluster of size 5, leader node IP is 'h2o-stateful-set-1.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.bb'
05-22 09:33:45.699 xx.yyy.zzz.ff:54321   1        FJ-123-1  INFO water.default: Locking cloud to new members, because Class Id=52
05-22 12:04:10.128 xx.yyy.zzz.ff:54321   1      9.11:54321 ERROR water.default: Got IO error when sending a batch of bytes: 
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[?:?]
	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[?:?]
	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:113) ~[?:?]
	at sun.nio.ch.IOUtil.write(IOUtil.java:58) ~[?:?]
	at sun.nio.ch.IOUtil.write(IOUtil.java:50) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:459) ~[?:?]
	at water.H2ONode$SmallMessagesSendThread.sendBuffer(H2ONode.java:609) [h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.run(H2ONode.java:592) [h2o.jar:?]
05-22 12:06:24.730 xx.yyy.zzz.ff:54321   1      9.11:54321 ERROR water.default: Got IO error when sending a batch of bytes: 
java.net.ConnectException: Connection timed out
	at sun.nio.ch.Net.connect0(Native Method) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:483) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:472) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:692) ~[?:?]
	at water.H2ONode.openChan(H2ONode.java:496) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.openChan(H2ONode.java:634) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.sendBuffer(H2ONode.java:608) [h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.run(H2ONode.java:592) [h2o.jar:?]
05-22 12:08:39.880 xx.yyy.zzz.ff:54321   1      9.11:54321 ERROR water.default: Got IO error when sending a batch of bytes: 
java.net.ConnectException: Connection timed out
	at sun.nio.ch.Net.connect0(Native Method) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:483) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:472) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:692) ~[?:?]
	at water.H2ONode.openChan(H2ONode.java:496) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.openChan(H2ONode.java:634) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.sendBuffer(H2ONode.java:608) [h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.run(H2ONode.java:592) [h2o.jar:?]
05-22 12:10:55.048 xx.yyy.zzz.ff:54321   1      9.11:54321 ERROR water.default: Got IO error when sending a batch of bytes: 
java.net.ConnectException: Connection timed out
	at sun.nio.ch.Net.connect0(Native Method) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:483) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:472) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:692) ~[?:?]
	at water.H2ONode.openChan(H2ONode.java:496) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.openChan(H2ONode.java:634) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.sendBuffer(H2ONode.java:608) [h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.run(H2ONode.java:592) [h2o.jar:?]
05-22 12:13:10.216 xx.yyy.zzz.ff:54321   1      9.11:54321 ERROR water.default: Got IO error when sending a batch of bytes: 
java.net.ConnectException: Connection timed out
	at sun.nio.ch.Net.connect0(Native Method) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:483) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:472) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:692) ~[?:?]
	at water.H2ONode.openChan(H2ONode.java:496) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.openChan(H2ONode.java:634) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.sendBuffer(H2ONode.java:608) [h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.run(H2ONode.java:592) [h2o.jar:?]
05-22 12:15:25.385 xx.yyy.zzz.ff:54321   1      9.11:54321 ERROR water.default: Got IO error when sending a batch of bytes: 
java.net.ConnectException: Connection timed out
	at sun.nio.ch.Net.connect0(Native Method) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:483) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:472) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:692) ~[?:?]
	at water.H2ONode.openChan(H2ONode.java:496) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.openChan(H2ONode.java:634) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.sendBuffer(H2ONode.java:608) [h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.run(H2ONode.java:592) [h2o.jar:?]
05-22 12:17:40.552 xx.yyy.zzz.ff:54321   1      9.11:54321 ERROR water.default: Got IO error when sending a batch of bytes: 
java.net.ConnectException: Connection timed out
	at sun.nio.ch.Net.connect0(Native Method) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:483) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:472) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:692) ~[?:?]
	at water.H2ONode.openChan(H2ONode.java:496) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.openChan(H2ONode.java:634) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.sendBuffer(H2ONode.java:608) [h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.run(H2ONode.java:592) [h2o.jar:?]
05-22 12:19:55.720 xx.yyy.zzz.ff:54321   1      9.11:54321 ERROR water.default: Got IO error when sending a batch of bytes: 
java.net.ConnectException: Connection timed out
	at sun.nio.ch.Net.connect0(Native Method) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:483) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:472) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:692) ~[?:?]
	at water.H2ONode.openChan(H2ONode.java:496) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.openChan(H2ONode.java:634) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.sendBuffer(H2ONode.java:608) [h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.run(H2ONode.java:592) [h2o.jar:?]
05-22 12:22:10.888 xx.yyy.zzz.ff:54321   1      9.11:54321 ERROR water.default: Got IO error when sending a batch of bytes: 
java.net.ConnectException: Connection timed out
	at sun.nio.ch.Net.connect0(Native Method) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:483) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:472) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:692) ~[?:?]
	at water.H2ONode.openChan(H2ONode.java:496) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.openChan(H2ONode.java:634) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.sendBuffer(H2ONode.java:608) [h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.run(H2ONode.java:592) [h2o.jar:?]
05-22 12:24:26.056 xx.yyy.zzz.ff:54321   1      9.11:54321 ERROR water.default: Got IO error when sending a batch of bytes: 
java.net.ConnectException: Connection timed out
	at sun.nio.ch.Net.connect0(Native Method) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:483) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:472) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:692) ~[?:?]
	at water.H2ONode.openChan(H2ONode.java:496) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.openChan(H2ONode.java:634) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.sendBuffer(H2ONode.java:608) [h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.run(H2ONode.java:592) [h2o.jar:?]
05-22 12:26:41.224 xx.yyy.zzz.ff:54321   1      9.11:54321 ERROR water.default: Got IO error when sending a batch of bytes: 
java.net.ConnectException: Connection timed out
	at sun.nio.ch.Net.connect0(Native Method) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:483) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:472) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:692) ~[?:?]
	at water.H2ONode.openChan(H2ONode.java:496) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.openChan(H2ONode.java:634) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.sendBuffer(H2ONode.java:608) [h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.run(H2ONode.java:592) [h2o.jar:?]

3rd:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by ai.h2o.xgboost4j.java.NativeLibLoader (file:/opt/h2oai/h2o-3/h2o.jar) to field java.lang.ClassLoader.usr_paths
WARNING: Please consider reporting this to the maintainers of ai.h2o.xgboost4j.java.NativeLibLoader
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
09:28:12.424 [main] WARN  hex.tree.xgboost.util.NativeLibrary - Cannot load library from path lib/linux_64/libxgboost4j_gpu.so
09:28:12.426 [main] WARN  hex.tree.xgboost.util.NativeLibrary - Failed to load library from both native path and jar!
09:28:12.427 [main] INFO  hex.tree.xgboost.util.NativeLibraryLoaderChain - Cannot load library: xgboost4j_gpu (lib/linux_64/libxgboost4j_gpu.so)
09:28:12.450 [main] INFO  hex.tree.xgboost.util.NativeLibrary - Loaded library from lib/linux_64/libxgboost4j_minimal.so (/tmp/libxgboost4j_minimal15626617736704354506.so)
09:28:12.647 [main] INFO  water.k8s.H2OCluster - Starting Kubernetes-related REST API services
09:28:12.664 [main] INFO  water.k8s.H2OCluster - Kubernetes REST API services successfully started.
09:28:12.664 [main] INFO  water.k8s.H2OCluster - Initializing H2O Kubernetes cluster
09:28:12.664 [main] INFO  water.k8s.H2OCluster - Timeout contraint: 180 seconds.
09:28:12.664 [main] INFO  water.k8s.H2OCluster - Cluster size constraint: 2 nodes.
09:28:12.678 [main] INFO  water.k8s.lookup.KubernetesDnsLookup - Timeout for node discovery is set to 180 seconds.
09:28:12.678 [main] INFO  water.k8s.lookup.KubernetesDnsLookup - Desired cluster size is set to 2 nodes.
09:28:12.693 [main] INFO  water.k8s.lookup.KubernetesDnsLookup - New H2O pod with DNS record 'h2o-stateful-set-2.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local./xx.yyy.zzz.cc' discovered.
09:28:12.694 [main] INFO  water.k8s.lookup.KubernetesDnsLookup - New H2O pod with DNS record 'h2o-stateful-set-0.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local./xx.yyy.zzz.aa' discovered.
09:28:12.694 [main] INFO  water.k8s.lookup.KubernetesDnsLookup - New H2O pod with DNS record 'h2o-stateful-set-1.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local./xx.yyy.zzz.bb' discovered.
09:28:44.721 [main] INFO  water.k8s.lookup.KubernetesDnsLookup - New H2O pod with DNS record 'h2o-stateful-set-3.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local./xx.yyy.zzz.dd' discovered.
09:28:44.721 [main] INFO  water.k8s.lookup.KubernetesDnsLookup - New H2O pod with DNS record 'h2o-stateful-set-4.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local./xx.yyy.zzz.ee' discovered.
09:29:12.756 [main] ERROR water.k8s.lookup.KubernetesDnsLookup - Unknown host for IP Address: h2o-stateful-set-2.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local.
09:29:42.780 [main] INFO  water.k8s.lookup.KubernetesDnsLookup - New H2O pod with DNS record 'h2o-stateful-set-2.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local./xx.yyy.zzz.ff' discovered.
09:31:12.870 [main] INFO  water.k8s.H2OCluster - Using the following pods to form H2O cluster: [xx.yyy.zzz.bb,xx.yyy.zzz.ee,xx.yyy.zzz.aa,xx.yyy.zzz.dd,xx.yyy.zzz.cc,xx.yyy.zzz.ff]
2022-05-22 09:31:12.925:INFO::main: Logging initialized @181784ms to org.eclipse.jetty.util.log.StdErrLog
05-22 09:31:13.087 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Dynamically loaded 'water.k8s.KubernetesEmbeddedConfigProvider' as AbstractEmbeddedH2OConfigProvider.
05-22 09:31:13.088 xx.yyy.zzz.dd:54321   1            main  INFO water.default: ----- H2O started  -----
05-22 09:31:13.088 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Build git branch: rel-zorn
05-22 09:31:13.088 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Build git hash: 717d8bf831d5d6b0decda9c37a2a20de9a491754
05-22 09:31:13.088 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Build git describe: jenkins-3.36.0.2-53-g717d8bf
05-22 09:31:13.089 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Build project version: 3.36.0.3
05-22 09:31:13.089 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Build age: 3 months and 5 days
05-22 09:31:13.089 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Built by: 'jenkins'
05-22 09:31:13.089 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Built on: '2022-02-16 17:51:32'
05-22 09:31:13.090 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Found H2O Core extensions: [StackTraceCollector, XGBoost, KrbStandalone, Infogram]
05-22 09:31:13.090 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Processed H2O arguments: []
05-22 09:31:13.090 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Java availableProcessors: 1
05-22 09:31:13.091 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Java heap totalMemory: 203.0 MB
05-22 09:31:13.091 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Java heap maxMemory: 6.32 GB
05-22 09:31:13.091 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Java version: Java 11.0.14 (from Red Hat, Inc.)
05-22 09:31:13.091 xx.yyy.zzz.dd:54321   1            main  INFO water.default: JVM launch parameters: [-XX:+UseContainerSupport, -XX:MaxRAMPercentage=50]
05-22 09:31:13.092 xx.yyy.zzz.dd:54321   1            main  INFO water.default: JVM process id: 1@h2o-stateful-set-3
05-22 09:31:13.092 xx.yyy.zzz.dd:54321   1            main  INFO water.default: OS version: Linux 5.4.170+ (amd64)
05-22 09:31:13.092 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Machine physical memory: 13.07 GB
05-22 09:31:13.092 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Machine locale: en_US
05-22 09:31:13.093 xx.yyy.zzz.dd:54321   1            main  INFO water.default: X-h2o-cluster-id: 1653211691269
05-22 09:31:13.093 xx.yyy.zzz.dd:54321   1            main  INFO water.default: User name: 'root'
05-22 09:31:13.093 xx.yyy.zzz.dd:54321   1            main  INFO water.default: IPv6 stack selected: false
05-22 09:31:13.093 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Possible IP Address: eth0 (eth0), xx.yyy.zzz.dd
05-22 09:31:13.093 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Possible IP Address: lo (lo), 127.0.0.1
05-22 09:31:13.094 xx.yyy.zzz.dd:54321   1            main  INFO water.default: H2O node running in unencrypted mode.
05-22 09:31:13.095 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Internal communication uses port: 54322
05-22 09:31:13.095 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Listening for HTTP and REST traffic on http://xx.yyy.zzz.dd:54321/
05-22 09:31:13.096 xx.yyy.zzz.dd:54321   1            main  INFO water.default: H2O cloud name: 'root' on /xx.yyy.zzz.dd:54321, static configuration based on -flatfile null
05-22 09:31:13.096 xx.yyy.zzz.dd:54321   1            main  INFO water.default: If you have trouble connecting, try SSH tunneling from your local machine (e.g., via port 55555):
05-22 09:31:13.096 xx.yyy.zzz.dd:54321   1            main  INFO water.default:   1. Open a terminal and run 'ssh -L 55555:localhost:54321 root@xx.yyy.zzz.dd'
05-22 09:31:13.097 xx.yyy.zzz.dd:54321   1            main  INFO water.default:   2. Point your browser to http://localhost:55555
05-22 09:31:13.613 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Kerberos not configured
05-22 09:31:13.613 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Log dir: '/tmp/h2o-root/h2ologs'
05-22 09:31:13.613 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Cur dir: '/'
05-22 09:31:13.618 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Subsystem for distributed import from HTTP/HTTPS successfully initialized
05-22 09:31:13.619 xx.yyy.zzz.dd:54321   1            main  INFO water.default: HDFS subsystem successfully initialized
05-22 09:31:13.621 xx.yyy.zzz.dd:54321   1            main  INFO water.default: S3 subsystem successfully initialized
05-22 09:31:13.629 xx.yyy.zzz.dd:54321   1            main  INFO water.default: GCS subsystem successfully initialized
05-22 09:31:13.630 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Flow dir: '/root/h2oflows'
05-22 09:31:13.637 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Cloud of size 1 formed [h2o-stateful-set-3.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.dd:54321]
05-22 09:31:13.637 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Created cluster of size 1, leader node IP is 'h2o-stateful-set-3.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.dd'
05-22 09:31:13.645 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Registered parsers: [GUESS, ARFF, XLS, SVMLight, AVRO, PARQUET, CSV]
05-22 09:31:13.647 xx.yyy.zzz.dd:54321   1            main  INFO water.default: StackTraceCollector extension initialized
05-22 09:31:13.647 xx.yyy.zzz.dd:54321   1            main  INFO water.default: XGBoost extension initialized
05-22 09:31:13.648 xx.yyy.zzz.dd:54321   1            main  INFO water.default: KrbStandalone extension initialized
05-22 09:31:13.648 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Infogram extension initialized
05-22 09:31:13.648 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Registered 4 core extensions in: 1129ms
05-22 09:31:13.649 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Registered H2O core extensions: [StackTraceCollector, XGBoost, KrbStandalone, Infogram]
05-22 09:31:13.649 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Registered: 1 auth extensions in: 180420ms
05-22 09:31:13.649 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Registered Auth extensions: [water.server.LeaderNodeRequestFilter]
05-22 09:31:13.774 xx.yyy.zzz.dd:54321   1            main  INFO hex.tree.xgboost.XGBoostExtension: Found XGBoost backend with library: xgboost4j_minimal
05-22 09:31:13.774 xx.yyy.zzz.dd:54321   1            main  WARN hex.tree.xgboost.XGBoostExtension: Your system supports only minimal version of XGBoost (no GPUs, no multithreading)!
05-22 09:31:13.851 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Registered: 275 REST APIs in: 202ms
05-22 09:31:13.851 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Registered REST API extensions: [Amazon S3, XGBoost, Algos, Sparkling Water REST API Extensions, Infogram, AutoML, Core V3, TargetEncoder, Core V4]
05-22 09:31:13.942 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Registered: 330 schemas in 90ms
05-22 09:31:13.942 xx.yyy.zzz.dd:54321   1            main  INFO water.default: H2O started in 182669ms
05-22 09:31:13.942 xx.yyy.zzz.dd:54321   1            main  INFO water.default: 
05-22 09:31:13.943 xx.yyy.zzz.dd:54321   1            main  INFO water.default: Open H2O Flow in your web browser: http://xx.yyy.zzz.dd:54321
05-22 09:31:13.943 xx.yyy.zzz.dd:54321   1            main  INFO water.default: 
05-22 09:31:17.450 xx.yyy.zzz.dd:54321   1        FJ-126-1  INFO water.default: Cloud of size 3 formed [h2o-stateful-set-1.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.bb:54321, h2o-stateful-set-3.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.dd:54321, h2o-stateful-set-0.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.aa:54321]
05-22 09:31:17.451 xx.yyy.zzz.dd:54321   1        FJ-126-1  INFO water.default: Created cluster of size 3, leader node IP is 'h2o-stateful-set-1.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.bb'
05-22 09:31:30.892 xx.yyy.zzz.dd:54321   1        FJ-126-1  INFO water.default: Cloud of size 4 formed [h2o-stateful-set-1.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.bb:54321, h2o-stateful-set-3.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.dd:54321, xx.yyy.zzz.ee/xx.yyy.zzz.ee:54321, h2o-stateful-set-0.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.aa:54321]
05-22 09:31:30.892 xx.yyy.zzz.dd:54321   1        FJ-126-1  INFO water.default: Created cluster of size 4, leader node IP is 'h2o-stateful-set-1.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.bb'
05-22 09:32:27.049 xx.yyy.zzz.dd:54321   1        FJ-126-1  INFO water.default: Cloud of size 5 formed [h2o-stateful-set-1.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.bb:54321, h2o-stateful-set-3.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.dd:54321, xx.yyy.zzz.ee/xx.yyy.zzz.ee:54321, h2o-stateful-set-0.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.aa:54321, h2o-stateful-set-2.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.ff:54321]
05-22 09:32:27.049 xx.yyy.zzz.dd:54321   1        FJ-126-1  INFO water.default: Created cluster of size 5, leader node IP is 'h2o-stateful-set-1.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.bb'
05-22 09:33:45.507 xx.yyy.zzz.dd:54321   1        FJ-123-1  INFO water.default: Locking cloud to new members, because Class Id=52
05-22 09:33:45.507 xx.yyy.zzz.dd:54321   1        FJ-123-1  WARN water.default: Flatfile entry ignored: Node xx.yyy.zzz.cc:54321 not active in this cloud. Removing it from the list.
05-22 12:20:01.832 xx.yyy.zzz.dd:54321   1      9.11:54321 ERROR water.default: Got IO error when sending a batch of bytes: 
java.io.IOException: Connection timed out
	at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[?:?]
	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[?:?]
	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:113) ~[?:?]
	at sun.nio.ch.IOUtil.write(IOUtil.java:58) ~[?:?]
	at sun.nio.ch.IOUtil.write(IOUtil.java:50) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:459) ~[?:?]
	at water.H2ONode$SmallMessagesSendThread.sendBuffer(H2ONode.java:609) [h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.run(H2ONode.java:592) [h2o.jar:?]
05-22 12:22:17.000 xx.yyy.zzz.dd:54321   1      9.11:54321 ERROR water.default: Got IO error when sending a batch of bytes: 
java.net.ConnectException: Connection timed out
	at sun.nio.ch.Net.connect0(Native Method) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:483) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:472) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:692) ~[?:?]
	at water.H2ONode.openChan(H2ONode.java:496) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.openChan(H2ONode.java:634) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.sendBuffer(H2ONode.java:608) [h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.run(H2ONode.java:592) [h2o.jar:?]
05-22 12:24:32.168 xx.yyy.zzz.dd:54321   1      9.11:54321 ERROR water.default: Got IO error when sending a batch of bytes: 
java.net.ConnectException: Connection timed out
	at sun.nio.ch.Net.connect0(Native Method) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:483) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:472) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:692) ~[?:?]
	at water.H2ONode.openChan(H2ONode.java:496) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.openChan(H2ONode.java:634) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.sendBuffer(H2ONode.java:608) [h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.run(H2ONode.java:592) [h2o.jar:?]
05-22 12:26:47.336 xx.yyy.zzz.dd:54321   1      9.11:54321 ERROR water.default: Got IO error when sending a batch of bytes: 
java.net.ConnectException: Connection timed out
	at sun.nio.ch.Net.connect0(Native Method) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:483) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:472) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:692) ~[?:?]
	at water.H2ONode.openChan(H2ONode.java:496) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.openChan(H2ONode.java:634) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.sendBuffer(H2ONode.java:608) [h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.run(H2ONode.java:592) [h2o.jar:?]

4th:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by ai.h2o.xgboost4j.java.NativeLibLoader (file:/opt/h2oai/h2o-3/h2o.jar) to field java.lang.ClassLoader.usr_paths
WARNING: Please consider reporting this to the maintainers of ai.h2o.xgboost4j.java.NativeLibLoader
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
09:28:23.260 [main] WARN  hex.tree.xgboost.util.NativeLibrary - Cannot load library from path lib/linux_64/libxgboost4j_gpu.so
09:28:23.263 [main] WARN  hex.tree.xgboost.util.NativeLibrary - Failed to load library from both native path and jar!
09:28:23.263 [main] INFO  hex.tree.xgboost.util.NativeLibraryLoaderChain - Cannot load library: xgboost4j_gpu (lib/linux_64/libxgboost4j_gpu.so)
09:28:23.287 [main] INFO  hex.tree.xgboost.util.NativeLibrary - Loaded library from lib/linux_64/libxgboost4j_minimal.so (/tmp/libxgboost4j_minimal2731353376936633374.so)
09:28:23.486 [main] INFO  water.k8s.H2OCluster - Starting Kubernetes-related REST API services
09:28:23.503 [main] INFO  water.k8s.H2OCluster - Kubernetes REST API services successfully started.
09:28:23.503 [main] INFO  water.k8s.H2OCluster - Initializing H2O Kubernetes cluster
09:28:23.504 [main] INFO  water.k8s.H2OCluster - Timeout contraint: 180 seconds.
09:28:23.504 [main] INFO  water.k8s.H2OCluster - Cluster size constraint: 2 nodes.
09:28:23.518 [main] INFO  water.k8s.lookup.KubernetesDnsLookup - Timeout for node discovery is set to 180 seconds.
09:28:23.518 [main] INFO  water.k8s.lookup.KubernetesDnsLookup - Desired cluster size is set to 2 nodes.
09:28:23.533 [main] INFO  water.k8s.lookup.KubernetesDnsLookup - New H2O pod with DNS record 'h2o-stateful-set-1.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local./xx.yyy.zzz.bb' discovered.
09:28:23.533 [main] INFO  water.k8s.lookup.KubernetesDnsLookup - New H2O pod with DNS record 'h2o-stateful-set-2.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local./xx.yyy.zzz.cc' discovered.
09:28:23.534 [main] INFO  water.k8s.lookup.KubernetesDnsLookup - New H2O pod with DNS record 'h2o-stateful-set-0.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local./xx.yyy.zzz.aa' discovered.
09:28:44.551 [main] INFO  water.k8s.lookup.KubernetesDnsLookup - New H2O pod with DNS record 'h2o-stateful-set-3.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local./xx.yyy.zzz.dd' discovered.
09:28:44.552 [main] INFO  water.k8s.lookup.KubernetesDnsLookup - New H2O pod with DNS record 'h2o-stateful-set-4.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local./xx.yyy.zzz.ee' discovered.
09:29:45.610 [main] INFO  water.k8s.lookup.KubernetesDnsLookup - New H2O pod with DNS record 'h2o-stateful-set-2.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local./xx.yyy.zzz.ff' discovered.
09:31:23.711 [main] INFO  water.k8s.H2OCluster - Using the following pods to form H2O cluster: [xx.yyy.zzz.bb,xx.yyy.zzz.ee,xx.yyy.zzz.aa,xx.yyy.zzz.dd,xx.yyy.zzz.cc,xx.yyy.zzz.ff]
2022-05-22 09:31:23.761:INFO::main: Logging initialized @181797ms to org.eclipse.jetty.util.log.StdErrLog
05-22 09:31:23.904 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Dynamically loaded 'water.k8s.KubernetesEmbeddedConfigProvider' as AbstractEmbeddedH2OConfigProvider.
05-22 09:31:23.904 xx.yyy.zzz.ee:54321   1            main  INFO water.default: ----- H2O started  -----
05-22 09:31:23.905 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Build git branch: rel-zorn
05-22 09:31:23.905 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Build git hash: 717d8bf831d5d6b0decda9c37a2a20de9a491754
05-22 09:31:23.905 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Build git describe: jenkins-3.36.0.2-53-g717d8bf
05-22 09:31:23.905 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Build project version: 3.36.0.3
05-22 09:31:23.906 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Build age: 3 months and 5 days
05-22 09:31:23.906 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Built by: 'jenkins'
05-22 09:31:23.906 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Built on: '2022-02-16 17:51:32'
05-22 09:31:23.907 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Found H2O Core extensions: [StackTraceCollector, XGBoost, KrbStandalone, Infogram]
05-22 09:31:23.907 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Processed H2O arguments: []
05-22 09:31:23.907 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Java availableProcessors: 1
05-22 09:31:23.907 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Java heap totalMemory: 203.0 MB
05-22 09:31:23.908 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Java heap maxMemory: 6.32 GB
05-22 09:31:23.908 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Java version: Java 11.0.14 (from Red Hat, Inc.)
05-22 09:31:23.908 xx.yyy.zzz.ee:54321   1            main  INFO water.default: JVM launch parameters: [-XX:+UseContainerSupport, -XX:MaxRAMPercentage=50]
05-22 09:31:23.909 xx.yyy.zzz.ee:54321   1            main  INFO water.default: JVM process id: 1@h2o-stateful-set-4
05-22 09:31:23.909 xx.yyy.zzz.ee:54321   1            main  INFO water.default: OS version: Linux 5.4.170+ (amd64)
05-22 09:31:23.909 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Machine physical memory: 13.07 GB
05-22 09:31:23.909 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Machine locale: en_US
05-22 09:31:23.910 xx.yyy.zzz.ee:54321   1            main  INFO water.default: X-h2o-cluster-id: 1653211702094
05-22 09:31:23.910 xx.yyy.zzz.ee:54321   1            main  INFO water.default: User name: 'root'
05-22 09:31:23.910 xx.yyy.zzz.ee:54321   1            main  INFO water.default: IPv6 stack selected: false
05-22 09:31:23.910 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Possible IP Address: eth0 (eth0), xx.yyy.zzz.ee
05-22 09:31:23.911 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Possible IP Address: lo (lo), 127.0.0.1
05-22 09:31:23.911 xx.yyy.zzz.ee:54321   1            main  INFO water.default: H2O node running in unencrypted mode.
05-22 09:31:23.912 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Internal communication uses port: 54322
05-22 09:31:23.912 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Listening for HTTP and REST traffic on http://xx.yyy.zzz.ee:54321/
05-22 09:31:23.913 xx.yyy.zzz.ee:54321   1            main  INFO water.default: H2O cloud name: 'root' on /xx.yyy.zzz.ee:54321, static configuration based on -flatfile null
05-22 09:31:23.913 xx.yyy.zzz.ee:54321   1            main  INFO water.default: If you have trouble connecting, try SSH tunneling from your local machine (e.g., via port 55555):
05-22 09:31:23.913 xx.yyy.zzz.ee:54321   1            main  INFO water.default:   1. Open a terminal and run 'ssh -L 55555:localhost:54321 root@xx.yyy.zzz.ee'
05-22 09:31:23.914 xx.yyy.zzz.ee:54321   1            main  INFO water.default:   2. Point your browser to http://localhost:55555
05-22 09:31:24.429 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Kerberos not configured
05-22 09:31:24.429 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Log dir: '/tmp/h2o-root/h2ologs'
05-22 09:31:24.430 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Cur dir: '/'
05-22 09:31:24.434 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Subsystem for distributed import from HTTP/HTTPS successfully initialized
05-22 09:31:24.435 xx.yyy.zzz.ee:54321   1            main  INFO water.default: HDFS subsystem successfully initialized
05-22 09:31:24.437 xx.yyy.zzz.ee:54321   1            main  INFO water.default: S3 subsystem successfully initialized
05-22 09:31:24.446 xx.yyy.zzz.ee:54321   1            main  INFO water.default: GCS subsystem successfully initialized
05-22 09:31:24.446 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Flow dir: '/root/h2oflows'
05-22 09:31:24.453 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Cloud of size 1 formed [h2o-stateful-set-4.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.ee:54321]
05-22 09:31:24.453 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Created cluster of size 1, leader node IP is 'h2o-stateful-set-4.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.ee'
05-22 09:31:24.461 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Registered parsers: [GUESS, ARFF, XLS, SVMLight, AVRO, PARQUET, CSV]
05-22 09:31:24.462 xx.yyy.zzz.ee:54321   1            main  INFO water.default: StackTraceCollector extension initialized
05-22 09:31:24.462 xx.yyy.zzz.ee:54321   1            main  INFO water.default: XGBoost extension initialized
05-22 09:31:24.462 xx.yyy.zzz.ee:54321   1            main  INFO water.default: KrbStandalone extension initialized
05-22 09:31:24.463 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Infogram extension initialized
05-22 09:31:24.463 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Registered 4 core extensions in: 1137ms
05-22 09:31:24.463 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Registered H2O core extensions: [StackTraceCollector, XGBoost, KrbStandalone, Infogram]
05-22 09:31:24.463 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Registered: 1 auth extensions in: 180424ms
05-22 09:31:24.464 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Registered Auth extensions: [water.server.LeaderNodeRequestFilter]
05-22 09:31:24.579 xx.yyy.zzz.ee:54321   1            main  INFO hex.tree.xgboost.XGBoostExtension: Found XGBoost backend with library: xgboost4j_minimal
05-22 09:31:24.579 xx.yyy.zzz.ee:54321   1            main  WARN hex.tree.xgboost.XGBoostExtension: Your system supports only minimal version of XGBoost (no GPUs, no multithreading)!
05-22 09:31:24.659 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Registered: 275 REST APIs in: 195ms
05-22 09:31:24.659 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Registered REST API extensions: [Amazon S3, XGBoost, Algos, Sparkling Water REST API Extensions, Infogram, AutoML, Core V3, TargetEncoder, Core V4]
05-22 09:31:24.740 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Registered: 330 schemas in 81ms
05-22 09:31:24.740 xx.yyy.zzz.ee:54321   1            main  INFO water.default: H2O started in 182641ms
05-22 09:31:24.741 xx.yyy.zzz.ee:54321   1            main  INFO water.default: 
05-22 09:31:24.741 xx.yyy.zzz.ee:54321   1            main  INFO water.default: Open H2O Flow in your web browser: http://xx.yyy.zzz.ee:54321
05-22 09:31:24.741 xx.yyy.zzz.ee:54321   1            main  INFO water.default: 
05-22 09:31:30.892 xx.yyy.zzz.ee:54321   1        FJ-126-1  INFO water.default: Cloud of size 4 formed [h2o-stateful-set-1.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.bb:54321, h2o-stateful-set-3.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.dd:54321, h2o-stateful-set-4.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.ee:54321, xx.yyy.zzz.aa/xx.yyy.zzz.aa:54321]
05-22 09:31:30.893 xx.yyy.zzz.ee:54321   1        FJ-126-1  INFO water.default: Created cluster of size 4, leader node IP is 'h2o-stateful-set-1.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.bb'
05-22 09:32:27.236 xx.yyy.zzz.ee:54321   1        FJ-126-1  INFO water.default: Cloud of size 5 formed [h2o-stateful-set-1.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.bb:54321, h2o-stateful-set-3.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.dd:54321, h2o-stateful-set-4.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.ee:54321, xx.yyy.zzz.aa/xx.yyy.zzz.aa:54321, h2o-stateful-set-2.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.ff:54321]
05-22 09:32:27.237 xx.yyy.zzz.ee:54321   1        FJ-126-1  INFO water.default: Created cluster of size 5, leader node IP is 'h2o-stateful-set-1.h2o-service-dummy.sparkling-water-dummy.svc.cluster.local/xx.yyy.zzz.bb'
05-22 09:33:45.507 xx.yyy.zzz.ee:54321   1        FJ-123-1  INFO water.default: Locking cloud to new members, because Class Id=52
05-22 09:33:45.507 xx.yyy.zzz.ee:54321   1        FJ-123-1  WARN water.default: Flatfile entry ignored: Node xx.yyy.zzz.cc:54321 not active in this cloud. Removing it from the list.
05-22 12:20:01.832 xx.yyy.zzz.ee:54321   1      9.11:54321 ERROR water.default: Got IO error when sending a batch of bytes: 
java.io.IOException: Connection timed out
	at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[?:?]
	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[?:?]
	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:113) ~[?:?]
	at sun.nio.ch.IOUtil.write(IOUtil.java:58) ~[?:?]
	at sun.nio.ch.IOUtil.write(IOUtil.java:50) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:459) ~[?:?]
	at water.H2ONode$SmallMessagesSendThread.sendBuffer(H2ONode.java:609) [h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.run(H2ONode.java:592) [h2o.jar:?]
05-22 12:22:17.000 xx.yyy.zzz.ee:54321   1      9.11:54321 ERROR water.default: Got IO error when sending a batch of bytes: 
java.net.ConnectException: Connection timed out
	at sun.nio.ch.Net.connect0(Native Method) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:483) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:472) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:692) ~[?:?]
	at water.H2ONode.openChan(H2ONode.java:496) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.openChan(H2ONode.java:634) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.sendBuffer(H2ONode.java:608) [h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.run(H2ONode.java:592) [h2o.jar:?]
05-22 12:24:32.168 xx.yyy.zzz.ee:54321   1      9.11:54321 ERROR water.default: Got IO error when sending a batch of bytes: 
java.net.ConnectException: Connection timed out
	at sun.nio.ch.Net.connect0(Native Method) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:483) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:472) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:692) ~[?:?]
	at water.H2ONode.openChan(H2ONode.java:496) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.openChan(H2ONode.java:634) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.sendBuffer(H2ONode.java:608) [h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.run(H2ONode.java:592) [h2o.jar:?]
05-22 12:26:47.336 xx.yyy.zzz.ee:54321   1      9.11:54321 ERROR water.default: Got IO error when sending a batch of bytes: 
java.net.ConnectException: Connection timed out
	at sun.nio.ch.Net.connect0(Native Method) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:483) ~[?:?]
	at sun.nio.ch.Net.connect(Net.java:472) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:692) ~[?:?]
	at water.H2ONode.openChan(H2ONode.java:496) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.openChan(H2ONode.java:634) ~[h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.sendBuffer(H2ONode.java:608) [h2o.jar:?]
	at water.H2ONode$SmallMessagesSendThread.run(H2ONode.java:592) [h2o.jar:?]
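
The log above reports six discovered pod IPs, a final cloud of size 5, and one flatfile entry dropped as "not active". A minimal sketch for extracting that membership history from a saved log file follows, assuming the `Cloud of size N formed [...]` line format shown above. The function names `cloud_history` and `missing_members` are hypothetical helpers, not part of H2O.

```python
import re

# Matches H2O log lines such as "... Cloud of size 4 formed [host/ip:port, ...]".
CLOUD_RE = re.compile(r"Cloud of size (\d+) formed \[(.*)\]")


def cloud_history(log_lines):
    """Return a list of (size, members) for every 'Cloud of size N formed' line."""
    history = []
    for line in log_lines:
        match = CLOUD_RE.search(line)
        if match is None:
            continue
        # Each member is printed as "hostname/ip:port"; keep only the ip:port part.
        members = {m.rsplit("/", 1)[-1] for m in match.group(2).split(", ")}
        history.append((int(match.group(1)), members))
    return history


def missing_members(discovered, history):
    """Return discovered ip:port entries that are absent from the last cloud."""
    if not history:
        return set(discovered)
    return set(discovered) - history[-1][1]
```

Feeding it the pod log (for example `cloud_history(open("h2o-pod.log"))`, where the file name is a placeholder) together with the IPs from the "Using the following pods to form H2O cluster" line should show which discovered pods never joined.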

RajatSablok avatar May 23 '22 14:05 RajatSablok

@RajatSablok the most interesting logs would be those from the node that died :) Also, I would again suggest giving the nodes more cores in general.

krasinski avatar May 25 '22 19:05 krasinski

OK @krasinski

We are also facing another issue. Our H2O node never died or became unhealthy, but according to the following logs, Spark could not communicate with the H2O cluster. Apparently, this was fixed in this JIRA issue, but we are still getting the error.

22/05/25 16:23:10 INFO TaskSchedulerImpl: Removed TaskSet 7.0, whose tasks have all completed, from pool 
22/05/25 16:23:10 INFO DAGScheduler: ResultStage 7 (runJob at Writer.scala:99) finished in 551.532 s
22/05/25 16:23:10 INFO DAGScheduler: Job 5 is finished. Cancelling potential speculative or zombie tasks for this job
22/05/25 16:23:10 INFO TaskSchedulerImpl: Killing all running tasks in stage 7: Stage finished
22/05/25 16:23:10 INFO DAGScheduler: Job 5 finished: runJob at Writer.scala:99, took 551.562250 s
22/05/25 16:27:44 INFO ContextHandler: Stopped a.h.o.e.j.s.ServletContextHandler@52346010{/,null,UNAVAILABLE}
22/05/25 16:27:44 INFO AbstractConnector: Stopped ServerConnector@3844adb6{HTTP/1.1,[http/1.1]}{0.0.0.0:54321}
Exception in thread "Thread-30" ai.h2o.sparkling.backend.exceptions.H2OClusterNotReachableException: H2O cluster h2o-service-dummy.sparkling-water-dummy.svc.cluster.local:54321 - root is not reachable,
H2OContext has been closed! Please create a new H2OContext to a healthy and reachable (web enabled)
H2O cluster.
	at ai.h2o.sparkling.H2OContext$$anon$2.run(H2OContext.scala:373)
Caused by: ai.h2o.sparkling.backend.exceptions.RestApiNotReachableException: H2O node http://h2o-service-dummy.sparkling-water-dummy.svc.cluster.local:54321/ is not reachable.
Please verify that you are passing ip and port of existing cluster node and the cluster
is running with web enabled.
	at ai.h2o.sparkling.backend.utils.RestCommunication.throwRestApiNotReachableException(RestCommunication.scala:433)
	at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent(RestCommunication.scala:390)
	at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent$(RestCommunication.scala:370)
	at ai.h2o.sparkling.backend.utils.RestApiUtils$.readURLContent(RestApiUtils.scala:96)
	at ai.h2o.sparkling.backend.utils.RestCommunication.request(RestCommunication.scala:182)
	at ai.h2o.sparkling.backend.utils.RestCommunication.request$(RestCommunication.scala:172)
	at ai.h2o.sparkling.backend.utils.RestApiUtils$.request(RestApiUtils.scala:96)
	at ai.h2o.sparkling.backend.utils.RestCommunication.query(RestCommunication.scala:67)
	at ai.h2o.sparkling.backend.utils.RestCommunication.query$(RestCommunication.scala:59)
	at ai.h2o.sparkling.backend.utils.RestApiUtils$.query(RestApiUtils.scala:96)
	at ai.h2o.sparkling.backend.utils.RestApiUtils.getPingInfo(RestApiUtils.scala:32)
	at ai.h2o.sparkling.backend.utils.RestApiUtils.getPingInfo$(RestApiUtils.scala:30)
	at ai.h2o.sparkling.backend.utils.RestApiUtils$.getPingInfo(RestApiUtils.scala:96)
	at ai.h2o.sparkling.H2OContext.ai$h2o$sparkling$H2OContext$$getSparklingWaterHeartbeatEvent(H2OContext.scala:335)
	at ai.h2o.sparkling.H2OContext$$anon$2.run(H2OContext.scala:347)
Caused by: java.net.UnknownHostException: h2o-service-dummy.sparkling-water-dummy.svc.cluster.local
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
	at java.base/java.lang.reflect.Constructor.newInstance(Unknown Source)
	at java.base/sun.net.www.protocol.http.HttpURLConnection$10.run(Unknown Source)
	at java.base/sun.net.www.protocol.http.HttpURLConnection$10.run(Unknown Source)
	at java.base/java.security.AccessController.doPrivileged(Native Method)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getChainedException(Unknown Source)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown Source)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
	at java.base/java.net.HttpURLConnection.getResponseCode(Unknown Source)
	at ai.h2o.sparkling.backend.utils.RestCommunication.$anonfun$checkResponseCode$1(RestCommunication.scala:398)
	at scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23)
	at scala.util.Try$.apply(Try.scala:213)
	at ai.h2o.sparkling.backend.utils.RestCommunication.retry(RestCommunication.scala:439)
	at ai.h2o.sparkling.backend.utils.RestCommunication.checkResponseCode(RestCommunication.scala:398)
	at ai.h2o.sparkling.backend.utils.RestCommunication.checkResponseCode$(RestCommunication.scala:394)
	at ai.h2o.sparkling.backend.utils.RestApiUtils$.checkResponseCode(RestApiUtils.scala:96)
	at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent(RestCommunication.scala:386)
	... 13 more
Caused by: java.net.UnknownHostException: h2o-service-dummy.sparkling-water-dummy.svc.cluster.local
	at java.base/java.net.AbstractPlainSocketImpl.connect(Unknown Source)
	at java.base/java.net.Socket.connect(Unknown Source)
	at java.base/java.net.Socket.connect(Unknown Source)
	at java.base/sun.net.NetworkClient.doConnect(Unknown Source)
	at java.base/sun.net.www.http.HttpClient.openServer(Unknown Source)
	at java.base/sun.net.www.http.HttpClient.openServer(Unknown Source)
	at java.base/sun.net.www.http.HttpClient.<init>(Unknown Source)
	at java.base/sun.net.www.http.HttpClient.New(Unknown Source)
	at java.base/sun.net.www.http.HttpClient.New(Unknown Source)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(Unknown Source)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect0(Unknown Source)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect(Unknown Source)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.connect(Unknown Source)
	... 24 more
22/05/25 16:29:21 INFO H2OFrame: H2O node http://h2o-service-dummy.sparkling-water-dummy.svc.cluster.local:54321/3/FinalizeFrame successfully responded for the POST.
2022-05-25 16:29:21,043 : ERROR : src.mlExecution.mlExecution : train : An error occurred while calling o106.fit.
: java.lang.RuntimeException: H2OContext has to be running.
	at ai.h2o.sparkling.H2OContext$.$anonfun$ensure$1(H2OContext.scala:416)
	at scala.Option.getOrElse(Option.scala:189)
	at ai.h2o.sparkling.H2OContext$.ensure(H2OContext.scala:416)
	at ai.h2o.sparkling.H2OFrame$.apply(H2OFrame.scala:287)
	at ai.h2o.sparkling.backend.Writer$.convert(Writer.scala:104)
	at ai.h2o.sparkling.backend.converters.SparkDataFrameConverter$.toH2OFrame(SparkDataFrameConverter.scala:60)
	at ai.h2o.sparkling.H2OContext.$anonfun$asH2OFrame$2(H2OContext.scala:167)
	at ai.h2o.sparkling.backend.utils.H2OContextExtensions.withConversionDebugPrints(H2OContextExtensions.scala:86)
	at ai.h2o.sparkling.backend.utils.H2OContextExtensions.withConversionDebugPrints$(H2OContextExtensions.scala:74)
	at ai.h2o.sparkling.H2OContext.withConversionDebugPrints(H2OContext.scala:65)
	at ai.h2o.sparkling.H2OContext.asH2OFrame(H2OContext.scala:167)
	at ai.h2o.sparkling.H2OContext.asH2OFrame(H2OContext.scala:162)
	at ai.h2o.sparkling.ml.algos.H2OAlgoCommonUtils.prepareDatasetForFitting(H2OAlgoCommonUtils.scala:88)
	at ai.h2o.sparkling.ml.algos.H2OAlgoCommonUtils.prepareDatasetForFitting$(H2OAlgoCommonUtils.scala:60)
	at ai.h2o.sparkling.ml.algos.H2OAutoML.prepareDatasetForFitting(H2OAutoML.scala:42)
	at ai.h2o.sparkling.ml.algos.H2OAutoML.fit(H2OAutoML.scala:85)
	at ai.h2o.sparkling.ml.algos.H2OAutoML.fit(H2OAutoML.scala:42)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.base/java.lang.Thread.run(Unknown Source)

2022-05-25 16:29:21,043 : INFO : __main__ : start : AUTO ML STATUS:False
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/usr/local/lib/python3.9/dist-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/usr/local/lib/python3.9/dist-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.9/dist-packages/urllib3/connectionpool.py", line 398, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/local/lib/python3.9/dist-packages/urllib3/connection.py", line 239, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/usr/lib/python3.9/http/client.py", line 1255, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.9/http/client.py", line 1301, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.9/http/client.py", line 1250, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.9/http/client.py", line 1010, in _send_output
    self.send(msg)
  File "/usr/lib/python3.9/http/client.py", line 950, in send
    self.connect()
  File "/usr/local/lib/python3.9/dist-packages/urllib3/connection.py", line 205, in connect
    conn = self._new_conn()
  File "/usr/local/lib/python3.9/dist-packages/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f7d48e87ee0>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/requests/adapters.py", line 440, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.9/dist-packages/urllib3/connectionpool.py", line 785, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.9/dist-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='main-py-5c0b6180fbf30bff-driver-svc.spark.svc', port=54321): Max retries exceeded with url: /4/sessions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7d48e87ee0>: Failed to establish a new connection: [Errno 111] Connection refused'))
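
The innermost cause in the Java trace above is `java.net.UnknownHostException` for the service DNS name, so one thing worth checking from the driver pod is whether that name resolves at all. A minimal sketch, assuming Python is available there: `resolve_with_retry` is a hypothetical helper, and the retry count and delay are arbitrary. Since the service is headless and normally publishes records only for ready pods, the name may stop resolving while no pod passes the readiness probe.

```python
import socket
import time


def resolve_with_retry(host, port=54321, attempts=5, delay=2.0,
                       resolver=socket.getaddrinfo):
    """Resolve host to a sorted list of IPs, retrying on DNS failure.

    Returns an empty list if every attempt fails. The resolver argument
    exists only so the function can be exercised without real DNS.
    """
    for attempt in range(1, attempts + 1):
        try:
            infos = resolver(host, port, proto=socket.IPPROTO_TCP)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the IP address.
            return sorted({info[4][0] for info in infos})
        except socket.gaierror as err:
            print(f"attempt {attempt}/{attempts}: cannot resolve {host}: {err}")
            if attempt < attempts:
                time.sleep(delay)
    return []
```

Calling `resolve_with_retry("h2o-service-dummy.sparkling-water-dummy.svc.cluster.local")` from the driver while a job runs would show whether the lookup fails intermittently or permanently.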

These are the versions that we're using:

Sparkling Water version: 3.36.0.3-1-3.1
Spark version: 3.1.2
Integrated H2O version: 3.36.0.3

RajatSablok avatar May 25 '22 19:05 RajatSablok

@RajatSablok do you have the logs from the node that died?

About the second issue you mentioned: that isn't necessarily the same issue. How is your cluster health? Could the nodes be too busy because of a lack of resources?
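
One way to answer the cluster-health question is to poll the H2O REST endpoint `/3/Cloud` while a job is running. The sketch below assumes the response carries `cloud_size`, `cloud_healthy` and a `nodes` list whose entries have `ip_port` and `healthy` fields. Those field names are recalled from the H2O 3 REST API and should be verified against the version in use. `fetch_cloud` and `summarize_cloud` are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def fetch_cloud(base_url, timeout=10):
    """GET <base_url>/3/Cloud and return the decoded JSON document."""
    with urlopen(f"{base_url.rstrip('/')}/3/Cloud", timeout=timeout) as resp:
        return json.load(resp)


def summarize_cloud(cloud, expected_size):
    """Condense a /3/Cloud response into a small health report."""
    nodes = cloud.get("nodes", [])
    # Treat a node without a 'healthy' flag as unhealthy (an assumption).
    unhealthy = [n.get("ip_port") for n in nodes if not n.get("healthy", False)]
    return {
        "size": cloud.get("cloud_size"),
        "size_ok": cloud.get("cloud_size") == expected_size,
        "cloud_healthy": cloud.get("cloud_healthy"),
        "unhealthy_nodes": unhealthy,
    }
```

For the deployment in this thread, something like `summarize_cloud(fetch_cloud("http://h2o-service-dummy.sparkling-water-dummy.svc.cluster.local:54321"), expected_size=2)` could be run periodically from inside the cluster.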

Ideally, to look into the issue we would need a description of how to reproduce it.

krasinski avatar May 25 '22 21:05 krasinski

@krasinski We used the following yml files for our deployment:

Stateful Set:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: h2o-stateful-set
  namespace: sparkling-water-dummy
spec:
  serviceName: h2o-service-dummy
  replicas: 2
  selector:
    matchLabels:
      app: h2o-k8s
  template:
    metadata:
      labels:
        app: h2o-k8s
    spec:
      terminationGracePeriodSeconds: 10
      containers:
        - name: h2o-k8s
          image: 'h2oai/sparkling-water-external-backend:3.36.0.3-1-3.1'
          resources:
            requests:
              memory: "2Gi"
          ports:
            - containerPort: 54321
              protocol: TCP
          readinessProbe:
            httpGet:
              path: /kubernetes/isLeaderNode
              port: 8081
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 1
          env:
          - name: H2O_KUBERNETES_SERVICE_DNS
            value: h2o-service-dummy.sparkling-water-dummy.svc.cluster.local
          - name: H2O_NODE_LOOKUP_TIMEOUT
            value: '180'
          - name: H2O_NODE_EXPECTED_COUNT
            value: '2'
          - name: H2O_KUBERNETES_API_PORT
            value: '8081'

Service:

apiVersion: v1
kind: Service
metadata:
  name: h2o-service-dummy
spec:
  type: ClusterIP
  clusterIP: None
  selector:
    app: h2o-k8s
  ports:
  - protocol: TCP
    port: 54321

Can you please tell me what other information/config you'll need to reproduce it? @krasinski

Thanks in advance!

RajatSablok avatar May 25 '22 21:05 RajatSablok

Hi @RajatSablok, can you share the events for the namespace where you created the H2O StatefulSet? This should tell us whether the H2O node died on its own or was evicted by k8s.

kubectl get events -n sparkling-water-dummy -o wide --sort-by=.metadata.creationTimestamp
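
If the event list is long, it can be narrowed down to the entries that usually accompany a pod disappearing. The following is a minimal sketch that reads the same command's output saved with `-o json` instead of `-o wide`. The set of reasons is an illustrative assumption, not an exhaustive list, and `suspicious_events` is a hypothetical helper.

```python
import json

# Event reasons that often matter when a pod goes away. This selection is an
# assumption for illustration only.
SUSPECT_REASONS = {"Evicted", "Killing", "OOMKilling", "Unhealthy",
                   "BackOff", "FailedScheduling"}


def suspicious_events(events_json, reasons=SUSPECT_REASONS):
    """Return (timestamp, object, reason, message) tuples, oldest first."""
    items = json.loads(events_json).get("items", [])
    rows = []
    for event in items:
        if event.get("reason") in reasons:
            rows.append((
                event.get("lastTimestamp") or event.get("eventTime") or "",
                event.get("involvedObject", {}).get("name", ""),
                event["reason"],
                event.get("message", ""),
            ))
    # ISO-8601 timestamps sort chronologically as plain strings.
    return sorted(rows)
```

Usage would be along the lines of `kubectl get events -n sparkling-water-dummy -o json > events.json`, followed by `suspicious_events(open("events.json").read())`.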

mn-mikke avatar May 26 '22 12:05 mn-mikke

closing because of no response for a long time

krasinski avatar Dec 05 '22 17:12 krasinski