
Error when training XGBoost or conducting target encoding on data with high-cardinality features in Sparkling Water

Open cliu-sift opened this issue 2 years ago • 13 comments


I initialized an H2O Sparkling Water cluster on Google Cloud Dataproc and trained XGBoost models on data with both categorical and numerical columns. This worked fine at first, but when I train the model on a large dataset in which some categorical features have more than 3,000,000 unique values, it triggers java.lang.OutOfMemoryError: Requested array size exceeds VM limit and the entire H2O cluster crashes. The algorithm runs without any problem when those high-cardinality features are deleted. Since it's company-internal data, I'm not able to share the exact data, but the code is quite simple:

from pyspark.ml import Pipeline
from pysparkling.ml import H2OXGBoost

estimator = H2OXGBoost(
    labelCol="label",
    ntrees=500,
    maxDepth=10,
    learnRate=0.1,
    categoricalEncoding="SortByResponse",
    convertUnknownCategoricalLevelsToNa=True,
    seed=100,
)
pipeline = Pipeline(stages=[estimator])
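As a hedged workaround sketch (not part of the original report): while a fix lands, the crash can be avoided by dropping the high-cardinality categorical columns before building the pipeline. The helper below is hypothetical and works on precomputed distinct counts in plain Python; in PySpark the counts could be computed per column with F.countDistinct.

```python
# Hypothetical workaround sketch (not from the issue): keep only feature
# columns whose distinct-value count stays below a chosen threshold.
# `cardinalities` maps column name -> number of unique values; in PySpark
# these counts could be obtained with F.countDistinct(col) per column.

def select_low_cardinality(cardinalities, threshold=1_000_000):
    """Return the column names considered safe to feed to H2OXGBoost."""
    return sorted(c for c, n in cardinalities.items() if n < threshold)

# Assumed example values, loosely matching the cardinalities in the issue:
cardinalities = {
    "user_id": 3_000_000,   # too many levels: the kind of column that crashes
    "country": 195,
    "device_type": 12,
    "amount": 1,            # numeric columns are unaffected either way
}
keep = select_low_cardinality(cardinalities)
print(keep)  # ['amount', 'country', 'device_type']
```

The threshold of 1,000,000 mirrors the cutoff the reporter experimented with later in this thread; it is a heuristic, not a documented limit.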

Any idea how I should get around this error?

  • Sparkling Water/PySparkling/RSparkling version: H2O cluster version 3.36.1.3
  • Hadoop version & distribution: 3.2.3
  • Execution mode: YARN-cluster (not quite sure about this, but I guess it should be YARN cluster)
  • YARN logs (collected via yarn logs -applicationId <application ID>, where the application ID is displayed when Sparkling Water is started):
Py4JJavaError: An error occurred while calling o116.fit.
: ai.h2o.sparkling.backend.exceptions.RestApiCommunicationException: H2O node http://10.192.80.233:54321/ responded with
Status code: 500 : java.lang.OutOfMemoryError: Requested array size exceeds VM limit
Server error: <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 500 java.lang.OutOfMemoryError: Requested array size exceeds VM limit
	at java.lang.StringCoding.encode(StringCoding.java:350)
	at java.lang.String.getBytes(String.java:941)
	at water.util.StringUtils.bytesOf(StringUtils.java:197)
	at water.api.NanoResponse.&lt;init&gt;(NanoResponse.java:40)
	at water.api.RequestServer.serveSchema(RequestServer.java:781)
	at water.api.RequestServer.serve(RequestServer.java:474)
	at water.api.RequestServer.doGeneric(RequestServer.java:303)
	at water.api.RequestServer.doGet(RequestServer.java:225)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:503)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:590)
	at ai.h2o.org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:865)
	at ai.h2o.org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:535)
	at ai.h2o.org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
	at ai.h2o.org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
	at ai.h2o.org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
	at ai.h2o.org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
	at ai.h2o.org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
	at ai.h2o.org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)
	at ai.h2o.org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
	at ai.h2o.org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
	at ai.h2o.org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
	at water.webserver.jetty9.Jetty9ServerAdapter$LoginHandler.handle(Jetty9ServerAdapter.java:130)
	at ai.h2o.org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
	at ai.h2o.org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
	at ai.h2o.org.eclipse.jetty.server.Server.handle(Server.java:531)
	at ai.h2o.org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
	at ai.h2o.org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
	at ai.h2o.org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
	at ai.h2o.org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
	at ai.h2o.org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
	at ai.h2o.org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
	at ai.h2o.org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
</title>
</head>
<body><h2>HTTP ERROR 500</h2>
<p>Problem accessing /3/Frames/frame_rdd_18787349126/summary. Reason:
<pre>    java.lang.OutOfMemoryError: Requested array size exceeds VM limit
	at java.lang.StringCoding.encode(StringCoding.java:350)
	at java.lang.String.getBytes(String.java:941)
	at water.util.StringUtils.bytesOf(StringUtils.java:197)
	at water.api.NanoResponse.&lt;init&gt;(NanoResponse.java:40)
	at water.api.RequestServer.serveSchema(RequestServer.java:781)
	at water.api.RequestServer.serve(RequestServer.java:474)
	at water.api.RequestServer.doGeneric(RequestServer.java:303)
	at water.api.RequestServer.doGet(RequestServer.java:225)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:503)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:590)
	at ai.h2o.org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:865)
	at ai.h2o.org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:535)
	at ai.h2o.org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
	at ai.h2o.org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
	at ai.h2o.org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
	at ai.h2o.org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
	at ai.h2o.org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
	at ai.h2o.org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)
	at ai.h2o.org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
	at ai.h2o.org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
	at ai.h2o.org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
	at water.webserver.jetty9.Jetty9ServerAdapter$LoginHandler.handle(Jetty9ServerAdapter.java:130)
	at ai.h2o.org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
	at ai.h2o.org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
	at ai.h2o.org.eclipse.jetty.server.Server.handle(Server.java:531)
	at ai.h2o.org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
	at ai.h2o.org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
	at ai.h2o.org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
	at ai.h2o.org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
	at ai.h2o.org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
	at ai.h2o.org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
	at ai.h2o.org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
</pre></p>
</body>
</html>

	at ai.h2o.sparkling.backend.utils.RestCommunication.checkResponseCode(RestCommunication.scala:414)
	at ai.h2o.sparkling.backend.utils.RestCommunication.checkResponseCode$(RestCommunication.scala:394)
	at ai.h2o.sparkling.H2OFrame$.checkResponseCode(H2OFrame.scala:287)
	at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent(RestCommunication.scala:386)
	at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent$(RestCommunication.scala:370)
	at ai.h2o.sparkling.H2OFrame$.readURLContent(H2OFrame.scala:287)
	at ai.h2o.sparkling.backend.utils.RestCommunication.request(RestCommunication.scala:182)
	at ai.h2o.sparkling.backend.utils.RestCommunication.request$(RestCommunication.scala:172)
	at ai.h2o.sparkling.H2OFrame$.request(H2OFrame.scala:287)
	at ai.h2o.sparkling.backend.utils.RestCommunication.query(RestCommunication.scala:67)
	at ai.h2o.sparkling.backend.utils.RestCommunication.query$(RestCommunication.scala:59)
	at ai.h2o.sparkling.H2OFrame$.query(H2OFrame.scala:287)
	at ai.h2o.sparkling.H2OFrame$.getFrame(H2OFrame.scala:342)
	at ai.h2o.sparkling.H2OFrame$.apply(H2OFrame.scala:291)
	at ai.h2o.sparkling.backend.Writer$.convert(Writer.scala:109)
	at ai.h2o.sparkling.backend.converters.SparkDataFrameConverter$.toH2OFrame(SparkDataFrameConverter.scala:77)
	at ai.h2o.sparkling.H2OContext.$anonfun$asH2OFrame$2(H2OContext.scala:176)
	at ai.h2o.sparkling.backend.utils.H2OContextExtensions.withConversionDebugPrints(H2OContextExtensions.scala:86)
	at ai.h2o.sparkling.backend.utils.H2OContextExtensions.withConversionDebugPrints$(H2OContextExtensions.scala:74)
	at ai.h2o.sparkling.H2OContext.withConversionDebugPrints(H2OContext.scala:65)
	at ai.h2o.sparkling.H2OContext.asH2OFrame(H2OContext.scala:176)
	at ai.h2o.sparkling.H2OContext.asH2OFrame(H2OContext.scala:162)
	at ai.h2o.sparkling.ml.features.H2OTargetEncoder.fit(H2OTargetEncoder.scala:55)
	at ai.h2o.sparkling.ml.features.H2OTargetEncoder.fit(H2OTargetEncoder.scala:34)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:750)


22/08/15 21:07:36 WARN org.apache.spark.h2o.backends.internal.InternalH2OBackend: New spark executor joined the cloud, however it won't be used for the H2O computations.
Exception in thread "Thread-57" ai.h2o.sparkling.backend.exceptions.H2OClusterNotReachableException: H2O cluster 10.192.80.233:54321 - sparkling-water-root_application_1660588132711_0002 is not reachable,
H2OContext has been closed! Please create a new H2OContext to a healthy and reachable (web enabled)
H2O cluster.
	at ai.h2o.sparkling.H2OContext$$anon$2.run(H2OContext.scala:382)
Caused by: ai.h2o.sparkling.backend.exceptions.RestApiNotReachableException: H2O node http://10.192.80.233:54321/ is not reachable.
Please verify that you are passing ip and port of existing cluster node and the cluster
is running with web enabled.
	at ai.h2o.sparkling.backend.utils.RestCommunication.throwRestApiNotReachableException(RestCommunication.scala:433)
	at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent(RestCommunication.scala:390)
	at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent$(RestCommunication.scala:370)
	at ai.h2o.sparkling.backend.utils.RestApiUtils$.readURLContent(RestApiUtils.scala:96)
	at ai.h2o.sparkling.backend.utils.RestCommunication.request(RestCommunication.scala:182)
	at ai.h2o.sparkling.backend.utils.RestCommunication.request$(RestCommunication.scala:172)
	at ai.h2o.sparkling.backend.utils.RestApiUtils$.request(RestApiUtils.scala:96)
	at ai.h2o.sparkling.backend.utils.RestCommunication.query(RestCommunication.scala:67)
	at ai.h2o.sparkling.backend.utils.RestCommunication.query$(RestCommunication.scala:59)
	at ai.h2o.sparkling.backend.utils.RestApiUtils$.query(RestApiUtils.scala:96)
	at ai.h2o.sparkling.backend.utils.RestApiUtils.getPingInfo(RestApiUtils.scala:32)
	at ai.h2o.sparkling.backend.utils.RestApiUtils.getPingInfo$(RestApiUtils.scala:30)
	at ai.h2o.sparkling.backend.utils.RestApiUtils$.getPingInfo(RestApiUtils.scala:96)
	at ai.h2o.sparkling.H2OContext.ai$h2o$sparkling$H2OContext$$getSparklingWaterHeartbeatEvent(H2OContext.scala:344)
	at ai.h2o.sparkling.H2OContext$$anon$2.run(H2OContext.scala:356)
Caused by: java.net.ConnectException: Connection refused (Connection refused)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at sun.net.www.protocol.http.HttpURLConnection$10.run(HttpURLConnection.java:1952)
	at sun.net.www.protocol.http.HttpURLConnection$10.run(HttpURLConnection.java:1947)
	at java.security.AccessController.doPrivileged(Native Method)
	at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1946)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1516)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1500)
	at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
	at ai.h2o.sparkling.backend.utils.RestCommunication.$anonfun$checkResponseCode$1(RestCommunication.scala:398)
	at scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23)
	at scala.util.Try$.apply(Try.scala:213)
	at ai.h2o.sparkling.backend.utils.RestCommunication.retry(RestCommunication.scala:439)
	at ai.h2o.sparkling.backend.utils.RestCommunication.checkResponseCode(RestCommunication.scala:398)
	at ai.h2o.sparkling.backend.utils.RestCommunication.checkResponseCode$(RestCommunication.scala:394)
	at ai.h2o.sparkling.backend.utils.RestApiUtils$.checkResponseCode(RestApiUtils.scala:96)
	at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent(RestCommunication.scala:386)
	... 13 more
Caused by: java.net.ConnectException: Connection refused (Connection refused)
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:607)
	at java.net.Socket.connect(Socket.java:556)
	at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
	at sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
	at sun.net.www.http.HttpClient.New(HttpClient.java:339)
	at sun.net.www.http.HttpClient.New(HttpClient.java:357)
	at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1228)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1162)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1056)
	at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:990)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1572)
	... 23 more


  • Are you using Windows/Linux/MAC? I think the cluster should run on Linux.
  • Spark & Sparkling Water configuration, including the memory configuration: (see attached screenshot)


cliu-sift, Aug 15 '22 23:08

Thanks for reporting the problem, it will be covered by the Jira ticket SW-2736.

mn-mikke, Aug 16 '22 12:08

Cool, thank you @mn-mikke! Do you have any idea when the problem will be solved? And could you maybe illuminate me a little bit more on why the error occurred? I thought it was because the algorithm needs to allocate some large arrays that exceed Java's array size limit on a single machine, but since I haven't seen the source code, I don't know the precise reason. Any clues would be helpful. Thanks!

cliu-sift, Aug 17 '22 01:08

By the way, I also hit another error with the same code when I filtered out the highest-cardinality features (cardinality > 1,000,000) but kept the ones with cardinality < 1,000,000. Now I at least see the H2O progress bar, but it throws a different error (see the attached screenshot).

The error message is:

During handling of the above exception, another exception occurred:

Py4JJavaError                             Traceback (most recent call last)
Input In [13], in <cell line: 7>()
     19     pipeline = Pipeline(stages=[estimator])
     20     # Fit and export the pipeline
     21 #     model = pipeline.fit(df_train_global_filter)
---> 22     model = pipeline.fit(df_train_global_new)
     23     model.write().overwrite().save(base_opath+'sparkling_raw_feature_option2.model')
     24 df_prediction = model.transform(df_test_global_new).select(['label', F.col('detailed_prediction')['probabilities']['1.0'].alias('prob')])

File /usr/lib/spark/python/pyspark/ml/base.py:161, in Estimator.fit(self, dataset, params)
    159         return self.copy(params)._fit(dataset)
    160     else:
--> 161         return self._fit(dataset)
    162 else:
    163     raise ValueError("Params must be either a param map or a list/tuple of param maps, "
    164                      "but got %s." % type(params))

File /usr/lib/spark/python/pyspark/ml/pipeline.py:114, in Pipeline._fit(self, dataset)
    112     dataset = stage.transform(dataset)
    113 else:  # must be an Estimator
--> 114     model = stage.fit(dataset)
    115     transformers.append(model)
    116     if i < indexOfLastEstimator:

File /usr/lib/spark/python/pyspark/ml/base.py:161, in Estimator.fit(self, dataset, params)
    159         return self.copy(params)._fit(dataset)
    160     else:
--> 161         return self._fit(dataset)
    162 else:
    163     raise ValueError("Params must be either a param map or a list/tuple of param maps, "
    164                      "but got %s." % type(params))

File /usr/lib/spark/python/pyspark/ml/wrapper.py:335, in JavaEstimator._fit(self, dataset)
    334 def _fit(self, dataset):
--> 335     java_model = self._fit_java(dataset)
    336     model = self._create_model(java_model)
    337     return self._copyValues(model)

File /usr/lib/spark/python/pyspark/ml/wrapper.py:332, in JavaEstimator._fit_java(self, dataset)
    318 """
    319 Fits a Java model to the input dataset.
    320 
   (...)
    329     fitted Java model
    330 """
    331 self._transfer_params_to_java()
--> 332 return self._java_obj.fit(dataset._jdf)

File /opt/conda/miniconda3/lib/python3.8/site-packages/py4j/java_gateway.py:1304, in JavaMember.__call__(self, *args)
   1298 command = proto.CALL_COMMAND_NAME +\
   1299     self.command_header +\
   1300     args_command +\
   1301     proto.END_COMMAND_PART
   1303 answer = self.gateway_client.send_command(command)
-> 1304 return_value = get_return_value(
   1305     answer, self.gateway_client, self.target_id, self.name)
   1307 for temp_arg in temp_args:
   1308     temp_arg._detach()

File /usr/lib/spark/python/pyspark/sql/utils.py:111, in capture_sql_exception.<locals>.deco(*a, **kw)
    109 def deco(*a, **kw):
    110     try:
--> 111         return f(*a, **kw)
    112     except py4j.protocol.Py4JJavaError as e:
    113         converted = convert_exception(e.java_exception)

File /opt/conda/miniconda3/lib/python3.8/site-packages/py4j/protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
    324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325 if answer[1] == REFERENCE_TYPE:
--> 326     raise Py4JJavaError(
    327         "An error occurred while calling {0}{1}{2}.\n".
    328         format(target_id, ".", name), value)
    329 else:
    330     raise Py4JError(
    331         "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
    332         format(target_id, ".", name, value))

Py4JJavaError: An error occurred while calling o181.fit.
: ai.h2o.sparkling.backend.exceptions.RestApiNotReachableException: H2O node http://10.192.80.25:54321/ is not reachable.
Please verify that you are passing ip and port of existing cluster node and the cluster
is running with web enabled.
	at ai.h2o.sparkling.backend.utils.RestCommunication.throwRestApiNotReachableException(RestCommunication.scala:433)
	at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent(RestCommunication.scala:390)
	at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent$(RestCommunication.scala:370)
	at ai.h2o.sparkling.backend.H2OJob$.readURLContent(H2OJob.scala:54)
	at ai.h2o.sparkling.backend.utils.RestCommunication.request(RestCommunication.scala:182)
	at ai.h2o.sparkling.backend.utils.RestCommunication.request$(RestCommunication.scala:172)
	at ai.h2o.sparkling.backend.H2OJob$.request(H2OJob.scala:54)
	at ai.h2o.sparkling.backend.utils.RestCommunication.query(RestCommunication.scala:67)
	at ai.h2o.sparkling.backend.utils.RestCommunication.query$(RestCommunication.scala:59)
	at ai.h2o.sparkling.backend.H2OJob$.query(H2OJob.scala:54)
	at ai.h2o.sparkling.backend.H2OJob$.ai$h2o$sparkling$backend$H2OJob$$verifyAndGetJob(H2OJob.scala:63)
	at ai.h2o.sparkling.backend.H2OJob.waitForFinishAndPrintProgress(H2OJob.scala:32)
	at ai.h2o.sparkling.ml.utils.EstimatorCommonUtils.trainAndGetDestinationKey(EstimatorCommonUtils.scala:44)
	at ai.h2o.sparkling.ml.utils.EstimatorCommonUtils.trainAndGetDestinationKey$(EstimatorCommonUtils.scala:30)
	at ai.h2o.sparkling.ml.algos.H2OEstimator.trainAndGetDestinationKey(H2OEstimator.scala:16)
	at ai.h2o.sparkling.ml.algos.H2OEstimator.trainH2OModel(H2OEstimator.scala:51)
	at ai.h2o.sparkling.ml.algos.H2OEstimator.fit(H2OEstimator.scala:36)
	at ai.h2o.sparkling.ml.algos.H2OAlgorithm.fit(H2OAlgorithm.scala:41)
	at ai.h2o.sparkling.ml.algos.H2OSupervisedAlgorithm.fit(H2OSupervisedAlgorithm.scala:57)
	at ai.h2o.sparkling.ml.algos.H2OTreeBasedSupervisedAlgorithm.fit(H2OTreeBasedSupervisedAlgorithm.scala:30)
	at ai.h2o.sparkling.ml.algos.H2OXGBoost.fit(H2OXGBoost.scala:35)
	at ai.h2o.sparkling.ml.algos.H2OXGBoost.fit(H2OXGBoost.scala:27)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.net.SocketException: Connection reset
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at sun.net.www.protocol.http.HttpURLConnection$10.run(HttpURLConnection.java:1952)
	at sun.net.www.protocol.http.HttpURLConnection$10.run(HttpURLConnection.java:1947)
	at java.security.AccessController.doPrivileged(Native Method)
	at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1946)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1516)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1500)
	at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
	at ai.h2o.sparkling.backend.utils.RestCommunication.$anonfun$checkResponseCode$1(RestCommunication.scala:398)
	at scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23)
	at scala.util.Try$.apply(Try.scala:213)
	at ai.h2o.sparkling.backend.utils.RestCommunication.retry(RestCommunication.scala:439)
	at ai.h2o.sparkling.backend.utils.RestCommunication.checkResponseCode(RestCommunication.scala:398)
	at ai.h2o.sparkling.backend.utils.RestCommunication.checkResponseCode$(RestCommunication.scala:394)
	at ai.h2o.sparkling.backend.H2OJob$.checkResponseCode(H2OJob.scala:54)
	at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent(RestCommunication.scala:386)
	... 31 more
Caused by: java.net.SocketException: Connection reset
	at java.net.SocketInputStream.read(SocketInputStream.java:210)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
	at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:743)
	at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
	at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:702)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1595)
	... 41 more


Exception in thread "Thread-136" ai.h2o.sparkling.backend.exceptions.H2OClusterNotReachableException: H2O cluster 10.192.80.25:54321 - sparkling-water-root_application_1660078545425_0002 is not reachable,
H2OContext has been closed! Please create a new H2OContext to a healthy and reachable (web enabled)
H2O cluster.
	at ai.h2o.sparkling.H2OContext$$anon$2.run(H2OContext.scala:382)
Caused by: ai.h2o.sparkling.backend.exceptions.RestApiNotReachableException: H2O node http://10.192.80.25:54321/ is not reachable.
Please verify that you are passing ip and port of existing cluster node and the cluster
is running with web enabled.
	at ai.h2o.sparkling.backend.utils.RestCommunication.throwRestApiNotReachableException(RestCommunication.scala:433)
	at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent(RestCommunication.scala:390)
	at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent$(RestCommunication.scala:370)
	at ai.h2o.sparkling.backend.utils.RestApiUtils$.readURLContent(RestApiUtils.scala:96)
	at ai.h2o.sparkling.backend.utils.RestCommunication.request(RestCommunication.scala:182)
	at ai.h2o.sparkling.backend.utils.RestCommunication.request$(RestCommunication.scala:172)
	at ai.h2o.sparkling.backend.utils.RestApiUtils$.request(RestApiUtils.scala:96)
	at ai.h2o.sparkling.backend.utils.RestCommunication.query(RestCommunication.scala:67)
	at ai.h2o.sparkling.backend.utils.RestCommunication.query$(RestCommunication.scala:59)
	at ai.h2o.sparkling.backend.utils.RestApiUtils$.query(RestApiUtils.scala:96)
	at ai.h2o.sparkling.backend.utils.RestApiUtils.getPingInfo(RestApiUtils.scala:32)
	at ai.h2o.sparkling.backend.utils.RestApiUtils.getPingInfo$(RestApiUtils.scala:30)
	at ai.h2o.sparkling.backend.utils.RestApiUtils$.getPingInfo(RestApiUtils.scala:96)
	at ai.h2o.sparkling.H2OContext.ai$h2o$sparkling$H2OContext$$getSparklingWaterHeartbeatEvent(H2OContext.scala:344)
	at ai.h2o.sparkling.H2OContext$$anon$2.run(H2OContext.scala:356)
Caused by: java.net.ConnectException: Connection refused (Connection refused)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at sun.net.www.protocol.http.HttpURLConnection$10.run(HttpURLConnection.java:1952)
	at sun.net.www.protocol.http.HttpURLConnection$10.run(HttpURLConnection.java:1947)
	at java.security.AccessController.doPrivileged(Native Method)
	at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1946)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1516)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1500)
	at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
	at ai.h2o.sparkling.backend.utils.RestCommunication.$anonfun$checkResponseCode$1(RestCommunication.scala:398)
	at scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23)
	at scala.util.Try$.apply(Try.scala:213)
	at ai.h2o.sparkling.backend.utils.RestCommunication.retry(RestCommunication.scala:439)
	at ai.h2o.sparkling.backend.utils.RestCommunication.checkResponseCode(RestCommunication.scala:398)
	at ai.h2o.sparkling.backend.utils.RestCommunication.checkResponseCode$(RestCommunication.scala:394)
	at ai.h2o.sparkling.backend.utils.RestApiUtils$.checkResponseCode(RestApiUtils.scala:96)
	at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent(RestCommunication.scala:386)
	... 13 more
Caused by: java.net.ConnectException: Connection refused (Connection refused)
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:607)
	at java.net.Socket.connect(Socket.java:556)
	at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
	at sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
	at sun.net.www.http.HttpClient.New(HttpClient.java:339)
	at sun.net.www.http.HttpClient.New(HttpClient.java:357)
	at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1228)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1162)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1056)
	at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:990)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1572)
	... 23 more

cliu-sift, Aug 18 '22 01:08

Hi @cliu-sift,

illuminate me a little bit more on why the error occurred?

The error doesn't happen during the execution of the XGBoost algorithm, but in the last step of the conversion from a Spark DataFrame to an H2OFrame. That last step asks the H2O backend for the schema of the H2OFrame together with metadata. The current implementation returns metadata with histogram bins for each column, and the array of histogram bins contains a value for each categorical level. So the final schema is significantly bigger and fails on serialization to a byte array. You should get the same error with a different algorithm as well. Histogram bins are in fact not needed in SW, so we can get rid of them.
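To make this failure mode concrete, here is an editor's back-of-envelope sketch (the per-level byte cost and column count below are assumptions, not figures from this thread): a JVM array is indexed by a signed 32-bit int, so no byte array can exceed roughly 2^31 bytes, and a schema that embeds one histogram-bin label per categorical level can cross that limit with only a handful of multi-million-level columns.

```python
# Back-of-envelope sketch with assumed numbers: estimate the serialized size
# of a frame summary carrying one histogram-bin label per categorical level,
# and compare it with the maximum JVM byte-array size.

JVM_MAX_ARRAY_BYTES = 2**31 - 1   # Java arrays use a signed 32-bit index

levels_per_column = 3_000_000     # cardinality reported in the issue
bytes_per_level = 40              # assumed: label string plus JSON overhead
categorical_columns = 20          # assumed number of high-cardinality columns

payload = levels_per_column * bytes_per_level * categorical_columns
print(f"estimated payload: {payload / 1e9:.1f} GB")
print("exceeds JVM array limit:", payload > JVM_MAX_ARRAY_BYTES)
```

Under these assumptions the payload comes out around 2.4 GB, which is larger than the ~2.1 GB array ceiling, matching the "Requested array size exceeds VM limit" message in the traces above.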

Do you have any idea when the problem will be solved

I will share a link to a nightly build artifact once a bug fix gets in. A proper release will likely go out in September.

mn-mikke avatar Aug 18 '22 11:08 mn-mikke

I see, thank you @mn-mikke! That makes much more sense. Please allow me to ask a follow-up question. I've attached the second type of error message, which I got when I deleted some of the features but still kept a small number of high-cardinality ones. Now the algorithm seems to be running (at least the progress bar shows up), but it eventually loses the connection to the cluster as well. Do you think it's still the same problem?

cliu-sift avatar Aug 18 '22 16:08 cliu-sift

@cliu-sift Regarding the second error, I will need to see the logs from the H2O nodes (the full YARN logs) to tell you what went wrong.

yarn logs -applicationId <Application ID>

mn-mikke avatar Aug 18 '22 17:08 mn-mikke

@cliu-sift Regarding the second error, I will need to see the logs from the H2O nodes (the full YARN logs) to tell you what went wrong.

yarn logs -applicationId <Application ID>

I've extracted the full log. Please feel free to let me know if you have any questions @mn-mikke. Thank you!

cliu_h2o_yarn_logs.log

cliu-sift avatar Aug 18 '22 18:08 cliu-sift

There is a problem with the serialization of XGBoostModelInfo; for some reason it is too big:

java.lang.OutOfMemoryError: Requested array size exceeds VM limit
	at java.lang.StringCoding.encode(StringCoding.java:350)
	at java.lang.String.getBytes(String.java:941)
	at water.util.StringUtils.bytesOf(StringUtils.java:197)
	at water.AutoBuffer.putStr(AutoBuffer.java:1527)
	at hex.tree.xgboost.XGBoostModelInfo$Icer.write163(XGBoostModelInfo$Icer.java)
	at hex.tree.xgboost.XGBoostModelInfo$Icer.write(XGBoostModelInfo$Icer.java)
	at water.Iced.write(Iced.java:61)
	at water.AutoBuffer.put(AutoBuffer.java:781)
	at hex.tree.xgboost.matrix.FrameMatrixLoader$Icer.write167(FrameMatrixLoader$Icer.java)
	at hex.tree.xgboost.matrix.FrameMatrixLoader$Icer.write(FrameMatrixLoader$Icer.java)
	at water.Iced.write(Iced.java:61)
	at water.AutoBuffer.put(AutoBuffer.java:781)
	at hex.tree.xgboost.task.XGBoostSetupTask$Icer.write164(XGBoostSetupTask$Icer.java)
	at hex.tree.xgboost.task.XGBoostSetupTask$Icer.write(XGBoostSetupTask$Icer.java)
	at water.H2O$H2OCountedCompleter.write(H2O.java:1716)
	at water.AutoBuffer.put(AutoBuffer.java:781)
	at water.RPC.call(RPC.java:201)
	at water.MRTask.remote_compute(MRTask.java:756)
	at water.MRTask.setupLocal0(MRTask.java:716)
	at water.MRTask.dfork(MRTask.java:563)
	at water.MRTask.doAll(MRTask.java:554)
	at water.MRTask.doAllNodes(MRTask.java:568)
	at hex.tree.xgboost.task.AbstractXGBoostTask.run(AbstractXGBoostTask.java:45)
	at hex.tree.xgboost.exec.LocalXGBoostExecutor.setup(LocalXGBoostExecutor.java:98)
	at hex.tree.xgboost.XGBoost$XGBoostDriver.buildModelImpl(XGBoost.java:452)
	at hex.tree.xgboost.XGBoost$XGBoostDriver.buildModel(XGBoost.java:407)
	at hex.tree.xgboost.XGBoost$XGBoostDriver.computeImpl(XGBoost.java:393)
	at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:252)
	at water.H2O$H2OCountedCompleter.compute(H2O.java:1677)
	at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
	at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
	at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:976)
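For context on the limit itself (an illustrative sketch, not the actual H2O code): the trace fails inside String.getBytes, which must materialize the whole encoded string as a single byte array, and a lone JVM array is capped near 2^31 elements. A string field that enumerates every categorical level can blow past that cap:

```python
# Illustrative only: the sizes below are assumptions, not values measured
# from XGBoostModelInfo.
MAX_JAVA_ARRAY = 2**31 - 1      # String.getBytes cannot return more bytes

total_levels = 3_000_000 * 40   # assumed: levels x high-cardinality columns
bytes_per_level_entry = 30      # assumed text per level in the string field

single_string_bytes = total_levels * bytes_per_level_entry
print(single_string_bytes > MAX_JAVA_ARRAY)  # → True: one String cannot hold this
```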

@valenad1 @michalkurka Any idea how to reduce size of XGBoostModelInfo?

mn-mikke avatar Aug 19 '22 14:08 mn-mikke

Hey @mn-mikke, I saw you have merged the changes to the master branch. Really appreciate it! Is there any way I could test the changes with your new code? I'm using H2O through Dataproc right now. Also, could you shed some light on the potential cause of the second error? The cluster seems to get disconnected eventually.

cliu-sift avatar Aug 22 '22 16:08 cliu-sift

By the way, I ran the Sparkling Water GLM code today on similar data with high-cardinality features, and the second error happened again. So my guess is that it happens for multiple models (XGBoost, GLM logistic regression, target encoding, etc.) with high-cardinality categorical features. Just in case this information is helpful for identifying the underlying issue behind the second error @mn-mikke
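Not something suggested in this thread, but a common way to sidestep such extreme cardinality while waiting for a fix is to hash the offending columns into a bounded number of buckets before the frame ever reaches H2O. The function below is a minimal plain-Python sketch (the column value and bucket count are made up); in PySpark the same idea can be expressed with `F.abs(F.hash(col)) % n_buckets`.

```python
import hashlib

def hash_bucket(value: str, n_buckets: int = 100_000) -> int:
    """Map a categorical value to one of n_buckets stable buckets.

    Collisions are expected; n_buckets trades collision rate against
    the cardinality the model has to cope with.
    """
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

# A 3,000,000-level id column shrinks to at most 100,000 distinct values.
bucket = hash_bucket("user_123456")  # "user_123456" is a made-up example id
print(0 <= bucket < 100_000)         # → True
```

Tree models are usually tolerant of hash collisions; if collisions matter, frequency or count encoding of the rare levels is another option.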

cliu-sift avatar Aug 23 '22 04:08 cliu-sift

Hi @cliu-sift,

Is there any way I could test the changes with your new code?

You can try this nightly build:

  • https://h2o-release.s3.amazonaws.com/sparkling-water/spark-3.0/rel-3.36.1_rel-zumbo/nightly/3.36.1.5-1.5-3.0/index.html for spark 3.0
  • https://h2o-release.s3.amazonaws.com/sparkling-water/spark-3.1/rel-3.36.1_rel-zumbo/nightly/3.36.1.5-1.5-3.1/index.html for spark 3.1
  • https://h2o-release.s3.amazonaws.com/sparkling-water/spark-3.2/rel-3.36.1_rel-zumbo/nightly/3.36.1.5-1.5-3.2/index.html for spark 3.2

mn-mikke avatar Aug 23 '22 14:08 mn-mikke

Hi @cliu-sift,

Is there any way I could test the changes with your new code?

You can try this nightly build:

  • https://h2o-release.s3.amazonaws.com/sparkling-water/spark-3.0/rel-3.36.1_rel-zumbo/nightly/3.36.1.5-1.5-3.0/index.html for spark 3.0
  • https://h2o-release.s3.amazonaws.com/sparkling-water/spark-3.1/rel-3.36.1_rel-zumbo/nightly/3.36.1.5-1.5-3.1/index.html for spark 3.1
  • https://h2o-release.s3.amazonaws.com/sparkling-water/spark-3.2/rel-3.36.1_rel-zumbo/nightly/3.36.1.5-1.5-3.2/index.html for spark 3.2

Does the h2o version matter for using this nightly build? I saw there is a new h2o version corresponding to that nightly build, but the matching Python h2o package doesn't seem to be available yet: https://h2o-release.s3.amazonaws.com/h2o/master/5926/index.html

So I tried installing h2o 3.36.1.4 together with the nightly Sparkling Water Scala libraries you shared. Not sure whether that's okay or not.

cliu-sift avatar Aug 24 '22 02:08 cliu-sift

@mn-mikke I think your changes worked to some extent, as I was now able to get the H2O progress bar, but the whole cluster still eventually crashed (like the second error I mentioned before). (Screenshot of the progress bar attached.) The full YARN log for that application is attached: cliu_h2o_yarn_logs2.log. By the way, if it's helpful, I can try to generate a "fake" dataset with simulated values for each column for you, although it might be very large saved as a CSV file.

cliu-sift avatar Aug 24 '22 04:08 cliu-sift

Closing the issue due to long inactivity; please reopen if still needed.

krasinski avatar Apr 17 '24 21:04 krasinski