sparkling-water
Error when training XGBoost or conducting target encoding on data with high-cardinality features in Sparkling Water
Providing us with the observed and expected behavior, along with the following information, definitely helps:
I initialized the H2O Sparkling Water cluster on Google Cloud Dataproc and did some XGBoost training on data with both categorical and numerical columns. It worked fine at first, but when I train the model on a large dataset with some categorical features that have more than 3,000,000 unique values, it triggers the error
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
and the entire H2O cluster crashes. The algorithm runs without any problem when those high-cardinality features are deleted. Since this is internal company data, I'm not able to share the exact dataset, but the code is actually quite simple:
from pyspark.ml import Pipeline
from pysparkling.ml import H2OXGBoost

estimator = H2OXGBoost(
    labelCol="label",
    ntrees=500,
    maxDepth=10,
    learnRate=0.1,
    categoricalEncoding="SortByResponse",
    convertUnknownCategoricalLevelsToNa=True,
    seed=100,
)
pipeline = Pipeline(stages=[estimator])
Any idea how I can work around this error?
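As the question notes, the training runs fine once the high-cardinality features are deleted, so one interim workaround is to select only the lower-cardinality columns before fitting. The sketch below shows just the selection logic; the column names, counts, and threshold are made up for illustration, and in a real pipeline the counts would come from Spark (e.g. `F.countDistinct`, as hinted in the comment):

```python
# Workaround sketch: keep only feature columns whose cardinality stays below
# a chosen threshold. All names and numbers here are hypothetical.
def low_cardinality_columns(cardinalities, threshold=1_000_000):
    """cardinalities: dict mapping column name -> number of unique levels."""
    return sorted(c for c, n in cardinalities.items() if n < threshold)

# In PySpark the counts could be gathered first, for example:
#   from pyspark.sql import functions as F
#   cardinalities = {c: df.select(F.countDistinct(c)).first()[0]
#                    for c in categorical_cols}
cardinalities = {"user_id": 3_000_000, "country": 200, "device_type": 1_500}
keep = low_cardinality_columns(cardinalities)
print(keep)  # ['country', 'device_type']
```

A `df.select(keep + ["label"])` before `pipeline.fit(...)` would then feed only the retained columns to the estimator.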
- Sparkling Water/PySparkling/RSparkling version: H2O cluster version 3.36.1.3
- Hadoop Version & Distribution: 3.2.3
- Execution mode: YARN-cluster
(not quite sure about this, but I guess it should be YARN cluster)
- YARN logs in case of running on YARN. To collect such logs you may run `yarn logs -applicationId <application ID>`, where the application ID is displayed when Sparkling Water is started
- H2O & Spark logs if not running on YARN. You can find these logs in the Spark work directory
Py4JJavaError: An error occurred while calling o116.fit.
: ai.h2o.sparkling.backend.exceptions.RestApiCommunicationException: H2O node http://10.192.80.233:54321/ responded with
Status code: 500 : java.lang.OutOfMemoryError: Requested array size exceeds VM limit
	at java.lang.StringCoding.encode(StringCoding.java:350)
	at java.lang.String.getBytes(String.java:941)
	at water.util.StringUtils.bytesOf(StringUtils.java:197)
	at water.api.NanoResponse.<init>(NanoResponse.java:40)
	at water.api.RequestServer.serveSchema(RequestServer.java:781)
	at water.api.RequestServer.serve(RequestServer.java:474)
	at water.api.RequestServer.doGeneric(RequestServer.java:303)
	at water.api.RequestServer.doGet(RequestServer.java:225)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:503)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:590)
	at ai.h2o.org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:865)
	at ai.h2o.org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:535)
	at ai.h2o.org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
	at ai.h2o.org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
	at ai.h2o.org
Server error: <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 500 java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.lang.StringCoding.encode(StringCoding.java:350)
at java.lang.String.getBytes(String.java:941)
at water.util.StringUtils.bytesOf(StringUtils.java:197)
at water.api.NanoResponse.<init>(NanoResponse.java:40)
at water.api.RequestServer.serveSchema(RequestServer.java:781)
at water.api.RequestServer.serve(RequestServer.java:474)
at water.api.RequestServer.doGeneric(RequestServer.java:303)
at water.api.RequestServer.doGet(RequestServer.java:225)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:503)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:590)
at ai.h2o.org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:865)
at ai.h2o.org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:535)
at ai.h2o.org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
at ai.h2o.org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
at ai.h2o.org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
at ai.h2o.org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
at ai.h2o.org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
at ai.h2o.org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)
at ai.h2o.org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
at ai.h2o.org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
at ai.h2o.org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at water.webserver.jetty9.Jetty9ServerAdapter$LoginHandler.handle(Jetty9ServerAdapter.java:130)
at ai.h2o.org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
at ai.h2o.org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at ai.h2o.org.eclipse.jetty.server.Server.handle(Server.java:531)
at ai.h2o.org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
at ai.h2o.org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
at ai.h2o.org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
at ai.h2o.org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
at ai.h2o.org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
at ai.h2o.org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
at ai.h2o.org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
</title>
</head>
<body><h2>HTTP ERROR 500</h2>
<p>Problem accessing /3/Frames/frame_rdd_18787349126/summary. Reason:
<pre> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.lang.StringCoding.encode(StringCoding.java:350)
at java.lang.String.getBytes(String.java:941)
at water.util.StringUtils.bytesOf(StringUtils.java:197)
at water.api.NanoResponse.<init>(NanoResponse.java:40)
at water.api.RequestServer.serveSchema(RequestServer.java:781)
at water.api.RequestServer.serve(RequestServer.java:474)
at water.api.RequestServer.doGeneric(RequestServer.java:303)
at water.api.RequestServer.doGet(RequestServer.java:225)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:503)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:590)
at ai.h2o.org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:865)
at ai.h2o.org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:535)
at ai.h2o.org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
at ai.h2o.org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
at ai.h2o.org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
at ai.h2o.org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
at ai.h2o.org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
at ai.h2o.org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)
at ai.h2o.org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
at ai.h2o.org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
at ai.h2o.org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at water.webserver.jetty9.Jetty9ServerAdapter$LoginHandler.handle(Jetty9ServerAdapter.java:130)
at ai.h2o.org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
at ai.h2o.org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at ai.h2o.org.eclipse.jetty.server.Server.handle(Server.java:531)
at ai.h2o.org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
at ai.h2o.org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
at ai.h2o.org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
at ai.h2o.org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
at ai.h2o.org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
at ai.h2o.org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
at ai.h2o.org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
</pre></p>
</body>
</html>
at ai.h2o.sparkling.backend.utils.RestCommunication.checkResponseCode(RestCommunication.scala:414)
at ai.h2o.sparkling.backend.utils.RestCommunication.checkResponseCode$(RestCommunication.scala:394)
at ai.h2o.sparkling.H2OFrame$.checkResponseCode(H2OFrame.scala:287)
at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent(RestCommunication.scala:386)
at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent$(RestCommunication.scala:370)
at ai.h2o.sparkling.H2OFrame$.readURLContent(H2OFrame.scala:287)
at ai.h2o.sparkling.backend.utils.RestCommunication.request(RestCommunication.scala:182)
at ai.h2o.sparkling.backend.utils.RestCommunication.request$(RestCommunication.scala:172)
at ai.h2o.sparkling.H2OFrame$.request(H2OFrame.scala:287)
at ai.h2o.sparkling.backend.utils.RestCommunication.query(RestCommunication.scala:67)
at ai.h2o.sparkling.backend.utils.RestCommunication.query$(RestCommunication.scala:59)
at ai.h2o.sparkling.H2OFrame$.query(H2OFrame.scala:287)
at ai.h2o.sparkling.H2OFrame$.getFrame(H2OFrame.scala:342)
at ai.h2o.sparkling.H2OFrame$.apply(H2OFrame.scala:291)
at ai.h2o.sparkling.backend.Writer$.convert(Writer.scala:109)
at ai.h2o.sparkling.backend.converters.SparkDataFrameConverter$.toH2OFrame(SparkDataFrameConverter.scala:77)
at ai.h2o.sparkling.H2OContext.$anonfun$asH2OFrame$2(H2OContext.scala:176)
at ai.h2o.sparkling.backend.utils.H2OContextExtensions.withConversionDebugPrints(H2OContextExtensions.scala:86)
at ai.h2o.sparkling.backend.utils.H2OContextExtensions.withConversionDebugPrints$(H2OContextExtensions.scala:74)
at ai.h2o.sparkling.H2OContext.withConversionDebugPrints(H2OContext.scala:65)
at ai.h2o.sparkling.H2OContext.asH2OFrame(H2OContext.scala:176)
at ai.h2o.sparkling.H2OContext.asH2OFrame(H2OContext.scala:162)
at ai.h2o.sparkling.ml.features.H2OTargetEncoder.fit(H2OTargetEncoder.scala:55)
at ai.h2o.sparkling.ml.features.H2OTargetEncoder.fit(H2OTargetEncoder.scala:34)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
22/08/15 21:07:36 WARN org.apache.spark.h2o.backends.internal.InternalH2OBackend: New spark executor joined the cloud, however it won't be used for the H2O computations.
Exception in thread "Thread-57" ai.h2o.sparkling.backend.exceptions.H2OClusterNotReachableException: H2O cluster 10.192.80.233:54321 - sparkling-water-root_application_1660588132711_0002 is not reachable,
H2OContext has been closed! Please create a new H2OContext to a healthy and reachable (web enabled)
H2O cluster.
at ai.h2o.sparkling.H2OContext$$anon$2.run(H2OContext.scala:382)
Caused by: ai.h2o.sparkling.backend.exceptions.RestApiNotReachableException: H2O node http://10.192.80.233:54321/ is not reachable.
Please verify that you are passing ip and port of existing cluster node and the cluster
is running with web enabled.
at ai.h2o.sparkling.backend.utils.RestCommunication.throwRestApiNotReachableException(RestCommunication.scala:433)
at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent(RestCommunication.scala:390)
at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent$(RestCommunication.scala:370)
at ai.h2o.sparkling.backend.utils.RestApiUtils$.readURLContent(RestApiUtils.scala:96)
at ai.h2o.sparkling.backend.utils.RestCommunication.request(RestCommunication.scala:182)
at ai.h2o.sparkling.backend.utils.RestCommunication.request$(RestCommunication.scala:172)
at ai.h2o.sparkling.backend.utils.RestApiUtils$.request(RestApiUtils.scala:96)
at ai.h2o.sparkling.backend.utils.RestCommunication.query(RestCommunication.scala:67)
at ai.h2o.sparkling.backend.utils.RestCommunication.query$(RestCommunication.scala:59)
at ai.h2o.sparkling.backend.utils.RestApiUtils$.query(RestApiUtils.scala:96)
at ai.h2o.sparkling.backend.utils.RestApiUtils.getPingInfo(RestApiUtils.scala:32)
at ai.h2o.sparkling.backend.utils.RestApiUtils.getPingInfo$(RestApiUtils.scala:30)
at ai.h2o.sparkling.backend.utils.RestApiUtils$.getPingInfo(RestApiUtils.scala:96)
at ai.h2o.sparkling.H2OContext.ai$h2o$sparkling$H2OContext$$getSparklingWaterHeartbeatEvent(H2OContext.scala:344)
at ai.h2o.sparkling.H2OContext$$anon$2.run(H2OContext.scala:356)
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at sun.net.www.protocol.http.HttpURLConnection$10.run(HttpURLConnection.java:1952)
at sun.net.www.protocol.http.HttpURLConnection$10.run(HttpURLConnection.java:1947)
at java.security.AccessController.doPrivileged(Native Method)
at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1946)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1516)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1500)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
at ai.h2o.sparkling.backend.utils.RestCommunication.$anonfun$checkResponseCode$1(RestCommunication.scala:398)
at scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23)
at scala.util.Try$.apply(Try.scala:213)
at ai.h2o.sparkling.backend.utils.RestCommunication.retry(RestCommunication.scala:439)
at ai.h2o.sparkling.backend.utils.RestCommunication.checkResponseCode(RestCommunication.scala:398)
at ai.h2o.sparkling.backend.utils.RestCommunication.checkResponseCode$(RestCommunication.scala:394)
at ai.h2o.sparkling.backend.utils.RestApiUtils$.checkResponseCode(RestApiUtils.scala:96)
at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent(RestCommunication.scala:386)
... 13 more
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:607)
at java.net.Socket.connect(Socket.java:556)
at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
at sun.net.www.http.HttpClient.New(HttpClient.java:339)
at sun.net.www.http.HttpClient.New(HttpClient.java:357)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1228)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1162)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1056)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:990)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1572)
... 23 more
- Are you using Windows/Linux/Mac? I think the cluster should run on Linux
- Spark & Sparkling Water configuration, including the memory configuration
Please also provide us with the full and minimal reproducible code.
Thanks for reporting the problem, it will be covered by the Jira ticket SW-2736.
Cool, thank you! @mn-mikke Do you have any idea when the problem will be solved? And could you illuminate me a little more on why the error occurred? I thought it was because the algorithm needs to allocate some large arrays that exceed the JVM's array size limit on a single machine, but since I haven't looked at the source code, I don't know the precise reason for sure. Any clues would be helpful. Thanks!
By the way, I also hit another error with the same code when I tried filtering out the features with cardinality > 1,000,000 while still keeping the ones with cardinality < 1,000,000. Now at least I see the H2O progress bar, but it eventually throws a different error.
The error message is:
During handling of the above exception, another exception occurred:
Py4JJavaError Traceback (most recent call last)
Input In [13], in <cell line: 7>()
19 pipeline = Pipeline(stages=[estimator])
20 # Fit and export the pipeline
21 # model = pipeline.fit(df_train_global_filter)
---> 22 model = pipeline.fit(df_train_global_new)
23 model.write().overwrite().save(base_opath+'sparkling_raw_feature_option2.model')
24 df_prediction = model.transform(df_test_global_new).select(['label', F.col('detailed_prediction')['probabilities']['1.0'].alias('prob')])
File /usr/lib/spark/python/pyspark/ml/base.py:161, in Estimator.fit(self, dataset, params)
159 return self.copy(params)._fit(dataset)
160 else:
--> 161 return self._fit(dataset)
162 else:
163 raise ValueError("Params must be either a param map or a list/tuple of param maps, "
164 "but got %s." % type(params))
File /usr/lib/spark/python/pyspark/ml/pipeline.py:114, in Pipeline._fit(self, dataset)
112 dataset = stage.transform(dataset)
113 else: # must be an Estimator
--> 114 model = stage.fit(dataset)
115 transformers.append(model)
116 if i < indexOfLastEstimator:
File /usr/lib/spark/python/pyspark/ml/base.py:161, in Estimator.fit(self, dataset, params)
159 return self.copy(params)._fit(dataset)
160 else:
--> 161 return self._fit(dataset)
162 else:
163 raise ValueError("Params must be either a param map or a list/tuple of param maps, "
164 "but got %s." % type(params))
File /usr/lib/spark/python/pyspark/ml/wrapper.py:335, in JavaEstimator._fit(self, dataset)
334 def _fit(self, dataset):
--> 335 java_model = self._fit_java(dataset)
336 model = self._create_model(java_model)
337 return self._copyValues(model)
File /usr/lib/spark/python/pyspark/ml/wrapper.py:332, in JavaEstimator._fit_java(self, dataset)
318 """
319 Fits a Java model to the input dataset.
320
(...)
329 fitted Java model
330 """
331 self._transfer_params_to_java()
--> 332 return self._java_obj.fit(dataset._jdf)
File /opt/conda/miniconda3/lib/python3.8/site-packages/py4j/java_gateway.py:1304, in JavaMember.__call__(self, *args)
1298 command = proto.CALL_COMMAND_NAME +\
1299 self.command_header +\
1300 args_command +\
1301 proto.END_COMMAND_PART
1303 answer = self.gateway_client.send_command(command)
-> 1304 return_value = get_return_value(
1305 answer, self.gateway_client, self.target_id, self.name)
1307 for temp_arg in temp_args:
1308 temp_arg._detach()
File /usr/lib/spark/python/pyspark/sql/utils.py:111, in capture_sql_exception.<locals>.deco(*a, **kw)
109 def deco(*a, **kw):
110 try:
--> 111 return f(*a, **kw)
112 except py4j.protocol.Py4JJavaError as e:
113 converted = convert_exception(e.java_exception)
File /opt/conda/miniconda3/lib/python3.8/site-packages/py4j/protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
331 "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
332 format(target_id, ".", name, value))
Py4JJavaError: An error occurred while calling o181.fit.
: ai.h2o.sparkling.backend.exceptions.RestApiNotReachableException: H2O node http://10.192.80.25:54321/ is not reachable.
Please verify that you are passing ip and port of existing cluster node and the cluster
is running with web enabled.
at ai.h2o.sparkling.backend.utils.RestCommunication.throwRestApiNotReachableException(RestCommunication.scala:433)
at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent(RestCommunication.scala:390)
at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent$(RestCommunication.scala:370)
at ai.h2o.sparkling.backend.H2OJob$.readURLContent(H2OJob.scala:54)
at ai.h2o.sparkling.backend.utils.RestCommunication.request(RestCommunication.scala:182)
at ai.h2o.sparkling.backend.utils.RestCommunication.request$(RestCommunication.scala:172)
at ai.h2o.sparkling.backend.H2OJob$.request(H2OJob.scala:54)
at ai.h2o.sparkling.backend.utils.RestCommunication.query(RestCommunication.scala:67)
at ai.h2o.sparkling.backend.utils.RestCommunication.query$(RestCommunication.scala:59)
at ai.h2o.sparkling.backend.H2OJob$.query(H2OJob.scala:54)
at ai.h2o.sparkling.backend.H2OJob$.ai$h2o$sparkling$backend$H2OJob$$verifyAndGetJob(H2OJob.scala:63)
at ai.h2o.sparkling.backend.H2OJob.waitForFinishAndPrintProgress(H2OJob.scala:32)
at ai.h2o.sparkling.ml.utils.EstimatorCommonUtils.trainAndGetDestinationKey(EstimatorCommonUtils.scala:44)
at ai.h2o.sparkling.ml.utils.EstimatorCommonUtils.trainAndGetDestinationKey$(EstimatorCommonUtils.scala:30)
at ai.h2o.sparkling.ml.algos.H2OEstimator.trainAndGetDestinationKey(H2OEstimator.scala:16)
at ai.h2o.sparkling.ml.algos.H2OEstimator.trainH2OModel(H2OEstimator.scala:51)
at ai.h2o.sparkling.ml.algos.H2OEstimator.fit(H2OEstimator.scala:36)
at ai.h2o.sparkling.ml.algos.H2OAlgorithm.fit(H2OAlgorithm.scala:41)
at ai.h2o.sparkling.ml.algos.H2OSupervisedAlgorithm.fit(H2OSupervisedAlgorithm.scala:57)
at ai.h2o.sparkling.ml.algos.H2OTreeBasedSupervisedAlgorithm.fit(H2OTreeBasedSupervisedAlgorithm.scala:30)
at ai.h2o.sparkling.ml.algos.H2OXGBoost.fit(H2OXGBoost.scala:35)
at ai.h2o.sparkling.ml.algos.H2OXGBoost.fit(H2OXGBoost.scala:27)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.net.SocketException: Connection reset
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at sun.net.www.protocol.http.HttpURLConnection$10.run(HttpURLConnection.java:1952)
at sun.net.www.protocol.http.HttpURLConnection$10.run(HttpURLConnection.java:1947)
at java.security.AccessController.doPrivileged(Native Method)
at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1946)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1516)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1500)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
at ai.h2o.sparkling.backend.utils.RestCommunication.$anonfun$checkResponseCode$1(RestCommunication.scala:398)
at scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23)
at scala.util.Try$.apply(Try.scala:213)
at ai.h2o.sparkling.backend.utils.RestCommunication.retry(RestCommunication.scala:439)
at ai.h2o.sparkling.backend.utils.RestCommunication.checkResponseCode(RestCommunication.scala:398)
at ai.h2o.sparkling.backend.utils.RestCommunication.checkResponseCode$(RestCommunication.scala:394)
at ai.h2o.sparkling.backend.H2OJob$.checkResponseCode(H2OJob.scala:54)
at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent(RestCommunication.scala:386)
... 31 more
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:743)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:702)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1595)
... 41 more
Exception in thread "Thread-136" ai.h2o.sparkling.backend.exceptions.H2OClusterNotReachableException: H2O cluster 10.192.80.25:54321 - sparkling-water-root_application_1660078545425_0002 is not reachable,
H2OContext has been closed! Please create a new H2OContext to a healthy and reachable (web enabled)
H2O cluster.
at ai.h2o.sparkling.H2OContext$$anon$2.run(H2OContext.scala:382)
Caused by: ai.h2o.sparkling.backend.exceptions.RestApiNotReachableException: H2O node http://10.192.80.25:54321/ is not reachable.
Please verify that you are passing ip and port of existing cluster node and the cluster
is running with web enabled.
at ai.h2o.sparkling.backend.utils.RestCommunication.throwRestApiNotReachableException(RestCommunication.scala:433)
at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent(RestCommunication.scala:390)
at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent$(RestCommunication.scala:370)
at ai.h2o.sparkling.backend.utils.RestApiUtils$.readURLContent(RestApiUtils.scala:96)
at ai.h2o.sparkling.backend.utils.RestCommunication.request(RestCommunication.scala:182)
at ai.h2o.sparkling.backend.utils.RestCommunication.request$(RestCommunication.scala:172)
at ai.h2o.sparkling.backend.utils.RestApiUtils$.request(RestApiUtils.scala:96)
at ai.h2o.sparkling.backend.utils.RestCommunication.query(RestCommunication.scala:67)
at ai.h2o.sparkling.backend.utils.RestCommunication.query$(RestCommunication.scala:59)
at ai.h2o.sparkling.backend.utils.RestApiUtils$.query(RestApiUtils.scala:96)
at ai.h2o.sparkling.backend.utils.RestApiUtils.getPingInfo(RestApiUtils.scala:32)
at ai.h2o.sparkling.backend.utils.RestApiUtils.getPingInfo$(RestApiUtils.scala:30)
at ai.h2o.sparkling.backend.utils.RestApiUtils$.getPingInfo(RestApiUtils.scala:96)
at ai.h2o.sparkling.H2OContext.ai$h2o$sparkling$H2OContext$$getSparklingWaterHeartbeatEvent(H2OContext.scala:344)
at ai.h2o.sparkling.H2OContext$$anon$2.run(H2OContext.scala:356)
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at sun.net.www.protocol.http.HttpURLConnection$10.run(HttpURLConnection.java:1952)
at sun.net.www.protocol.http.HttpURLConnection$10.run(HttpURLConnection.java:1947)
at java.security.AccessController.doPrivileged(Native Method)
at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1946)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1516)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1500)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
at ai.h2o.sparkling.backend.utils.RestCommunication.$anonfun$checkResponseCode$1(RestCommunication.scala:398)
at scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23)
at scala.util.Try$.apply(Try.scala:213)
at ai.h2o.sparkling.backend.utils.RestCommunication.retry(RestCommunication.scala:439)
at ai.h2o.sparkling.backend.utils.RestCommunication.checkResponseCode(RestCommunication.scala:398)
at ai.h2o.sparkling.backend.utils.RestCommunication.checkResponseCode$(RestCommunication.scala:394)
at ai.h2o.sparkling.backend.utils.RestApiUtils$.checkResponseCode(RestApiUtils.scala:96)
at ai.h2o.sparkling.backend.utils.RestCommunication.readURLContent(RestCommunication.scala:386)
... 13 more
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:607)
at java.net.Socket.connect(Socket.java:556)
at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
at sun.net.www.http.HttpClient.New(HttpClient.java:339)
at sun.net.www.http.HttpClient.New(HttpClient.java:357)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1228)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1162)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1056)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:990)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1572)
... 23 more
Hi @cliu-sift,

> illuminate me a little bit more on why the error occurred?

The error doesn't happen during execution of the XGBoost algorithm, but in the last step of the conversion from a Spark DataFrame to an H2OFrame. That last step asks the H2O backend for the schema of the H2O frame, including metadata. The current implementation returns metadata with histogram bins for each column. The array of histogram bins contains a value for each categorical level, so the final schema becomes significantly bigger and fails on serialization to a byte array. You would get the same error with a different algorithm as well. Histogram bins are in fact not needed in SW, so we can get rid of them.
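A rough back-of-envelope check makes the explanation concrete (the per-level byte size and column count below are assumptions for illustration, not actual H2O internals):

```python
# The JVM cannot allocate an array longer than about 2**31 - 1 elements
# (bytes, for a byte[]), which is what "Requested array size exceeds
# VM limit" refers to.
JVM_MAX_ARRAY = 2**31 - 1

levels_per_column = 3_000_000    # cardinality reported in the question
bytes_per_level = 40             # assumed: encoded level label + bin value
n_high_card_columns = 20         # assumed count of such columns

# One histogram-bin entry per categorical level per column.
payload = levels_per_column * bytes_per_level * n_high_card_columns
print(f"{payload / 1e9:.1f} GB encoded, exceeds limit: {payload > JVM_MAX_ARRAY}")
```

With histogram bins dropped from the schema, the payload would scale with the number of columns rather than with the number of categorical levels.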
> Do you have any idea when the problem will be solved?

I will share a link to a nightly build artifact once a bug fix gets in. A proper release will likely go out in September.
I see, thank you @mn-mikke! That makes much more sense. Please allow me to ask a follow-up question. I've attached the second type of error message, from when I removed some of the features but still kept a small number of high-cardinality ones. Now the algorithm seems to be running (at least the progress bar shows), but it eventually loses the connection to the cluster as well. Do you think it's still the same problem?
@cliu-sift Regarding the second error, I will need to see the logs from the H2O nodes (full YARN logs) to tell you what went wrong.
yarn logs -applicationId <Application ID>
I've extracted the full log. Please feel free to let me know if you have any questions @mn-mikke. Thank you!
There is a problem with serialization of XGBoostModelInfo, which for some reason is too big:
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.lang.StringCoding.encode(StringCoding.java:350)
at java.lang.String.getBytes(String.java:941)
at water.util.StringUtils.bytesOf(StringUtils.java:197)
at water.AutoBuffer.putStr(AutoBuffer.java:1527)
at hex.tree.xgboost.XGBoostModelInfo$Icer.write163(XGBoostModelInfo$Icer.java)
at hex.tree.xgboost.XGBoostModelInfo$Icer.write(XGBoostModelInfo$Icer.java)
at water.Iced.write(Iced.java:61)
at water.AutoBuffer.put(AutoBuffer.java:781)
at hex.tree.xgboost.matrix.FrameMatrixLoader$Icer.write167(FrameMatrixLoader$Icer.java)
at hex.tree.xgboost.matrix.FrameMatrixLoader$Icer.write(FrameMatrixLoader$Icer.java)
at water.Iced.write(Iced.java:61)
at water.AutoBuffer.put(AutoBuffer.java:781)
at hex.tree.xgboost.task.XGBoostSetupTask$Icer.write164(XGBoostSetupTask$Icer.java)
at hex.tree.xgboost.task.XGBoostSetupTask$Icer.write(XGBoostSetupTask$Icer.java)
at water.H2O$H2OCountedCompleter.write(H2O.java:1716)
at water.AutoBuffer.put(AutoBuffer.java:781)
at water.RPC.call(RPC.java:201)
at water.MRTask.remote_compute(MRTask.java:756)
at water.MRTask.setupLocal0(MRTask.java:716)
at water.MRTask.dfork(MRTask.java:563)
at water.MRTask.doAll(MRTask.java:554)
at water.MRTask.doAllNodes(MRTask.java:568)
at hex.tree.xgboost.task.AbstractXGBoostTask.run(AbstractXGBoostTask.java:45)
at hex.tree.xgboost.exec.LocalXGBoostExecutor.setup(LocalXGBoostExecutor.java:98)
at hex.tree.xgboost.XGBoost$XGBoostDriver.buildModelImpl(XGBoost.java:452)
at hex.tree.xgboost.XGBoost$XGBoostDriver.buildModel(XGBoost.java:407)
at hex.tree.xgboost.XGBoost$XGBoostDriver.computeImpl(XGBoost.java:393)
at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:252)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1677)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:976)
@valenad1 @michalkurka Any idea how to reduce the size of XGBoostModelInfo?
Hey @mn-mikke, I saw you have merged the changes to the master branch. Really appreciate it! Is there any way I could test the changes with your new code? I'm using H2O through Dataproc right now. Also, any chance you could shed some light on the potential issue behind the second error? It seems the cluster eventually gets disconnected.
By the way, I ran the Sparkling Water GLM code today on similar data with high-cardinality features, and the second error happened again. So my guess is that it happens for multiple models (XGBoost, GLM logistic regression, target encoding, etc.) with high-cardinality categorical features. Just in case, this information might be helpful for you in identifying the underlying issue behind the second error @mn-mikke
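Since high-cardinality categoricals appear to trip up several algorithms, one generic mitigation (my own suggestion, not something proposed by the maintainers in this thread) is the hashing trick: map each categorical level into a fixed number of buckets before training, trading a few collisions for a bounded domain. A minimal, self-contained sketch:

```python
import hashlib

def hash_bucket(value: str, num_buckets: int = 100_000) -> int:
    """Deterministically map a categorical level into one of num_buckets buckets.

    Uses md5 rather than Python's built-in hash() so the mapping is stable
    across runs and processes, keeping train/serve encodings consistent.
    """
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

# A 3,000,000-level column collapses to at most num_buckets distinct values.
bucket = hash_bucket("some_rare_user_id_12345")
assert 0 <= bucket < 100_000
```

In PySpark this could be wrapped in a UDF to replace the raw high-cardinality column with its bucketed version before handing the DataFrame to Sparkling Water (hypothetical usage; the bucket count would need tuning against model quality).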
Hi @cliu-sift,
Is there any way I could test the changes with your new code?
You can try this nightly build:
- https://h2o-release.s3.amazonaws.com/sparkling-water/spark-3.0/rel-3.36.1_rel-zumbo/nightly/3.36.1.5-1.5-3.0/index.html for spark 3.0
- https://h2o-release.s3.amazonaws.com/sparkling-water/spark-3.1/rel-3.36.1_rel-zumbo/nightly/3.36.1.5-1.5-3.1/index.html for spark 3.1
- https://h2o-release.s3.amazonaws.com/sparkling-water/spark-3.2/rel-3.36.1_rel-zumbo/nightly/3.36.1.5-1.5-3.2/index.html for spark 3.2
Does the h2o version matter for using this nightly build? I saw there is a new h2o version corresponding to that nightly build, but the Python h2o package doesn't seem to be available yet: https://h2o-release.s3.amazonaws.com/h2o/master/5926/index.html
So I tried to install h2o 3.36.1.4 with the nightly-build Sparkling Water Scala libraries you shared. Not sure whether that's okay or not.
@mn-mikke I think your changes worked to some extent, as I was now able to see the h2o progress bar, but the job eventually failed and the whole cluster crashed (like the second error I mentioned before).
The full yarn log for that application is attached:
cliu_h2o_yarn_logs2.log
By the way, if it's helpful, I can try to generate a "fake" dataset with simulated values for each column for you. But it might be very large if I save it to a csv file.
Closing the issue due to long inactivity; please reopen if still needed.