Spark 2.4 / Scala 2.12: build does not work - Google Cloud Dataproc integration
Providing us with the observed and expected behavior definitely helps, as does the following information:
- Sparkling Water/PySparkling/RSparkling version: 3.30.0.7-1
- Hadoop Version & Distribution: 2.10.0 (Dataproc 1.5-ubuntu)
Please also provide full, minimal reproducible code.
I am looking to update the installation script for H2O on Google Cloud Dataproc.
The tests for this script pass on Dataproc 1.4-Ubuntu (Spark 2.4.5 / Scala 2.11) and (preview) Dataproc 2.0-ubuntu (Spark 3.0.0 / Scala 2.12).
However, I am attempting to support this on Dataproc 1.5 (Spark 2.4.5 / Scala 2.12), as the current test fails with the following error:
Traceback (most recent call last):
File "/tmp/6c49db3c27b748509e6683b736ea968d/sample-script.py", line 6, in <module>
hc = H2OContext.getOrCreate()
File "/opt/conda/default/lib/python3.7/site-packages/ai/h2o/sparkling/H2OContext.py", line 89, in getOrCreate
selected_conf = H2OConf()
File "/opt/conda/default/lib/python3.7/site-packages/ai/h2o/sparkling/H2OConf.py", line 35, in __init__
self._jconf = _jvm.org.apache.spark.h2o.H2OConf()
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1525, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.h2o.H2OConf.
: java.lang.NoClassDefFoundError: org/apache/spark/internal/Logging$class
at ai.h2o.sparkling.utils.SparkSessionUtils$.<init>(SparkSessionUtils.scala:32)
at ai.h2o.sparkling.utils.SparkSessionUtils$.<clinit>(SparkSessionUtils.scala)
at org.apache.spark.h2o.H2OConf.<init>(H2OConf.scala:55)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.internal.Logging$class
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 14 more
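For context on the stack trace: `org.apache.spark.internal.Logging$class` is an artifact of the Scala 2.11 trait encoding, which compiles trait method bodies into a companion `Trait$class` class file; Scala 2.12 replaces these with Java 8 default interface methods, so jars built against 2.11 fail on a 2.12 runtime with exactly this `NoClassDefFoundError`. A quick way to see which encoding a jar uses is to scan it for `$class` entries. This is a hypothetical helper sketch, not part of any official tooling:

```python
import zipfile

def scala211_trait_entries(jar):
    """List class-file entries that use the Scala 2.11 trait encoding.

    Scala 2.11 compiles each trait with concrete methods into an extra
    `Trait$class` class file; Scala 2.12 emits default interface methods
    instead, so those files disappear. Many `$class.class` entries mean
    the jar was built against Scala 2.11.
    """
    with zipfile.ZipFile(jar) as zf:
        return [name for name in zf.namelist()
                if name.endswith("$class.class")]
```

Running this against the Sparkling Water assembly jar on the cluster would show whether 2.11-encoded classes ended up on a 2.12 classpath.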
My guess is that this failure is related to the differing Scala versions and the lack of a release for Spark 2.4 / Scala 2.12. As such, I am attempting to build sparkling-water with Scala 2.12 and Spark 2.4.5 on the cluster. I did a preliminary build to test this with the following:
git clone --branch rel-3.30 https://github.com/h2oai/sparkling-water.git
cd sparkling-water
./gradlew clean dist -PscalaBaseVersion=2.12
When I run ls ./dist/build/dist, I see that a jar file was created for Spark 3.0, not 2.4. I tried to see whether this would work anyway by leveraging the following full installation code:
H2O_SPARKLING_WATER_VERSION=3.30.0.7-1
local tmp_dir
tmp_dir=$(mktemp -d -t init-action-h2o-XXXX)
git clone --branch rel-3.30 https://github.com/h2oai/sparkling-water.git ${tmp_dir}/sparkling-water
${tmp_dir}/sparkling-water/gradlew -p ${tmp_dir}/sparkling-water clean dist -PscalaBaseVersion=2.12
unzip -q "${tmp_dir}/sparkling-water/dist/build/dist/sparkling-water-${H2O_SPARKLING_WATER_VERSION}-3.0.zip" -d /usr/lib/
ln -s "/usr/lib/sparkling-water-${H2O_SPARKLING_WATER_VERSION}-3.0" /usr/lib/sparkling-water
## Fix $TOPDIR variable resolution in Sparkling scripts
sed -i 's|TOPDIR=.*|TOPDIR=$(cd "$(dirname "$(readlink -f "$0")")/.."; pwd)|g' \
/usr/lib/sparkling-water/bin/sparkling-shell \
/usr/lib/sparkling-water/bin/pysparkling
## Create Symlink entries for default
ln -s /usr/lib/sparkling-water/bin/sparkling-shell /usr/bin/
ln -s /usr/lib/sparkling-water/bin/pysparkling /usr/bin/
This, however, still did not work, failing with the same error as above.
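As a side note on the artifact naming: the dist zips follow the pattern `sparkling-water-<SW version>-<Spark line>.zip`, so the `-3.0` suffix above means the build still targeted Spark 3.0 despite the Scala flag. A small illustrative parser makes the convention explicit (`parse_sw_artifact` is a hypothetical helper written for this thread):

```python
import re

def parse_sw_artifact(filename):
    """Split a Sparkling Water dist zip name into (SW version, Spark line).

    Assumes the convention sparkling-water-<SW version>-<Spark line>.zip,
    e.g. 'sparkling-water-3.30.0.7-1-3.0.zip' -> ('3.30.0.7-1', '3.0').
    """
    m = re.fullmatch(
        r"sparkling-water-(\d+\.\d+\.\d+\.\d+-\d+)-(\d+\.\d+)\.zip",
        filename)
    if m is None:
        raise ValueError("unrecognized artifact name: " + filename)
    return m.group(1), m.group(2)
```

Checking the Spark-line component of the built zip against the Dataproc image's Spark version is a cheap guard before unzipping it into `/usr/lib/`.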
Hi @bradmiro,
If you want to build Sparkling Water on your own, I recommend pointing git at a release tag:
git checkout RELEASE-3.30.0.6-1
If you use a rel-* branch, you may have trouble building it locally, since the branch might point to an unreleased H2O version.
To build SW for Spark 2.4 and Scala 2.12, try running:
./gradlew dist -PscalaBaseVersion=2.12 -Pspark=2.4
If you hit any problem, let us know.
Thanks, Marek
Thanks for your response @mn-mikke. Following your suggestion, I now see ./sparkling-water/dist/build/dist/sparkling-water-3.30.0.6-1-2.4.zip. Unfortunately, however, hc = H2OContext.getOrCreate() still generates the same error as above.
@bradmiro I have tried the code and may actually have discovered a bug: we don't store the Scala base version in the resulting gradle.properties file. This bug leads to a different error than yours, but it is good to fix anyway.
You mentioned that you are trying to use Spark 2.4.5 with a Sparkling Water build for Scala 2.12. Spark 2.4.5 is built with Scala 2.11 by default. Are you using the default Spark, or have you built your own Spark 2.4.5 with Scala 2.12?
Thanks, Kuba
I have merged the fix for the discovered bug into the release branch. I'm not sure whether it caused your errors, but could you please try building Sparkling Water again as:
git clone --branch rel-3.30 https://github.com/h2oai/sparkling-water.git
cd sparkling-water
./gradlew clean dist -PscalaBaseVersion=2.12
and run it?
Are you using the default Spark, or have you built your own Spark 2.4.5 with Scala 2.12?
@jakubhava See https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-1.5. From the version grid, it looks like Spark 2.4.5 is built with Scala 2.12.
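A related sanity check: from a PySpark session on the cluster, the JVM's Scala version can be read via the py4j gateway, e.g. `spark.sparkContext._jvm.scala.util.Properties.versionNumberString()` (a real Scala API, though calling it needs a live Spark session, so it is only mentioned here). What artifacts must agree on is the Scala *binary* version, which a tiny helper can extract; `scala_binary_version` is a hypothetical name for this sketch:

```python
def scala_binary_version(full_version):
    """Reduce a full Scala version such as '2.12.10' to its binary
    version '2.12'. Scala minor lines (2.11 vs 2.12) are not binary
    compatible with each other, so compiled artifacts must match on
    exactly this prefix.
    """
    major, minor = full_version.split(".")[:2]
    return major + "." + minor
```

Comparing this value from the running cluster against the Spark/Scala line the Sparkling Water zip was built for would confirm or rule out the mismatch theory.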
Thanks @jakubhava, I tried your suggestion, both with -Pspark=2.4 and without, and unfortunately the error persists. @mn-mikke is correct that Spark 2.4.5 / Scala 2.12 is tied to this specific Dataproc image version (1.5).
@bradmiro could you please try on the latest code with the following command:
git clone --branch rel-3.30.1 https://github.com/h2oai/sparkling-water.git
cd sparkling-water
./gradlew clean dist -PscalaBaseVersion=2.12 -Pspark=2.4 -PsparkVersion=2.4.5
Please let us know the outcome.
I have built Spark 2.4.5 with Scala 2.12 from scratch and haven't bumped into any issues so far.
@bradmiro Do you still have a problem with running Sparkling Water on Dataproc 1.5 or can we close the issue?
Hey @mn-mikke, apologies for the delay. I just tried @jakubhava's suggestion, which unfortunately didn't work. I tried with release 3.30.1.1-1 and Spark 2.4.6, which is now available on Dataproc 1.5 (up from 2.4.5). I'll keep looking for anything wrong with our setup.
Hi @bradmiro, do you still have problems running Sparkling Water on Dataproc 1.5?
Hey @mn-mikke, we couldn't find a sensible solution on our side, so we now just advise users to skip this version of Dataproc, which should be fine. Thanks again to you and @jakubhava for your help!