
Spark 2.4 / Scala 2.12: build does not work - Google Cloud Dataproc integration

Open bradmiro opened this issue 4 years ago • 11 comments

Providing us with the observed and expected behavior definitely helps. Giving us the following information also helps:

  • Sparkling Water/PySparkling/RSparkling version: 3.30.0.7-1
  • Hadoop Version & Distribution: 2.10.0 (Dataproc 1.5-ubuntu)

Please also provide us with the full and minimal reproducible code.


I am looking to update the installation script for H2O on Google Cloud Dataproc.

The tests for this script pass on Dataproc 1.4-Ubuntu (Spark 2.4.5 / Scala 2.11) and (preview) Dataproc 2.0-ubuntu (Spark 3.0.0 / Scala 2.12).

However, I am attempting to support this on Dataproc 1.5 (Spark 2.4.5 / Scala 2.12), where the current test fails with the following error:

Traceback (most recent call last):
  File "/tmp/6c49db3c27b748509e6683b736ea968d/sample-script.py", line 6, in <module>
    hc = H2OContext.getOrCreate()
  File "/opt/conda/default/lib/python3.7/site-packages/ai/h2o/sparkling/H2OContext.py", line 89, in getOrCreate
    selected_conf = H2OConf()
  File "/opt/conda/default/lib/python3.7/site-packages/ai/h2o/sparkling/H2OConf.py", line 35, in __init__
    self._jconf = _jvm.org.apache.spark.h2o.H2OConf()
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1525, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.h2o.H2OConf.
: java.lang.NoClassDefFoundError: org/apache/spark/internal/Logging$class
	at ai.h2o.sparkling.utils.SparkSessionUtils$.<init>(SparkSessionUtils.scala:32)
	at ai.h2o.sparkling.utils.SparkSessionUtils$.<clinit>(SparkSessionUtils.scala)
	at org.apache.spark.h2o.H2OConf.<init>(H2OConf.scala:55)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:238)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.internal.Logging$class
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	... 14 more
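
A note on reading the trace above: in Scala 2.11, a trait with concrete members is compiled to an interface plus a helper class named `<Trait>$class`, while Scala 2.12 compiles traits to interfaces with default methods, so a Spark built for Scala 2.12 has no `Logging$class`. A missing `...$class` is therefore usually a sign that a library compiled for Scala 2.11 is on the classpath. The sketch below is a minimal heuristic for spotting that pattern in a trace. The function names are hypothetical, and a positive result does not say which jar is at fault.

```python
import re
from typing import Optional

# Matches the class named in a JVM "NoClassDefFoundError" or
# "ClassNotFoundException" line, with either '/' or '.' as separator.
_MISSING_CLASS = re.compile(
    r"(?:NoClassDefFoundError|ClassNotFoundException):\s*(?P<name>[\w./$]+)"
)


def missing_class(trace: str) -> Optional[str]:
    """Return the first missing class named in a JVM stack trace, if any."""
    match = _MISSING_CLASS.search(trace)
    return match.group("name").replace("/", ".") if match else None


def looks_like_scala_211_binary(trace: str) -> bool:
    """Heuristic: a missing 'Foo$class' points at Scala 2.11 trait encoding."""
    name = missing_class(trace)
    return name is not None and name.endswith("$class")


if __name__ == "__main__":
    line = "java.lang.NoClassDefFoundError: org/apache/spark/internal/Logging$class"
    print(missing_class(line))
    print(looks_like_scala_211_binary(line))
```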

My guess is that this fails because of the differing Scala versions, and because no release is available for Spark 2.4 / Scala 2.12. As such, I am attempting to build sparkling-water with Scala 2.12 and Spark 2.4.5 on the cluster. I did a preliminary build to test this with the following:

git clone --branch rel-3.30 https://github.com/h2oai/sparkling-water.git
cd sparkling-water
./gradlew clean dist -PscalaBaseVersion=2.12

When I run ls ./dist/build/dist, I see a jar file was created for 3.0, not 2.4. I tried to see if this would work anyway by leveraging the following full installation code:

H2O_SPARKLING_WATER_VERSION=3.30.0.7-1
local tmp_dir
tmp_dir=$(mktemp -d -t init-action-h2o-XXXX)

git clone --branch rel-3.30 https://github.com/h2oai/sparkling-water.git ${tmp_dir}/sparkling-water
${tmp_dir}/sparkling-water/gradlew -p ${tmp_dir}/sparkling-water clean dist -PscalaBaseVersion=2.12

unzip -q "${tmp_dir}/sparkling-water/dist/build/dist/sparkling-water-${H2O_SPARKLING_WATER_VERSION}-3.0.zip" -d /usr/lib/
ln -s "/usr/lib/sparkling-water-${H2O_SPARKLING_WATER_VERSION}-3.0" /usr/lib/sparkling-water

## Fix $TOPDIR variable resolution in Sparkling scripts
sed -i 's|TOPDIR=.*|TOPDIR=$(cd "$(dirname "$(readlink -f "$0")")/.."; pwd)|g' \
  /usr/lib/sparkling-water/bin/sparkling-shell \
  /usr/lib/sparkling-water/bin/pysparkling

## Create symlink entries for default
ln -s /usr/lib/sparkling-water/bin/sparkling-shell /usr/bin/
ln -s /usr/lib/sparkling-water/bin/pysparkling /usr/bin/

This, however, still did not work, with the same error as above.
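
The archive names quoted in this thread appear to encode both the Sparkling Water version and the Spark line the build targeted (`-3.0.zip` here, `-2.4.zip` later on). Assuming that pattern holds, an install script could check the target before unpacking. The following is a minimal sketch; the regular expression is inferred only from the names seen in this thread, and `parse_dist_archive` and `targets_spark` are hypothetical helpers, not part of Sparkling Water.

```python
import re
from typing import NamedTuple, Optional


class DistArchive(NamedTuple):
    sparkling_water: str  # e.g. "3.30.0.7-1"
    spark: str            # e.g. "3.0" or "2.4"


# Assumed pattern: sparkling-water-<sw version>-<spark major.minor>.zip
_DIST_NAME = re.compile(
    r"^sparkling-water-(?P<sw>\d+(?:\.\d+)+-\d+)-(?P<spark>\d+\.\d+)\.zip$"
)


def parse_dist_archive(filename: str) -> Optional[DistArchive]:
    """Split a dist archive name into its version parts, or return None."""
    match = _DIST_NAME.match(filename)
    if match is None:
        return None
    return DistArchive(match.group("sw"), match.group("spark"))


def targets_spark(filename: str, wanted: str) -> bool:
    """True if the archive name says it was built for Spark `wanted`."""
    parsed = parse_dist_archive(filename)
    return parsed is not None and parsed.spark == wanted
```

With this, `targets_spark("sparkling-water-3.30.0.7-1-3.0.zip", "2.4")` is false, which matches the mismatch described above.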

bradmiro avatar Jul 21 '20 18:07 bradmiro

Hi @bradmiro, if you want to build Sparkling Water on your own, I recommend checking out a release tag: git checkout RELEASE-3.30.0.6-1. If you use a rel-* branch, you may have trouble building it locally, since the branch might point to an unreleased H2O version.

To build SW for Spark 2.4 and Scala 2.12, try running: ./gradlew dist -PscalaBaseVersion=2.12 -Pspark=2.4

If you hit any problem, let us know.

Thanks, Marek

mn-mikke avatar Jul 22 '20 11:07 mn-mikke

Thanks for your response @mn-mikke. By following your suggestion, I do now see ./sparkling-water/dist/build/dist/sparkling-water-3.30.0.6-1-2.4.zip. However, unfortunately hc = H2OContext.getOrCreate() still generates the same error as above.
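
Since the error is unchanged after the rebuild, it may help to check which Scala version a given jar was actually compiled for. Jars compiled with Scala 2.11 normally contain trait helper entries ending in `$class.class`, which Scala 2.12 output does not. The sketch below scans a jar for such entries. It is only a heuristic, the function names are hypothetical, and which jar to inspect is a guess: the traceback loads `ai/h2o/sparkling` from `site-packages`, so any jar bundled with that Python package may be worth checking in addition to the build output.

```python
import zipfile
from typing import List


def trait_impl_classes(jar_path: str, limit: int = 5) -> List[str]:
    """Return up to `limit` entries that look like Scala 2.11 trait helpers."""
    found: List[str] = []
    with zipfile.ZipFile(jar_path) as jar:
        for name in jar.namelist():
            if name.endswith("$class.class"):
                found.append(name)
                if len(found) >= limit:
                    break
    return found


def compiled_for_scala_211(jar_path: str) -> bool:
    """Heuristic: any '$class.class' entry suggests a Scala 2.11 build."""
    return bool(trait_impl_classes(jar_path, limit=1))
```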

bradmiro avatar Jul 24 '20 02:07 bradmiro

@bradmiro I have tried the code and I might have actually discovered a bug. We don't store the Scala base version in the resulting gradle.properties file. This bug, however, leads to a different error, but it is good to fix anyway.

You mentioned that you are trying to use Spark 2.4.5 with Sparkling Water built for Scala 2.12. Spark 2.4.5 is by default built with Scala 2.11. Are you using the default Spark or have you built your own Spark 2.4.5 with Scala 2.12?

Thanks, Kuba

jakubhava avatar Jul 24 '20 06:07 jakubhava

I have merged the fix for the bug I discovered into the release branch. I'm not sure whether it caused your errors, but could you please try to build Sparkling Water again with:

git clone --branch rel-3.30 https://github.com/h2oai/sparkling-water.git
cd sparkling-water
./gradlew clean dist -PscalaBaseVersion=2.12

and run it?

jakubhava avatar Jul 24 '20 06:07 jakubhava

Are you using the default Spark or have you built your own Spark 2.4.5 with Scala 2.12?

@jakubhava See https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-1.5. From the version grid, it looks like Spark 2.4.5 is built for Scala 2.12 there.
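
The Scala version can also be confirmed on the cluster itself, because a Spark distribution ships a `scala-library-<version>.jar` alongside its other jars. The following is a minimal sketch that reads the Scala binary version from that file name. The directory `/usr/lib/spark/jars` is an assumption based on the `/usr/lib/spark` paths in the traceback, and `scala_binary_version` is a hypothetical helper.

```python
import re
from pathlib import Path
from typing import Optional

_SCALA_LIBRARY = re.compile(r"^scala-library-(\d+)\.(\d+)\.\d+.*\.jar$")


def scala_binary_version(jars_dir: str) -> Optional[str]:
    """Return e.g. '2.12' from a 'scala-library-2.12.10.jar' in jars_dir."""
    for jar in sorted(Path(jars_dir).glob("scala-library-*.jar")):
        match = _SCALA_LIBRARY.match(jar.name)
        if match:
            return f"{match.group(1)}.{match.group(2)}"
    return None


if __name__ == "__main__":
    # Assumed location; adjust to wherever Spark's jars live on the cluster.
    print(scala_binary_version("/usr/lib/spark/jars"))
```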

mn-mikke avatar Jul 24 '20 08:07 mn-mikke

Thanks @jakubhava, I tried your suggestion, both with -Pspark=2.4 and without, and unfortunately the error persists. And @mn-mikke is correct in that Spark 2.4.5 / Scala 2.12 is tied to this specific Dataproc image version (1.5).

bradmiro avatar Jul 24 '20 18:07 bradmiro

@bradmiro could you please try on the latest code with the following command:

git clone --branch rel-3.30.1 https://github.com/h2oai/sparkling-water.git
cd sparkling-water
./gradlew clean dist -PscalaBaseVersion=2.12 -Pspark=2.4 -PsparkVersion=2.4.5

Please let us know of the outcome.

I have built Spark 2.4.5 with Scala 2.12 from scratch and haven't bumped into any issues so far.

jakubhava avatar Jul 28 '20 07:07 jakubhava

@bradmiro Do you still have a problem with running Sparkling Water on Dataproc 1.5 or can we close the issue?

mn-mikke avatar Aug 24 '20 15:08 mn-mikke

Hey @mn-mikke, apologies here. I just tried @jakubhava's suggestion, which unfortunately didn't work. I tried with release 3.30.1.1-1 and Spark 2.4.6, which is now available on Dataproc 1.5 (up from 2.4.5). I'll keep looking to see if I can find anything wrong with our setup.

bradmiro avatar Aug 24 '20 22:08 bradmiro

Hi @bradmiro, do you still have a problem running Sparkling Water on Dataproc 1.5?

mn-mikke avatar Sep 22 '20 12:09 mn-mikke

Hey @mn-mikke, we ended up not finding a sensible solution on our side, so we now just advise users to skip this version of Dataproc, which should be fine. Thanks again to you and @jakubhava for your help!

bradmiro avatar Sep 29 '20 18:09 bradmiro