
spark hive support

Open schlichtanders opened this issue 7 years ago • 14 comments

Adapting the spark2 example found in the repository to include spark-hive, I unfortunately run into the following error:

import $ivy.{
  `org.apache.spark::spark-core:2.3.0`,
  `org.apache.spark::spark-sql:2.3.0`,
  `org.apache.spark::spark-hive:2.3.0`
}
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.master("local[*]").appName("test").enableHiveSupport.getOrCreate
java.lang.NumberFormatException: For input string: "${pom"
  java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
  java.lang.Integer.parseInt(Integer.java:569)
  java.lang.Integer.parseInt(Integer.java:615)
  org.apache.hadoop.hive.shims.ShimLoader.getMajorVersion(ShimLoader.java:168)
  org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:139)
  org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:100)
  org.apache.hadoop.hive.conf.HiveConf$ConfVars.<clinit>(HiveConf.java:368)
  org.apache.hadoop.hive.conf.HiveConf.<clinit>(HiveConf.java:105)
  java.lang.Class.forName0(Native Method)
  java.lang.Class.forName(Class.java:348)
  org.apache.spark.util.Utils$.classForName(Utils.scala:235)
  org.apache.spark.sql.SparkSession$.hiveClassesArePresent(SparkSession.scala:1074)
  org.apache.spark.sql.SparkSession$Builder.enableHiveSupport(SparkSession.scala:862)
  ammonite.$sess.cmd3$.<init>(cmd3.sc:1)
  ammonite.$sess.cmd3$.<clinit>(cmd3.sc)

Note that nothing was actually executed; the error appears as soon as Spark is instantiated with Hive support. (The same error appears with Spark 2.1.0.)

Versions: Ammonite Repl 1.0.5 (Scala 2.11.12, Java 1.8.0_161), CentOS 7

schlichtanders avatar Mar 08 '18 08:03 schlichtanders

Just to chime in, I've also had no luck getting Hive to work with Ammonite.

AdrielVelazquez avatar May 29 '18 19:05 AdrielVelazquez

spark-hive now works fine with ammonite-spark for me.

alexarchambault avatar Jul 31 '18 09:07 alexarchambault

@alexarchambault I am using ammonite-spark, but I still hit the same issue as @schlichtanders. This is on an AWS EMR cluster; spark-shell works fine.

{
  import org.apache.spark.sql._
  val spark = AmmoniteSparkSession.builder
    .progressBars()
    .master("yarn")
    .config("spark.logConf", "true")
    .config("spark.executor.instances", "4")
    .config("spark.executor.memory", "2g")
    .enableHiveSupport()
    .getOrCreate()
}
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
java.lang.NumberFormatException: For input string: "${pom"
  java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
  java.lang.Integer.parseInt(Integer.java:569)
  java.lang.Integer.parseInt(Integer.java:615)
  org.apache.hadoop.hive.shims.ShimLoader.getMajorVersion(ShimLoader.java:168)
  org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:139)
  org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:100)
  org.apache.hadoop.hive.conf.HiveConf$ConfVars.<clinit>(HiveConf.java:368)
  org.apache.hadoop.hive.conf.HiveConf.<clinit>(HiveConf.java:105)
  java.lang.Class.forName0(Native Method)
  java.lang.Class.forName(Class.java:348)
  org.apache.spark.util.Utils$.classForName(Utils.scala:238)
  org.apache.spark.sql.SparkSession$.hiveClassesArePresent(SparkSession.scala:1078)
  org.apache.spark.sql.SparkSession$Builder.enableHiveSupport(SparkSession.scala:865)
  ammonite.$sess.cmd0$.<init>(cmd0.sc:8)
  ammonite.$sess.cmd0$.<clinit>(cmd0.sc)

dynofu avatar Sep 12 '18 18:09 dynofu

It looks like a Hadoop problem.

hadoop-amm@ import org.apache.hadoop.util.VersionInfo
import org.apache.hadoop.util.VersionInfo

hadoop-amm@ var ver = VersionInfo.getVersion()
ver: String = "${pom.version}"
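
That is exactly what blows up in ShimLoader.getMajorVersion: it parses the first dot-separated component of this version string as an integer. A paraphrase of what the stack trace shows (illustrative Scala, not the actual Hadoop source):

val version = "${pom.version}"          // what VersionInfo.getVersion() returns here
val major   = version.split("\\.")(0)   // "${pom"
Integer.parseInt(major)                 // NumberFormatException: For input string: "${pom"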

dynofu avatar Sep 12 '18 20:09 dynofu

// load the cluster-provided Spark/Hadoop jars directly onto the classpath:
ls.rec! Path("/usr/lib/spark/jars") |? { _.segments.last.endsWith(".jar") } |! { interp.load.cp(_) }
import org.apache.hadoop.util.VersionInfo
val ver = VersionInfo.getVersion()
ver: String = "2.8.3-amzn-1"

but there are still other problems...

dynofu avatar Sep 12 '18 22:09 dynofu

I was able to reproduce that on EMR… It seems to originate from Ammonite adding both main JARs and source JARs to the classpath. This results in two common-version-info.properties resources landing on the classpath, one from the main JAR and one from the sources JAR of org.apache.hadoop:hadoop-common:2.6.5. The one from the sources JAR seems to be a kind of template, hence the "${pom" string.
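
A quick way to confirm the duplicate from an Ammonite session is to list the resources visible on the classpath (a diagnostic sketch; the printed URLs depend on your local cache layout):

import scala.collection.JavaConverters._

val urls = getClass.getClassLoader
  .getResources("common-version-info.properties")
  .asScala
  .toList

urls.foreach(println)
// With both the main and the sources JAR on the classpath, two URLs show
// up here, and Hadoop's VersionInfo may read the templated one first.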

For whatever reason, I also ran into that at work, but I must have changed something in the setup there so that the right common-version-info.properties gets picked up by chance…

alexarchambault avatar Sep 12 '18 22:09 alexarchambault

Ideally, Ammonite shouldn't blindly add source JARs to the classpath this way…

As a quick workaround though, I guess adding source JARs this way could be put behind a flag, so that it could be disabled if necessary.

alexarchambault avatar Sep 12 '18 22:09 alexarchambault

Is there any hope for this on EMR? What about using import $ivy for everything? I started down this path and kept trading one exception for another, then thought maybe I was wasting my time and someone had already tried all of this. The last one was: java.lang.NoSuchMethodError: org.apache.hadoop.io.retry.RetryUtils.getDefaultRetryPolicy.

How can I help?

nbest937 avatar Oct 01 '18 21:10 nbest937

I actually made it work on our EMR cluster. Instead of a gist, I just published my setup; I hope it can be of some help to anybody who is interested.

https://github.com/dyno/ammonite_with_spark_on_emr

dynofu avatar Oct 01 '18 22:10 dynofu

@alexarchambault In ammonite/runtime/tools/IvyThing.resolveArtifact is there a reason you included .addClassifiers(Classifier.sources)? Would removal of this line fix the problem?
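
For reference, that call goes through coursier's high-level fetch API; the kind of request involved looks roughly like this (a simplified sketch, not Ammonite's actual code; exact imports and method names can vary across coursier versions):

import coursier._

// asking for main artifacts plus the "sources" classifier is what also
// brings hadoop-common-2.6.5-sources.jar, whose
// common-version-info.properties is just a template
val files = Fetch()
  .addDependencies(dep"org.apache.hadoop:hadoop-common:2.6.5")
  .withMainArtifacts()
  .addClassifiers(Classifier.sources)
  .run()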

tyrius02 avatar May 24 '19 23:05 tyrius02

Also been hit by this. My workaround is to replace the offending -sources.jar files with an empty zip file:

touch empty
zip empty.zip empty
find $HOME/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/hadoop/ -iname "*-sources.jar" -exec cp empty.zip "{}" \;

Needs to be re-run if new sources are downloaded to the cache, so not perfect but very easy to do.
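
The same blanking can also be done from an Ammonite session (an untested sketch, assuming the default coursier cache location used above):

import java.nio.file.{Files, Paths}
import java.util.zip.{ZipEntry, ZipOutputStream}
import scala.collection.JavaConverters._

val hadoopCache = Paths.get(
  sys.props("user.home"),
  ".cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/hadoop"
)

Files.walk(hadoopCache).iterator.asScala
  .filter(_.toString.endsWith("-sources.jar"))
  .toList
  .foreach { jar =>
    // overwrite the sources JAR with a minimal valid zip, as the shell version does
    val out = new ZipOutputStream(Files.newOutputStream(jar))
    out.putNextEntry(new ZipEntry("empty"))
    out.closeEntry()
    out.close()
  }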

bnalgo avatar Jun 13 '19 08:06 bnalgo

On second thought, removing that line might be a bad idea... Who knows what sort of side effects it would have; plus, it doesn't appear there's a way to include a sources specifier in import $ivy lines. I'm turning my attention to amm/interp/src/main/scala/ammonite/interp/Interpreter.scala. It appears that all of the artifacts fetched by coursier are blindly added to the classpath there. Somewhere in there, source JARs should be excluded from the classpath.
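
The filter being suggested would be tiny; something like this hypothetical helper (illustrative names only, not the actual Interpreter internals):

// keep only the main artifacts before they are added to the classpath
def nonSourceJars(fetched: Seq[java.io.File]): Seq[java.io.File] =
  fetched.filterNot(_.getName.endsWith("-sources.jar"))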

tyrius02 avatar Jun 17 '19 22:06 tyrius02

FYI, with the current nightlies (and in the upcoming releases), it's possible to prevent import $ivy from bringing in sources, with code like this:

interp.resolutionHooks += { fetch =>
  import scala.collection.JavaConverters._
  fetch.withClassifiers(fetch.getClassifiers.asScala.filter(_ != "sources").asJava)
}

(needs to be run in a cell prior to the one doing import $ivy).

alexarchambault avatar Jul 23 '19 14:07 alexarchambault

Thanks for this workaround :) And for everybody: don't forget to put this section first, followed by an @ line, so that Ammonite's resolution behavior is changed before the imports after the @ are resolved:

interp.resolutionHooks += { fetch =>
  // -- This is mandatory with drools >= 7.0.46, because drools sources artifacts also bring a kie.conf (generating a resources conflict),
  // -- and because by default Ammonite also loads sources artifacts
  import scala.jdk.CollectionConverters._
  fetch.withClassifiers(fetch.getClassifiers.asScala.filter(_ != "sources").asJava)
}

@

import $ivy.`fr.janalyse::drools-scripting:1.0.14-SNAPSHOT`, $ivy.`org.scalatest::scalatest:3.2.3`

@lihaoyi it would be great to have an alternative syntax to exclude sources resolution more easily.

dacr avatar Jan 02 '21 10:01 dacr