ammonite-spark
Use it with popular services
Hello,
It would be nice if you could provide instructions on how to use it with AWS (AWS EMR, Flintrock on EC2) or GCP (Google Cloud Dataproc), and how to use it from IntelliJ as well.
This could be a great CLI alternative to Zeppelin.
FYI, this gist lists commands to get ammonite-spark up and running with an EMR cluster. I'd like to turn it into an actual tutorial, but I haven't found the time to do that yet.
cc @mpacer who was also interested in that (link in my previous comment)
Thank you very much.
The script works on my end.
I found Heather Miller's tutorial on Flintrock + S3 quite useful; it could be a nice reference if you ever write a tutorial from the gist.
IMPORTANT!: Before we can go any further, we need to ensure that a handful of specific dependencies for S3 are available on our Spark cluster. As of the time of writing, this is a workaround, which can be solved by downloading a recent Hadoop 2.7.x distribution and a specific, older version of an AWS JAR (1.7.4) that is typically not available in the EC2 Maven Repository.
- The hadoop-aws-2.7.2.jar and aws-java-sdk-1.7.4.jar stuff she mentions can be quite tricky; it would be nice to have sample code doing a read from Amazon S3 as well. The classpath problem looks similar to this PR: https://github.com/alexarchambault/ammonite-spark/pull/7 (a sketch follows this list)
- It would be nice if you provided examples on how to read .amm files too
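A minimal sketch of what that workaround could look like from an Ammonite session, assuming the versions from the tutorial (hadoop-aws 2.7.2, aws-java-sdk 1.7.4) and that the imports run before the Spark session is created:

// pin the Hadoop 2.7.x / AWS SDK 1.7.4 pair the tutorial mentions;
// run these before AmmoniteSparkSession.builder().getOrCreate(), so the
// jars are also shipped to the executors
import $ivy.`org.apache.hadoop:hadoop-aws:2.7.2`
import $ivy.`com.amazonaws:aws-java-sdk:1.7.4`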
I get this error message when trying to read Parquet on S3:
@ spark.read.parquet("s3a://bucket/path/to/parquet")
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2154)
org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2580)
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593)
org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
This import wasn't enough:
@ import $ivy.`org.apache.hadoop:hadoop-aws:2.8.4`, $ivy.`com.amazonaws:aws-java-sdk-s3:1.11.336`, $ivy.`com.amazonaws:aws-java-sdk-emr:1.11.336`
@mycaule Did you add the extra dependencies before creating the Spark session? (or else call AmmoniteSparkSession.sync() after adding them)
I added them after; I'll try this afternoon to add them before, or to use sync. Thanks.
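For reference, a sketch of the two options, using the dependencies from this thread:

// option 1: add the dependencies first, then build the session
import $ivy.`org.apache.hadoop:hadoop-aws:2.8.4`
val spark = AmmoniteSparkSession.builder().master("yarn").getOrCreate()

// option 2: a dependency added after getOrCreate() is only picked up by
// the executors after a sync
import $ivy.`com.amazonaws:aws-java-sdk-s3:1.11.336`
AmmoniteSparkSession.sync()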
After adding the imports in the correct place,
@ import $ivy.`com.sun.jersey:jersey-client:1.9.1`, $ivy.`org.apache.spark::spark-sql:2.3.1`, $ivy.`sh.almond::ammonite-spark:0.1.1`
@ import $ivy.`org.apache.hadoop:hadoop-aws:2.8.4`, $ivy.`com.amazonaws:aws-java-sdk-s3:1.11.336`, $ivy.`com.amazonaws:aws-java-sdk-emr:1.11.336`
@ val spark = {
    AmmoniteSparkSession.builder()
      .progressBars()
      .master("yarn")
      .config("spark.executor.instances", "4")
      .config("spark.executor.memory", "2g")
      .getOrCreate()
  }
@ ...
... I get another error now, but I'm making progress...
I am using EMR 5.16 with the latest versions available and supported by the platform.
@ spark.read.parquet("s3a://bucket/path/to/parquet")
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
")
org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:194)
org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2598)
org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:622)
org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:606)
And without s3a:
@ spark.read.parquet("s3://bucket/path/to/parquet")
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2154)
org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2580)
The same command works fine in the default spark-shell available with EMR (without imports).
scala> spark.read.parquet("s3://bucket/path/to/parquet")
18/11/13 17:05:56 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
18/11/13 17:05:56 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
18/11/13 17:05:57 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
res0: org.apache.spark.sql.DataFrame = [id: int, idannonce: int ... 198 more fields]
To solve the Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found error, two related issues (linked below) tell us to add these to the classpath:
/usr/share/aws/emr/emrfs/lib/*
/usr/share/aws/emr/emrfs/auxlib/*
/usr/share/aws/emr/emr-metrics/lib/*
/usr/share/aws/emr/emrfs/conf
or
[
  {
    "classification": "spark-defaults",
    "properties": {
      "spark.executor.extraClassPath": "/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*",
      "spark.driver.extraClassPath": "/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*"
    }
  }
]
https://github.com/spark-notebook/spark-notebook/issues/368
https://forums.aws.amazon.com/thread.jspa?messageID=699917
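If one wanted to try this from the Ammonite session itself rather than via an EMR configuration classification, an untested sketch using the paths listed above:

val emrfsClasspath = Seq(
  "/usr/share/aws/emr/emrfs/conf",
  "/usr/share/aws/emr/emrfs/lib/*",
  "/usr/share/aws/emr/emrfs/auxlib/*"
).mkString(":")

val spark = {
  AmmoniteSparkSession.builder()
    .master("yarn")
    // assumption: setting extraClassPath at session creation reaches the
    // executors the same way the spark-defaults classification would
    .config("spark.executor.extraClassPath", emrfsClasspath)
    .config("spark.driver.extraClassPath", emrfsClasspath)
    .getOrCreate()
}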
https://github.com/alexarchambault/ammonite-spark/blob/develop/INTERNALS.md
@mycaule I am getting the same
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong
error when trying to access S3 from spark-notebook.
I think it's a Spark/Hadoop version issue though: hadoop-aws 2.8.x appears binary-incompatible with the Hadoop 2.7.x classes already on the cluster.
Downgrading to
import $ivy.`org.apache.hadoop:hadoop-aws:2.7.5`
fixed the issue for me (using Spark 2.4.2).
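For anyone hitting the same wall, a sketch of the combination that worked, assuming Spark 2.4.x on the cluster (hadoop-aws 2.7.x already depends on aws-java-sdk 1.7.4, so no separate SDK import should be needed):

// keep hadoop-aws on the 2.7.x line so it matches the Hadoop classes on the cluster
import $ivy.`org.apache.hadoop:hadoop-aws:2.7.5`
val spark = AmmoniteSparkSession.builder().master("yarn").getOrCreate()
spark.read.parquet("s3a://bucket/path/to/parquet")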