
Use it with popular services

mycaule opened this issue 7 years ago · 10 comments

Hello,

It would be nice if you could provide instructions on how to use it with AWS (AWS EMR, Flintrock on EC2) or GCP (Google Cloud Dataproc), and how to use it from IntelliJ as well.

This could be a great CLI alternative to Zeppelin.

mycaule avatar Sep 22 '18 19:09 mycaule

FYI, this gist lists commands to get ammonite-spark up and running on an EMR cluster. I'd like to turn it into an actual tutorial, but I haven't found the time to do that yet.

alexarchambault avatar Nov 13 '18 09:11 alexarchambault

cc @mpacer who was also interested in that (link in my previous comment)

alexarchambault avatar Nov 13 '18 09:11 alexarchambault

Thank you very much.

The script works on my side.

I found Heather Miller's tutorial on Flintrock + S3 quite good; it could be a useful reference if you write a tutorial from the gist one day.

IMPORTANT!: Before we can go any further, we need to ensure that a handful of specific dependencies for S3 are available on our Spark cluster. As of the time of writing, this is a workaround, which can be solved by downloading a recent Hadoop 2.7.x distribution and a specific, older version of an AWS JAR (1.7.4) that is typically not available in the EC2 Maven Repository.

  • The hadoop-aws-2.7.2.jar and aws-java-sdk-1.7.4.jar dependencies she mentions can be quite tricky. It would be nice to have sample code with a read from Amazon S3 as well: the classpath problem looks similar to this PR https://github.com/alexarchambault/ammonite-spark/pull/7
  • It would be nice if you provided examples of how to read .amm files too
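On the second point: Ammonite can load script files into a session via import $file. A minimal sketch (note that Ammonite scripts conventionally use the .sc extension, and the script name here is hypothetical):

@ import $file.setup      // loads ./setup.sc into the session
@ setup.helper()          // hypothetical function defined in setup.sc

Definitions in the script become available under its name after the import.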

mycaule avatar Nov 13 '18 09:11 mycaule

I get this error message when trying to read Parquet on S3 :

@ spark.read.parquet("s3a://bucket/path/to/parquet") 
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
  org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2154)
  org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2580)
  org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593)
  org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)

This import wasn't enough:

@ import $ivy.`org.apache.hadoop:hadoop-aws:2.8.4`, $ivy.`com.amazonaws:aws-java-sdk-s3:1.11.336`, $ivy.`com.amazonaws:aws-java-sdk-emr:1.11.336`

mycaule avatar Nov 13 '18 11:11 mycaule

@mycaule Did you add the extra dependencies before creating the Spark session? (Or else, did you call AmmoniteSparkSession.sync() afterwards?)
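A minimal sketch of that ordering (versions are illustrative, taken from the imports above):

@ import $ivy.`org.apache.hadoop:hadoop-aws:2.8.4`          // add S3 deps first
@ val spark = AmmoniteSparkSession.builder().master("yarn").getOrCreate()
@ // or, if a dependency was only added after getOrCreate(),
@ // push it to the executors explicitly:
@ AmmoniteSparkSession.sync()

Dependencies added before the session is built are picked up automatically; sync() propagates any that were added later.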

alexarchambault avatar Nov 13 '18 12:11 alexarchambault

I added them after; I'll try this afternoon to add them before, or use sync. Thanks.

mycaule avatar Nov 13 '18 13:11 mycaule

After adding imports at the correct place,

@ import $ivy.`com.sun.jersey:jersey-client:1.9.1`, $ivy.`org.apache.spark::spark-sql:2.3.1`, $ivy.`sh.almond::ammonite-spark:0.1.1`
@ import $ivy.`org.apache.hadoop:hadoop-aws:2.8.4`, $ivy.`com.amazonaws:aws-java-sdk-s3:1.11.336`, $ivy.`com.amazonaws:aws-java-sdk-emr:1.11.336`
@ val spark = {
             AmmoniteSparkSession.builder()
               .progressBars()
               .master("yarn")
               .config("spark.executor.instances", "4")
               .config("spark.executor.memory", "2g")
               .getOrCreate()
           }
@ ...

... I get another error now, making progress...

I am using EMR 5.16 with the latest versions available and supported by the platform.

@ spark.read.parquet("s3a://bucket/path/to/parquet") 
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
  org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:194)
  org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
  org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
  org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
  org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2598)
  org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
  org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
  org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
  org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
  org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
  org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
  org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
  org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
  org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
  org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:622)
  org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:606)

And without s3a:

@ spark.read.parquet("s3://bucket/path/to/parquet") 
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
  org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2154)
  org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2580)

The same command works fine in the default spark-shell available with EMR (without imports).

scala> spark.read.parquet("s3://bucket/path/to/parquet")
18/11/13 17:05:56 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
18/11/13 17:05:56 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
18/11/13 17:05:57 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
res0: org.apache.spark.sql.DataFrame = [id: int, idannonce: int ... 198 more fields]

mycaule avatar Nov 13 '18 16:11 mycaule

To solve the Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found error, two related issues tell us to add these directories to the classpath:

/usr/share/aws/emr/emrfs/lib/*
/usr/share/aws/emr/emrfs/auxlib/*
/usr/share/aws/emr/emr-metrics/lib/*
/usr/share/aws/emr/emrfs/conf

or via an EMR spark-defaults configuration classification:

[
  {
    "classification":"spark-defaults",
    "properties": {
      "spark.executor.extraClassPath": "/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*",
      "spark.driver.extraClassPath": "/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*"
    }
  }
]
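Alternatively, the same directories could presumably be passed straight to the session builder; an untested sketch, assuming the EMRFS paths above exist on the cluster:

@ val spark = {
             AmmoniteSparkSession.builder()
               .master("yarn")
               .config("spark.driver.extraClassPath", "/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*")
               .config("spark.executor.extraClassPath", "/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*")
               .getOrCreate()
           }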

https://github.com/spark-notebook/spark-notebook/issues/368
https://forums.aws.amazon.com/thread.jspa?messageID=699917

https://github.com/alexarchambault/ammonite-spark/blob/develop/INTERNALS.md

mycaule avatar Nov 16 '18 07:11 mycaule

@mycaule I am getting the same java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong error when trying to access S3 from spark notebook.

I think it's a Spark/Hadoop version mismatch, though.

kyprifog avatar Mar 26 '20 14:03 kyprifog

Downgrading to

import $ivy.`org.apache.hadoop:hadoop-aws:2.7.5`

fixed the issue for me (using Spark 2.4.2).

kyprifog avatar Mar 26 '20 14:03 kyprifog