
Compiling

Open saj9191 opened this issue 7 years ago • 11 comments

Hello, I'm trying to install spark on lambda. When I run

./dev/make-distribution.sh --name spark-lambda-2.1.0 --tgz -Phive -Phadoop-2.7 -Dhadoop.version=2.6.0-qds-0.4.13 -DskipTests

The Spark Launcher module (spark-launcher_2.11) fails to build, and I get the following error:

[ERROR] Failed to execute goal on project spark-launcher_2.11: Could not resolve dependencies for project org.apache.spark:spark-launcher_2.11:jar:2.1.0: Failure to find com.hadoop.gplcompression:hadoop-lzo:jar:0.4.19 in https://repo1.maven.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced -> [Help 1]

I tried to explicitly add hadoop-lzo as a dependency in the launcher pom.xml, but I still get the same error. Is there something I need to download or change to get this to work?
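As a side note, the "resolution will not be reattempted" part of the error means Maven has cached the failed lookup in the local repository. A minimal sketch for clearing that cached marker before re-running the build (the path below is an assumption derived from the artifact coordinates com.hadoop.gplcompression:hadoop-lzo:0.4.19 and the default ~/.m2 location):

```shell
# Remove the cached "not found" marker for hadoop-lzo so Maven will
# re-check the remote repositories on the next build attempt.
M2_DIR="${HOME}/.m2/repository/com/hadoop/gplcompression/hadoop-lzo/0.4.19"
rm -rf "$M2_DIR"
echo "$M2_DIR"
# Then re-run the build:
# ./dev/make-distribution.sh --name spark-lambda-2.1.0 --tgz -Phive -Phadoop-2.7 \
#     -Dhadoop.version=2.6.0-qds-0.4.13 -DskipTests
```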

Thanks!

saj9191 avatar Jul 18 '18 23:07 saj9191

Hi saj9191,

It seems like something changed on our side, where we keep the Maven artifacts. We'll fix it and update you here. Thanks for trying it out, and sorry for the inconvenience.

venkata91 avatar Jul 20 '18 03:07 venkata91

I am also having the same issue (also tried adding hadoop-lzo dependency manually to pom.xml with no success). Have there been any updates on resolving this issue?

faromero avatar Sep 02 '18 21:09 faromero

We were also hitting this issue recently. I will get back with a fix soon and post it here. Thanks for taking the time to try it out.

venkata91 avatar Sep 04 '18 17:09 venkata91

I believe I have found a solution: In spark-on-lambda/common/network-common/pom.xml, add the following dependency (as suggested previously):

<dependency>
  <groupId>com.hadoop.gplcompression</groupId>
  <artifactId>hadoop-lzo</artifactId>
  <version>0.4.19</version>
</dependency>

Then, in spark-on-lambda/pom.xml, add the following repository (which "houses" hadoop-lzo):

<repository>
  <id>twitter</id>
  <name>Twitter Repository</name>
  <url>http://maven.twttr.com</url>
</repository>

After this, I ran the make-distribution.sh command from your README and was able to build it all the way through.
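(If you'd rather not modify the project poms, a sketch of an alternative: the same repository can be declared once in ~/.m2/settings.xml via an always-active profile; the profile id below is arbitrary.)

```xml
<!-- ~/.m2/settings.xml (sketch; profile id is an arbitrary choice) -->
<settings>
  <profiles>
    <profile>
      <id>twitter-repo</id>
      <repositories>
        <repository>
          <id>twitter</id>
          <name>Twitter Repository</name>
          <url>http://maven.twttr.com</url>
        </repository>
      </repositories>
    </profile>
  </profiles>
  <activeProfiles>
    <activeProfile>twitter-repo</activeProfile>
  </activeProfiles>
</settings>
```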

faromero avatar Sep 04 '18 17:09 faromero

Nice workaround! Let me also try it and update it.

venkata91 avatar Sep 04 '18 17:09 venkata91

Also, may I ask what use case you're trying this out for, or did you just want to try it out?

venkata91 avatar Sep 04 '18 17:09 venkata91

Thanks for working to update it!

We are working on a research project associated with using Lambda for what we call "interactive massively parallel" applications, and wanted to compare Spark-on-Lambda to current state-of-the-art, as well as our work!

By the way, from your blog post, do you have the data available that you use for sorting 100GB in under 10 minutes?

faromero avatar Sep 04 '18 17:09 faromero

Interesting! Can you please elaborate a bit more on that? By the way, the data is generated using the Teragen utility from https://github.com/ehiggs/spark-terasort.

venkata91 avatar Sep 04 '18 17:09 venkata91

You can view our work here: we call it gg, and while it was originally intended for compilation, it now supports general-purpose applications (as simple as sorting and as complex as video encoding). Let me know if you have any questions about it (we can discuss in a different forum instead of this issue thread).

I will try to run your sorting example and let you know if I have any issues!

faromero avatar Sep 04 '18 17:09 faromero

Another, easier workaround is to remove the pom.xml additions, essentially reverting the commit "Fix pom.xml to have the other Qubole repository location having 2.6.0... (2ca6c68ed5)".

Build your package using this command: ./dev/make-distribution.sh --name spark-lambda-2.1.0 --tgz -Phive -Phadoop-2.7 -DskipTests

And finally, add the jars below to the classpath before starting spark-shell:

1. wget http://central.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
2. wget http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.3/hadoop-aws-2.7.3.jar

Refer here - https://markobigdata.com/2017/04/23/manipulating-files-from-s3-with-apache-spark/
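The download-and-launch steps above can be sketched as follows. Two assumptions: central.maven.org no longer serves artifacts, so the repo1.maven.org mirror URLs below are a guess at the current location; --jars is the standard spark-shell option for putting extra jars on the driver and executor classpath.

```shell
# Fetch the two jars (mirror URLs assumed; see note above) and pass
# them to spark-shell via --jars (comma-separated list).
JARS="aws-java-sdk-1.7.4.jar,hadoop-aws-2.7.3.jar"
# wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
# wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.3/hadoop-aws-2.7.3.jar
# ./bin/spark-shell --jars "$JARS"
echo "$JARS"
```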

venkata91 avatar Oct 20 '18 05:10 venkata91

Hi venkata91, I wrote you an email. I'm looking for an advisor for my startup, a Spark-based web-scraping service. The idea is to use this kind of serverless computation, but I'm having problems. As soon as you have time, I'd like to discuss it in more depth.

webroboteu avatar May 27 '19 20:05 webroboteu