
Size of fat jars

Open aalexandrov opened this issue 7 years ago • 5 comments

This is a general discussion about the size of the fat jars produced by the emma-examples-spark and emma-examples-flink modules.

Running

find . -name '*.jar' | grep -v original | grep -v nexus | xargs du -hs

in the project root shows the following output

65M	./emma-examples/emma-examples-spark/target/emma-examples-spark-0.2-SNAPSHOT.jar
64M	./emma-examples/emma-examples-flink/target/emma-examples-flink-0.2-SNAPSHOT.jar
440K	./emma-examples/emma-examples-library/target/emma-examples-library-0.2-SNAPSHOT.jar
420K	./emma-examples/emma-examples-library/target/emma-examples-library-0.2-SNAPSHOT-tests.jar
148K	./emma-spark/target/emma-spark-0.2-SNAPSHOT.jar
148K	./emma-flink/target/emma-flink-0.2-SNAPSHOT.jar
20K	./emma-gui/target/emma-gui-0.2-SNAPSHOT.jar
56K	./emma-quickstart/target/emma-quickstart-0.2-SNAPSHOT.jar
3,7M	./emma-language/target/emma-language-0.2-SNAPSHOT.jar
3,9M	./emma-language/target/emma-language-0.2-SNAPSHOT-tests.jar

The emma-examples-flink and emma-examples-spark jars are ~65M each, which also indicates the size to expect for any future client jar that bundles emma-language with either emma-flink or emma-spark.

A closer look at emma-examples-spark reveals the root causes (the output for emma-examples-flink is similar).

mvn dependency:list -DincludeScope=runtime -DoutputAbsoluteArtifactFilename=true \
  | grep "$HOME/.m2/repository" \
  | awk -F":compile:" '{print $2}' \
  | xargs du -hs \
  | sort -r -h \
  | sed "s|$HOME/.m2/repository/||"

The list looks as follows.

14M	org/scalanlp/breeze_2.11/0.12/breeze_2.11-0.12.jar
12M	org/scalaz/scalaz-core_2.11/7.2.7/scalaz-core_2.11-7.2.7.jar
7,0M	org/spire-math/spire_2.11/0.7.4/spire_2.11-0.7.4.jar
4,4M	org/typelevel/cats-kernel_2.11/0.9.0/cats-kernel_2.11-0.9.0.jar
3,7M	org/emmalanguage/emma-language/0.2-SNAPSHOT/emma-language-0.2-SNAPSHOT.jar
3,4M	com/chuusai/shapeless_2.11/2.3.2/shapeless_2.11-2.3.2.jar
3,3M	org/typelevel/cats-core_2.11/0.9.0/cats-core_2.11-0.9.0.jar
3,0M	org/scalacheck/scalacheck_2.11/1.13.4/scalacheck_2.11-1.13.4.jar
2,0M	org/apache/commons/commons-math3/3.4.1/commons-math3-3.4.1.jar
1,2M	org/typelevel/cats-laws_2.11/0.9.0/cats-laws_2.11-0.9.0.jar
1,2M	net/sourceforge/f2j/arpack_combined_all/0.1/arpack_combined_all-0.1.jar
1,1M	org/xerial/snappy/snappy-java/1.1.2.6/snappy-java-1.1.2.6.jar
1,0M	org/apache/parquet/parquet-jackson/1.9.0/parquet-jackson-1.9.0.jar
944K	org/apache/parquet/parquet-column/1.9.0/parquet-column-1.9.0.jar
780K	org/apache/parquet/parquet-encoding/1.9.0/parquet-encoding-1.9.0.jar
764K	org/codehaus/jackson/jackson-mapper-asl/1.9.11/jackson-mapper-asl-1.9.11.jar
748K	com/github/rwl/jtransforms/2.4.0/jtransforms-2.4.0.jar
724K	org/scalactic/scalactic_2.11/3.0.3/scalactic_2.11-3.0.3.jar
480K	log4j/log4j/1.2.17/log4j-1.2.17.jar
440K	org/emmalanguage/emma-examples-library/0.2-SNAPSHOT/emma-examples-library-0.2-SNAPSHOT.jar
384K	org/apache/parquet/parquet-format/2.3.1/parquet-format-2.3.1.jar
344K	com/univocity/univocity-parsers/2.4.1/univocity-parsers-2.4.1.jar
288K	io/spray/spray-json_2.11/1.3.3/spray-json_2.11-1.3.3.jar
280K	org/typelevel/cats-free_2.11/0.9.0/cats-free_2.11-0.9.0.jar
276K	com/typesafe/config/1.3.1/config-1.3.1.jar
268K	org/apache/parquet/parquet-hadoop/1.9.0/parquet-hadoop-1.9.0.jar
244K	io/verizon/quiver/core_2.11/5.5.14-scalaz-7.2/core_2.11-5.5.14-scalaz-7.2.jar
228K	org/codehaus/jackson/jackson-core-asl/1.9.11/jackson-core-asl-1.9.11.jar
208K	org/typelevel/cats-kernel-laws_2.11/0.9.0/cats-kernel-laws_2.11-0.9.0.jar
180K	org/scalanlp/breeze-macros_2.11/0.12/breeze-macros_2.11-0.12.jar
164K	com/github/mpilquist/simulacrum_2.11/0.10.0/simulacrum_2.11-0.10.0.jar
164K	com/github/fommil/netlib/core/1.1.2/core-1.1.2.jar
148K	org/emmalanguage/emma-spark/0.2-SNAPSHOT/emma-spark-0.2-SNAPSHOT.jar
144K	com/github/scopt/scopt_2.11/3.5.0/scopt_2.11-3.5.0.jar
108K	com/jsuereth/scala-arm_2.11/2.0/scala-arm_2.11-2.0.jar
96K	commons-pool/commons-pool/1.5.4/commons-pool-1.5.4.jar
88K	org/spire-math/spire-macros_2.11/0.7.4/spire-macros_2.11-0.7.4.jar
72K	commons-codec/commons-codec/1.5/commons-codec-1.5.jar
44K	org/typelevel/discipline_2.11/0.7.2/discipline_2.11-0.7.2.jar
44K	org/slf4j/slf4j-api/1.7.25/slf4j-api-1.7.25.jar
44K	org/apache/parquet/parquet-common/1.9.0/parquet-common-1.9.0.jar
36K	org/typelevel/machinist_2.11/0.6.1/machinist_2.11-0.6.1.jar
24K	com/typesafe/scala-logging/scala-logging-slf4j_2.11/2.1.2/scala-logging-slf4j_2.11-2.1.2.jar
20K	net/sf/opencsv/opencsv/2.3/opencsv-2.3.jar
16K	org/scala-sbt/test-interface/1.0/test-interface-1.0.jar
12K	org/typelevel/catalysts-macros_2.11/0.0.5/catalysts-macros_2.11-0.0.5.jar
12K	org/slf4j/slf4j-log4j12/1.7.25/slf4j-log4j12-1.7.25.jar
8,0K	org/typelevel/cats-macros_2.11/0.9.0/cats-macros_2.11-0.9.0.jar
8,0K	com/typesafe/scala-logging/scala-logging-api_2.11/2.1.2/scala-logging-api_2.11-2.1.2.jar
4,0K	org/typelevel/macro-compat_2.11/1.1.1/macro-compat_2.11-1.1.1.jar
4,0K	org/typelevel/cats-jvm_2.11/0.9.0/cats-jvm_2.11-0.9.0.jar
4,0K	org/typelevel/cats_2.11/0.9.0/cats_2.11-0.9.0.jar
4,0K	org/typelevel/catalysts-platform_2.11/0.0.5/catalysts-platform_2.11-0.0.5.jar
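
To see which library families dominate, the per-jar sizes above can be summed per group directory. The following is a minimal sketch and not part of the original analysis: `sum_by_group` is a hypothetical helper, it expects `du -k` output (plain kilobyte counts, which sidesteps the locale-dependent `7,0M` formatting), and grouping on the first two path components is an arbitrary choice.

```shell
# Sum `du -k`-style lines (SIZE_KB<TAB>PATH) per leading path prefix.
# Hypothetical helper: the prefix depth (two components, e.g. org/typelevel)
# is an assumption chosen for illustration.
sum_by_group() {
  awk -F'\t' '
    {
      n = split($2, parts, "/")
      key = (n >= 2) ? parts[1] "/" parts[2] : parts[1]
      total[key] += $1
    }
    END {
      for (k in total) printf "%d\t%s\n", total[k], k
    }
  ' | sort -k1,1nr -k2,2   # largest group first, ties broken by name
}
```

To feed it, replace `du -hs` with `du -k` in the pipeline above, drop the `sort` step, and append `| sum_by_group` after the `sed` step.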

It might be better to rely on the breeze version shipped with the dataflow engine rather than bundling our own. @ParkL could you check the versions bundled with Spark 2.1.0 and Flink 1.2.1?

I am not sure what to do with scalaz. It seems that we only use it because of quiver, and I am not aware of any alternative that has a smaller footprint or, say, relies on cats.

I am open to suggestions.

aalexandrov avatar May 04 '17 10:05 aalexandrov

Flink 1.2.0 bundles breeze 0.12.

Spark 2.1.0 bundles breeze 0.12 as well, but with some exclusions.

I am not sure whether those are available in the classpath when submitting a job against a running cluster.
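
One rough way to check is to look for the library among the jars shipped in the engine's distribution directory. This is a minimal sketch under assumptions: `bundled_jars` is a hypothetical helper, and the directory layout (`$SPARK_HOME/jars` for Spark 2.x, `$FLINK_HOME/lib` for Flink) should be verified against the actual installation.

```shell
# List jars under a directory whose file name starts with a given library name.
# Hypothetical helper; $1 is the directory to search, $2 the name prefix.
bundled_jars() {
  find "$1" -type f -name "${2}*.jar" | sort
}
```

For example, `bundled_jars "$SPARK_HOME/jars" breeze`. A file-name search like this cannot see classes that were merged into an uber jar, so an empty result does not prove that the library is absent from the classpath.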

aalexandrov avatar May 04 '17 10:05 aalexandrov

I can access Breeze in the Spark REPL, but not in the Flink REPL. I have a few questions:

  1. How should libraries available in Spark and Flink be scoped - as provided?
  2. Why are test libraries like scalacheck and cats-laws submitted with the jar?
  3. Currently quiver is only needed at compile time. Can't we exclude it from the jar?

joroKr21 avatar May 05 '17 12:05 joroKr21

  1. Yes, things that can be found in the Flink or Spark classpath should be marked as provided. My understanding is that those are excluded from the fat jar built by the shade plugin.
  2. I guess that those can be found along some non-(test or provided) path in the dependency tree.
  3. This is a great idea!
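
The guess in point 2 can be checked by asking Maven which paths pull the test libraries in. This is a minimal sketch, assuming it is run from the emma-examples-spark module directory; `trace_dependency` and the `MVN` override are hypothetical names introduced here, while `dependency:tree` and its `includes` filter belong to the Maven dependency plugin.

```shell
# Print the parts of the dependency tree that lead to a given artifact.
# $1 is an includes pattern such as groupId or groupId:artifactId.
# MVN defaults to `mvn`; set MVN=echo for a dry run that only prints the arguments.
trace_dependency() {
  "${MVN:-mvn}" dependency:tree -Dincludes="$1"
}
```

For example, `trace_dependency org.scalacheck` or `trace_dependency org.typelevel:cats-laws_2.11`. Each reported path shows the scope of the artifact, which should make it clear which intermediate dependency drags the library in with compile or runtime scope.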

aalexandrov avatar May 05 '17 13:05 aalexandrov

Breeze is not in the Flink REPL because it's not a top level dependency in Flink (it's only listed in Flink's ML library).

aalexandrov avatar May 05 '17 13:05 aalexandrov

So MLlib and Flink-ML have a dependency on Breeze, and Breeze has a dependency on Shapeless. The newest version of Breeze depends on the newest version of Shapeless, but MLlib and Flink-ML reference older versions of Breeze.

joroKr21 avatar May 08 '17 15:05 joroKr21