emma
Size of fat jars
This is a general discussion question regarding the size of the fat-jars produced by the emma-spark-examples and emma-flink-examples modules.
Running
find -name '*jar' | grep -v original | grep -v nexus | xargs du -hs
in the project root shows the following output
65M ./emma-examples/emma-examples-spark/target/emma-examples-spark-0.2-SNAPSHOT.jar
64M ./emma-examples/emma-examples-flink/target/emma-examples-flink-0.2-SNAPSHOT.jar
440K ./emma-examples/emma-examples-library/target/emma-examples-library-0.2-SNAPSHOT.jar
420K ./emma-examples/emma-examples-library/target/emma-examples-library-0.2-SNAPSHOT-tests.jar
148K ./emma-spark/target/emma-spark-0.2-SNAPSHOT.jar
148K ./emma-flink/target/emma-flink-0.2-SNAPSHOT.jar
20K ./emma-gui/target/emma-gui-0.2-SNAPSHOT.jar
56K ./emma-quickstart/target/emma-quickstart-0.2-SNAPSHOT.jar
3,7M ./emma-language/target/emma-language-0.2-SNAPSHOT.jar
3,9M ./emma-language/target/emma-language-0.2-SNAPSHOT-tests.jar
The emma-flink-examples and emma-spark-examples jars are ~65M each, which is also indicative of the expected size of any client jar that binds emma-language and one of emma-flink or emma-spark in the future.
A closer look at emma-spark-examples reveals the root causes (the output is similar for the other one).
mvn dependency:list -DincludeScope=runtime -DoutputAbsoluteArtifactFilename=true \
| grep "$HOME/.m2/repository" \
| awk -F":compile:" '{print $2}' \
| xargs du -hs \
| sort -r -h \
| sed "s|$HOME/.m2/repository/||"
The list looks as follows.
14M org/scalanlp/breeze_2.11/0.12/breeze_2.11-0.12.jar
12M org/scalaz/scalaz-core_2.11/7.2.7/scalaz-core_2.11-7.2.7.jar
7,0M org/spire-math/spire_2.11/0.7.4/spire_2.11-0.7.4.jar
4,4M org/typelevel/cats-kernel_2.11/0.9.0/cats-kernel_2.11-0.9.0.jar
3,7M org/emmalanguage/emma-language/0.2-SNAPSHOT/emma-language-0.2-SNAPSHOT.jar
3,4M com/chuusai/shapeless_2.11/2.3.2/shapeless_2.11-2.3.2.jar
3,3M org/typelevel/cats-core_2.11/0.9.0/cats-core_2.11-0.9.0.jar
3,0M org/scalacheck/scalacheck_2.11/1.13.4/scalacheck_2.11-1.13.4.jar
2,0M org/apache/commons/commons-math3/3.4.1/commons-math3-3.4.1.jar
1,2M org/typelevel/cats-laws_2.11/0.9.0/cats-laws_2.11-0.9.0.jar
1,2M net/sourceforge/f2j/arpack_combined_all/0.1/arpack_combined_all-0.1.jar
1,1M org/xerial/snappy/snappy-java/1.1.2.6/snappy-java-1.1.2.6.jar
1,0M org/apache/parquet/parquet-jackson/1.9.0/parquet-jackson-1.9.0.jar
944K org/apache/parquet/parquet-column/1.9.0/parquet-column-1.9.0.jar
780K org/apache/parquet/parquet-encoding/1.9.0/parquet-encoding-1.9.0.jar
764K org/codehaus/jackson/jackson-mapper-asl/1.9.11/jackson-mapper-asl-1.9.11.jar
748K com/github/rwl/jtransforms/2.4.0/jtransforms-2.4.0.jar
724K org/scalactic/scalactic_2.11/3.0.3/scalactic_2.11-3.0.3.jar
480K log4j/log4j/1.2.17/log4j-1.2.17.jar
440K org/emmalanguage/emma-examples-library/0.2-SNAPSHOT/emma-examples-library-0.2-SNAPSHOT.jar
384K org/apache/parquet/parquet-format/2.3.1/parquet-format-2.3.1.jar
344K com/univocity/univocity-parsers/2.4.1/univocity-parsers-2.4.1.jar
288K io/spray/spray-json_2.11/1.3.3/spray-json_2.11-1.3.3.jar
280K org/typelevel/cats-free_2.11/0.9.0/cats-free_2.11-0.9.0.jar
276K com/typesafe/config/1.3.1/config-1.3.1.jar
268K org/apache/parquet/parquet-hadoop/1.9.0/parquet-hadoop-1.9.0.jar
244K io/verizon/quiver/core_2.11/5.5.14-scalaz-7.2/core_2.11-5.5.14-scalaz-7.2.jar
228K org/codehaus/jackson/jackson-core-asl/1.9.11/jackson-core-asl-1.9.11.jar
208K org/typelevel/cats-kernel-laws_2.11/0.9.0/cats-kernel-laws_2.11-0.9.0.jar
180K org/scalanlp/breeze-macros_2.11/0.12/breeze-macros_2.11-0.12.jar
164K com/github/mpilquist/simulacrum_2.11/0.10.0/simulacrum_2.11-0.10.0.jar
164K com/github/fommil/netlib/core/1.1.2/core-1.1.2.jar
148K org/emmalanguage/emma-spark/0.2-SNAPSHOT/emma-spark-0.2-SNAPSHOT.jar
144K com/github/scopt/scopt_2.11/3.5.0/scopt_2.11-3.5.0.jar
108K com/jsuereth/scala-arm_2.11/2.0/scala-arm_2.11-2.0.jar
96K commons-pool/commons-pool/1.5.4/commons-pool-1.5.4.jar
88K org/spire-math/spire-macros_2.11/0.7.4/spire-macros_2.11-0.7.4.jar
72K commons-codec/commons-codec/1.5/commons-codec-1.5.jar
44K org/typelevel/discipline_2.11/0.7.2/discipline_2.11-0.7.2.jar
44K org/slf4j/slf4j-api/1.7.25/slf4j-api-1.7.25.jar
44K org/apache/parquet/parquet-common/1.9.0/parquet-common-1.9.0.jar
36K org/typelevel/machinist_2.11/0.6.1/machinist_2.11-0.6.1.jar
24K com/typesafe/scala-logging/scala-logging-slf4j_2.11/2.1.2/scala-logging-slf4j_2.11-2.1.2.jar
20K net/sf/opencsv/opencsv/2.3/opencsv-2.3.jar
16K org/scala-sbt/test-interface/1.0/test-interface-1.0.jar
12K org/typelevel/catalysts-macros_2.11/0.0.5/catalysts-macros_2.11-0.0.5.jar
12K org/slf4j/slf4j-log4j12/1.7.25/slf4j-log4j12-1.7.25.jar
8,0K org/typelevel/cats-macros_2.11/0.9.0/cats-macros_2.11-0.9.0.jar
8,0K com/typesafe/scala-logging/scala-logging-api_2.11/2.1.2/scala-logging-api_2.11-2.1.2.jar
4,0K org/typelevel/macro-compat_2.11/1.1.1/macro-compat_2.11-1.1.1.jar
4,0K org/typelevel/cats-jvm_2.11/0.9.0/cats-jvm_2.11-0.9.0.jar
4,0K org/typelevel/cats_2.11/0.9.0/cats_2.11-0.9.0.jar
4,0K org/typelevel/catalysts-platform_2.11/0.0.5/catalysts-platform_2.11-0.0.5.jar
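To compare the listing above with the ~65M fat jar, the individual sizes can be added up. The following is a minimal sketch and not part of the project: the helper name total_kib is hypothetical, and it assumes a du that supports -k. Because the shaded jar is recompressed, the total should only be expected to match the fat jar size roughly.

```shell
# Hypothetical helper: read file paths from stdin (one per line) and print
# the sum of their on-disk sizes in KiB. Paths that do not exist are skipped.
total_kib() {
  total=0
  while IFS= read -r path; do
    [ -f "$path" ] || continue
    size=$(du -k "$path" | cut -f1)
    total=$((total + size))
  done
  echo "$total"
}

# Possible usage: feed it the paths produced by the awk step of the pipeline
#   mvn dependency:list ... | awk -F":compile:" '{print $2}' | total_kib
```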
It might be better to rely on the breeze version shipped with the dataflow engine rather than bundling our own. @ParkL could you check the versions bundled with Spark 2.1.0 and Flink 1.2.1?
I am not sure what to do with scalaz. It seems that we're only using it due to quiver, and I am not aware of any alternative which has a smaller footprint or, say, relies on cats.
I am open to suggestions.
Flink 1.2.0 bundles breeze 0.12.
Spark 2.1.0 bundles breeze 0.12 as well, but with some exclusions.
I am not sure whether those are available in the classpath when submitting a job against a running cluster.
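One way to see what a distribution ships as separate jars is to search its jar directory. This is only a sketch: the helper name bundled_jars is hypothetical, and the directory to search is an assumption (for a Spark 2.x binary distribution it would typically be the jars directory under the installation root). It does not detect classes that are packed inside another jar, and it says nothing about what a running cluster actually puts on the job classpath.

```shell
# Hypothetical helper: list jar files in directory $1 (searched recursively)
# whose file name starts with prefix $2.
bundled_jars() {
  find "$1" -type f -name "$2*.jar" | sort
}

# Possible usage, assuming SPARK_HOME points at a Spark 2.x distribution:
#   bundled_jars "$SPARK_HOME/jars" breeze
```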
I can access Breeze in the Spark REPL, but not in the Flink REPL. I have a few questions:
- How should libraries available in Spark and Flink be scoped - as provided?
- Why are test libraries like scalacheck and cats-laws submitted with the jar?
- Currently quiver is only needed at compile time. Can't we exclude it from the jar?
- Yes, things that can be found in the Flink or Spark classpath should be marked as provided. My understanding is that those are excluded from the fat-jar built by the shade plugin.
- I guess that those can be found along some non-(test or provided) path in the dependency tree.
- This is a great idea!
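To find the path along which a test library such as scalacheck or cats-laws ends up in the fat jar, Maven's dependency:tree goal can be restricted with its includes filter. The wrapper below is a sketch only: the function name why_bundled is hypothetical, and it assumes it is run from the directory of the examples module in question.

```shell
# Hypothetical helper: print the dependency-tree branches that lead to the
# artifacts matching the given pattern (groupId[:artifactId[:type[:version]]]).
# Run it from the module directory whose fat jar is being inspected.
why_bundled() {
  mvn dependency:tree -Dincludes="$1"
}

# Possible calls for the two test libraries from the listing:
#   why_bundled org.scalacheck
#   why_bundled org.typelevel:cats-laws_2.11
```

The scope printed at the end of each matching line shows whether the artifact arrives as compile or runtime rather than test or provided, and the parent lines show which dependency pulls it in.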
Breeze is not in the Flink REPL because it's not a top-level dependency in Flink (it's only listed in Flink's ML library).
So MLLib and Flink-ML have a dependency on Breeze and Breeze has a dependency on Shapeless. The newest version of Breeze depends on the newest version of Shapeless, but MLLib and Flink-ML reference older versions of Breeze.