
feature request: Java 17 support

Open • pettyalex opened this issue 3 months ago • 2 comments

What happened?

Hello,

It looks like Hail has a hard-coded check that only allows it to run on Java 8 and 11, despite Spark having supported Java 17 for a couple of years now, including on Spark 3.3.x, the release currently used by pip install hail: https://spark.apache.org/releases/spark-release-3-3-0.html#build

Would it be possible to add Java 17 support, or even remove the Java version check entirely, so that Hail can track whatever the underlying Spark supports without needing additional updates?
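For what it's worth, a warn-instead-of-fail check might look something like the sketch below. This is purely hypothetical illustration, not Hail's actual code; it assumes the check lives on the Python side, and the function name, tested-version set, and messages are all mine:

# Hypothetical sketch of a permissive Java version check; not Hail's actual code.
import re
import subprocess
import warnings

TESTED_MAJOR_VERSIONS = {8, 11}  # versions Hail has actually been tested on

def check_java_version():
    # `java -version` writes to stderr, e.g. 'openjdk version "17.0.9" ...'
    output = subprocess.run(["java", "-version"], capture_output=True, text=True).stderr
    match = re.search(r'version "(\d+)(?:\.(\d+))?', output)
    if match is None:
        warnings.warn("Could not determine the Java version from `java -version`.")
        return
    major = int(match.group(1))
    if major == 1 and match.group(2):  # Java 8 and earlier report as 1.x
        major = int(match.group(2))
    if major not in TESTED_MAJOR_VERSIONS:
        # Warn rather than refuse to start, so Hail tracks whatever Spark supports.
        warnings.warn(
            f"Hail is tested on Java {sorted(TESTED_MAJOR_VERSIONS)}, "
            f"found Java {major}; proceeding anyway."
        )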

There are several benefits to moving to Java 17, including:

  1. https://kstefanj.github.io/2021/11/24/gc-progress-8-17.html - Significant garbage collector gains between Java 8 and 17 that will likely improve throughput and reduce costs.
  2. https://vmnotescom.wordpress.com/2021/09/14/java-17-whats-new-removed-and-preview-in-jdk-17/ - Better Apple Silicon support. I know that darwin-aarch64 support has been backported to 8 and 11, but 17 is faster on that platform.
  3. https://spark.apache.org/releases/spark-release-3-5-0.html#removals-behavior-changes-and-deprecations - The next major release of Spark will require Java 17 as the minimum version, and making the change now is easier than making several changes all at once later. From the release notes:

The following features will be removed in the next Spark major release:

  - Support for Java 8 and Java 11, and the minimal supported Java version will be Java 17
  - Support for Scala 2.12, and the minimal supported Scala version will be 2.13

Also, requiring specifically Java 8 or 11 has caused some friction for students and researchers evaluating Hail for the first time. In the past few weeks, I've talked to many of them who wanted to try Hail and followed the documentation to install Azul Java 8, but already had an existing Java installation and did not update their PATH or JAVA_HOME. Most of those existing installs were Java 17, since 17 is the current default on most Linux distros and a common version to have installed via Brew on Mac in the past few years.

Alternatively, if you don't want to allow Java 17, could Hail bundle a JRE in the PyPI distributable? Running jdeps on the Hail shadow jar, I saw that it only needs these modules in a JRE:

java.base,java.compiler,java.desktop,java.management,java.naming,java.rmi,java.scripting,java.security.jgss,java.sql,jdk.httpserver,jdk.unsupported
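For reference, that module list came from an invocation along these lines, once the module-info problem described further down was worked around (the --multi-release flag matters because the shadow jar is a multi-release jar; the path is from my local build):

jdeps --multi-release 9 --print-module-deps --ignore-missing-deps hail-all-spark.jar

--print-module-deps emits exactly the comma-separated list that jlink's --add-modules expects.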

That means that a JRE created with jlink like this:

jlink --add-modules java.base,java.compiler,java.desktop,java.management,java.naming,java.rmi,java.scripting,java.security.jgss,java.sql,jdk.httpserver,jdk.unsupported --output minimaljre --strip-debug --no-man-pages --no-header-files --compress=2

comes in at under 30 MB gzipped, which would increase the PyPI package size by about 20% while letting users install and run Hail in any supported Python environment without having to think about Java versions at all. Alternatively, have you ever considered distributing Hail through conda-forge or bioconda, where you could declare a JRE dependency that is installed and linked automatically?
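If a minimal JRE like that shipped inside the wheel, pointing JAVA_HOME at it should be all that's required, since pyspark resolves java from JAVA_HOME before falling back to PATH (the install path below is illustrative):

JAVA_HOME=/path/to/site-packages/hail/minimaljre python -c "import hail as hl; hl.init()"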

Is there a better channel than GitHub Issues for feature requests? I realize this is not a bug report, and if you want to just close it and say "nope," that's fine, but I've seen a good number of first-time Hail users get a bad impression because of this.

Ramble about other nitpicks below

I don't want to spam this repo with issues, but I also noticed a couple of other things while poking around at Hail:

  1. It seems to use the default Java GC, which is G1 on Java 9+. Performance on newer Java versions would likely improve with -XX:+UseParallelGC in the Java options (see the sketch after the jar listing below).
  2. The Hail jar includes a module-info.class from Azure Storage, which broke my first attempt to use jdeps to see what modules it needs. Specifically, hail-all-spark.jar says it exports:
backend % jar --describe-module --file=hail-all-spark.jar
releases: 9

com.azure.storage.blob@<version> jar:file:///Users/alex/src/hail/hail/build/deploy/hail/backend/hail-all-spark.jar!/module-info.class
exports com.azure.storage.blob
exports com.azure.storage.blob.models
exports com.azure.storage.blob.options
exports com.azure.storage.blob.sas
exports com.azure.storage.blob.specialized
requires com.azure.storage.common transitive
requires com.azure.storage.internal.avro
requires com.fasterxml.jackson.dataformat.xml
requires java.base mandated
qualified exports com.azure.storage.blob.implementation to com.azure.storage.blob.batch com.azure.storage.blob.cryptography com.azure.storage.file.datalake
qualified exports com.azure.storage.blob.implementation.models to com.azure.storage.blob.batch com.azure.storage.blob.cryptography
qualified exports com.azure.storage.blob.implementation.util to com.azure.storage.blob.batch com.azure.storage.blob.changefeed com.azure.storage.blob.cryptography com.azure.storage.blob.nio com.azure.storage.file.datalake com.fasterxml.jackson.databind
qualified opens com.azure.storage.blob.implementation to com.azure.core com.fasterxml.jackson.databind
qualified opens com.azure.storage.blob.implementation.models to com.azure.core com.fasterxml.jackson.databind
qualified opens com.azure.storage.blob.models to com.azure.core com.fasterxml.jackson.databind
contains com.azure.storage.blob.implementation.accesshelpers

You probably don't want this: shadow jars are usually not modular, and you can exclude the module-info.class from them (with the Shadow plugin, a single exclude rule suffices): https://github.com/johnrengelman/shadow/issues/352
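On the GC point (item 1 above), here is a minimal sketch of how a user could opt in today, assuming hl.init's spark_conf parameter is used to pass the flags through; in deployments where the driver JVM is already running, the driver-side option may instead need to go in spark-defaults.conf or PYSPARK_SUBMIT_ARGS:

import hail as hl

# Standard Spark properties; ask both driver and executor JVMs for the parallel collector.
hl.init(spark_conf={
    "spark.driver.extraJavaOptions": "-XX:+UseParallelGC",
    "spark.executor.extraJavaOptions": "-XX:+UseParallelGC",
})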

Version

0.2.128

Relevant log output

No response

pettyalex • Apr 01 '24 17:04