POM Changes for Spark 4.0.0 [databricks]
This change adds the profiles needed to build the plugin for Spark 4.0.0.
This PR depends on #10993.
Fixes #9259.
This is in draft until https://github.com/NVIDIA/spark-rapids/pull/10993 and https://github.com/NVIDIA/spark-rapids/pull/10992 are merged.
The pre-commit task is failing for Spark 400 because the private repo changes haven't been merged.
@gerashegalov do you have any more questions on this PR?
The 400 PR check is failing. We need to wait until the spark-rapids-private artifact is published: https://github.com/NVIDIA/spark-rapids/actions/runs/9519648358/job/26243644463?pr=10994
I was wondering if this could be checked in before #11066. It doesn't need to wait, does it?
@razajafri @mythrocks there are two pending issues with this PR indicated by the failing check
:x: mvn[compile,RAT,scalastyle,docgen] / package-tests-scala213 (400, true) (pull_request)
- buildver=400 cannot be built due to the corresponding spark-rapids-private dependency being missing
- buildver=400 check should also be moved under JDK17
It currently fails with:
Error: Failed to execute goal org.apache.maven.plugins:maven-enforcer-plugin:3.5.0:enforce (enforce-java-version) on project rapids-4-spark-parent_2.13:
Error: Rule 0: org.apache.maven.enforcer.rules.version.RequireJavaVersion failed with message:
Error: Support for Spark 4.0.0 is only available with Java 17+
Update https://github.com/NVIDIA/spark-rapids/blob/4b449034f2a0105c687646176590b349f9901ea7/pom.xml#L871-L873
We need to exclude 400 from PR checks based on JDK 8 and 11.
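For reference, a minimal sketch of what such an enforcer rule looks like (the exact configuration at those pom lines may differ; the message text is taken from the failure above):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-enforcer-plugin</artifactId>
  <version>3.5.0</version>
  <executions>
    <execution>
      <id>enforce-java-version</id>
      <goals>
        <goal>enforce</goal>
      </goals>
      <configuration>
        <rules>
          <!-- Require JDK 17+ when buildver=400 is selected -->
          <requireJavaVersion>
            <version>[17,)</version>
            <message>Support for Spark 4.0.0 is only available with Java 17+</message>
          </requireJavaVersion>
        </rules>
      </configuration>
    </execution>
  </executions>
</plugin>
```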
Depends on #11092
> We need to exclude 400 from PR checks based on JDK 8 and 11.
It's not that simple, as our rules will pick up everything from the pom in snapshots and nosnapshots (I believe) to build against JDK 8. The jdk17.buildvers tag will build using JDK 17 but with Scala 2.12. I am working with @pxLi to come up with changes to the mvn-verify GitHub Actions workflow and possibly version-defs.sh.
~~The spark400 build would fail the jdk8+scala213 GitHub Actions check. As it's optional, we could add a fix later.~~
Q: are we sure about the new rule for Spark 400+, i.e. that all versions from 400 onward will only build against JDK17+Scala 2.13?
Also, do we want to release two dist jars, since a single jar should not include both JDK8- and JDK17-compiled objects? @sameerz @gerashegalov @jlowe @revans2 @tgravescs
If spark400 is only for debug purposes, then:
- for the nightly build CI, we can add a separate pipeline with a JDK17+Scala 2.13 environment to build the spark400 shim;
- for the pre-merge GitHub workflow, we can add a new profile like <snapshot400+> in pom.xml, and a new job/step in .github/workflows/mvn-verify-check.yml specific to the 400 shims profile (see the sketch below)
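A rough sketch of what such a profile could look like (the profile id and activation here are hypothetical; the real wiring also needs a matching job in mvn-verify-check.yml):

```xml
<profile>
  <!-- Hypothetical sketch: a profile for the 400 shim that only activates
       on JDK 17+, keeping it out of the JDK 8/11 pre-merge jobs -->
  <id>snapshot400</id>
  <activation>
    <jdk>[17,)</jdk>
  </activation>
  <!-- 400-specific modules/properties would go here -->
</profile>
```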
> Q: are we sure about the new rule for Spark 400+, i.e. that all versions from 400 onward will only build against JDK17+Scala 2.13?
Yes, Spark 400+ requires JDK17+Scala 2.13:
> Spark 4.0.0 is only supported for Scala 2.13
> Support for Spark 4.0.0 is only available with Java 17+
Regarding JDK17+JDK8 in the same jar, I don't think that's technically impossible. Jars are just zip files, and each class file separately contains metadata recording which JDK it was compiled with/for. If you couldn't combine class files compiled with different JDK versions, uber/shaded jars would have all sorts of problems, since one would need to worry about which JDK was used to build every dependency. The main issue would be trying to access JDK17-compiled files from JDK8, but if we limit the JDK17 class files to being accessed only by the Spark 4 shim, this should be OK in practice, since Spark 4 requires JDK17 to run and would be the only platform accessing those JDK17-compiled classes in the jar.
There's a bigger issue here though: I'm guessing most of the unshimmed class files may need to be shimmed, not because of APIs changing in Spark 4 but solely because they cannot be compiled with JDK8 and thus cannot generate a common class file across Spark versions. For example, I tried compiling the plugin against Spark 4 with JDK8. It complained that ColumnarCopyHelper, an unshimmed class, cannot be compiled because Spark's org.apache.spark.sql.execution.vectorized.WritableColumnVector class is compiled with version 61 and JDK8 only supports up to 52. When I hacked around that error, I hit GpuColumnVector, another unshimmed class, which cannot be compiled because Spark's org.apache.spark.sql.catalyst.expressions.Attribute class was compiled with 62 instead of 51. All of this implies that any plugin class that touches a Spark class would need to be shimmed, which would be almost the entire plugin jar.
JDK 9+ has a backwards-compatible solution to the problem of mixing code depending on different JDK releases: https://docs.oracle.com/en/java/javase/21/docs/specs/jar/jar.html#multi-release-jar-files
So a naive but robust solution is to group the existing build process by JDK and then combine the outputs in a multi-release jar. The file utility offers an easy way to script multi-release jar creation that is oblivious to the upstream pipeline:
$ file dist/target/parallel-world/com/nvidia/spark/SQLPlugin.class
dist/target/parallel-world/com/nvidia/spark/SQLPlugin.class: compiled Java class data, version 52.0 (Java 1.8)
$ file scala2.13/dist/target/parallel-world/com/nvidia/spark/SQLPlugin.class
scala2.13/dist/target/parallel-world/com/nvidia/spark/SQLPlugin.class: compiled Java class data, version 61.0
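A minimal sketch of combining the two builds with the JDK 9+ jar tool (the directory names are placeholders for the per-JDK build outputs; classes listed after --release 17 land under META-INF/versions/17/):
$ jar --create --file rapids-4-spark-multirelease.jar \
    -C jdk8-classes . \
    --release 17 -C jdk17-classes .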
Later we can optimize away the duplication with a methodology like #6555 and refactor to reduce the number of classes that depend on Spark.
Most likely we will not need changes for the versions of Spark before 4.0.0, even if the user has a custom build with JDK9+, because by default Spark cross-compiles down to JDK8: https://github.com/apache/spark/blob/0db5bdecfa6cbfff1be7690bb783a858989776b9/pom.xml#L115-L117
If cross-compilation to JDK8 is not used, we need to test whether it works via runtime backwards-compatibility, or whether there is some metadata remembering the compile-time class file format version.
I was able to manufacture a working multi-release jar with spark400 and spark330.
I am exploring cross-compilation from JDK17 to JDK8. It should require less work on the CI side.
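A minimal sketch of the idea (assuming the standard -release mechanism; the actual pom wiring may differ, and scalac 2.13 has an analogous -release option for the Scala sources):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-compiler-plugin</artifactId>
  <configuration>
    <!-- emit JDK 8 compatible byte code even when building on JDK 17 -->
    <release>8</release>
  </configuration>
</plugin>
```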
Because 400 was not previously tested in combination with other shims, there is outstanding coding work. Most notably, some unshimmed classes extend internal.Logging, which pulls in different methods between 3.x and 4.x.
verify-all-modules (311, 8) is in a state of 'Expected' because we haven't updated the parameter list in the GitHub Actions settings. We will do that once this PR is ready to merge, as it will affect all PRs.
Got the second prototype working using cross-compile https://github.com/razajafri/spark-rapids/pull/3
So the proposal is to simply switch at least the Scala 2.13 pipeline to JDK17 for building. Test runs for shims other than 4.0.0 can be executed with either JDK 8 or JDK 17; Spark 4.0 tests can only be run on 17+.
@pxLi All the cdh shims are failing because they are missing the flatbuffers-java dependency. Can you help me narrow this down?
> All the cdh shims are failing because they are missing the flatbuffers-java dependency.
I saw a compilation error but not a missing-dependency issue. If the incompatibility was not introduced in this PR, please file an issue for a developer to check, thanks.
[INFO] compiling 459 Scala sources and 58 Java sources to /home/runner/work/spark-rapids/spark-rapids/sql-plugin/target/spark332cdh/classes ...
Error: /home/runner/work/spark-rapids/spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/MetaUtils.scala:112: value createUnintializedVector is not a member of com.google.flatbuffers.FlatBufferBuilder
Error: /home/runner/work/spark-rapids/spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/MetaUtils.scala:235: value createUnintializedVector is not a member of com.google.flatbuffers.FlatBufferBuilder
Error: /home/runner/work/spark-rapids/spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/MetaUtils.scala:252: type ByteBufferFactory is not a member of object com.google.flatbuffers.FlatBufferBuilder
Error: /home/runner/work/spark-rapids/spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/MetaUtils.scala:303: value createUnintializedVector is not a member of com.google.flatbuffers.FlatBufferBuilder
Error: /home/runner/work/spark-rapids/spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/MetaUtils.scala:319: overloaded method constructor FlatBufferBuilder with alternatives:
(x$1: java.nio.ByteBuffer)com.google.flatbuffers.FlatBufferBuilder <and>
()com.google.flatbuffers.FlatBufferBuilder <and>
(x$1: Int)com.google.flatbuffers.FlatBufferBuilder
cannot be applied to (Int, com.nvidia.spark.rapids.DirectByteBufferFactory)
Error: /home/runner/work/spark-rapids/spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/MetaUtils.scala:328: overloaded method constructor FlatBufferBuilder with alternatives:
(x$1: java.nio.ByteBuffer)com.google.flatbuffers.FlatBufferBuilder <and>
()com.google.flatbuffers.FlatBufferBuilder <and>
(x$1: Int)com.google.flatbuffers.FlatBufferBuilder
cannot be applied to (Int, com.nvidia.spark.rapids.DirectByteBufferFactory)
Error: /home/runner/work/spark-rapids/spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/MetaUtils.scala:342: overloaded method constructor FlatBufferBuilder with alternatives:
(x$1: java.nio.ByteBuffer)com.google.flatbuffers.FlatBufferBuilder <and>
()com.google.flatbuffers.FlatBufferBuilder <and>
(x$1: Int)com.google.flatbuffers.FlatBufferBuilder
cannot be applied to (Int, com.nvidia.spark.rapids.DirectByteBufferFactory)
Error: /home/runner/work/spark-rapids/spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/MetaUtils.scala:368: overloaded method constructor FlatBufferBuilder with alternatives:
(x$1: java.nio.ByteBuffer)com.google.flatbuffers.FlatBufferBuilder <and>
()com.google.flatbuffers.FlatBufferBuilder <and>
(x$1: Int)com.google.flatbuffers.FlatBufferBuilder
cannot be applied to (Int, com.nvidia.spark.rapids.DirectByteBufferFactory)
Error: /home/runner/work/spark-rapids/spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/MetaUtils.scala:322: local val finIndex in method buildMetaResponse is never used
Error: /home/runner/work/spark-rapids/spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/MetaUtils.scala:337: local val finIndex in method buildShuffleMetadataRequest is never used
Error: /home/runner/work/spark-rapids/spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/MetaUtils.scala:375: local val root in method buildBufferTransferResponse is never used
Error: /home/runner/work/spark-rapids/spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/MetaUtils.scala:390: local val transferRequestOffset in method buildTransferRequest is never used
Error: [ERROR] 12 errors found
I cannot repro the cdh compilation issue with the latest upstream. I suspect it was introduced by the plugin version updates in this change.
Using your change but keeping scala-maven-plugin at 4.3.0 (not the new default 4.7.1) for the cdh profiles, compilation passed.
I am not too familiar with the CDH build environment; cc @tgravescs to help check whether there is any known issue or whether we must hardcode some compiler tooling versions for the CDH shims, thanks.
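For illustration, the pin could look roughly like this (the profile id and property name are hypothetical; the real pom may wire the plugin version differently per profile):

```xml
<profile>
  <id>release321cdh</id>
  <properties>
    <!-- hypothetical: keep the older scala-maven-plugin for cdh shims -->
    <scala.plugin.version>4.3.0</scala.plugin.version>
  </properties>
</profile>
```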
Filed an issue for Cloudera folks to add support for their shims against JDK 17 here
build
Blossom build is dependent on https://github.com/NVIDIA/spark-rapids/issues/11113
The pre-merge CI change for the Scala 2.13 + JDK17 build/test has been done: https://github.com/NVIDIA/spark-rapids/pull/11117
For the Scala 2.13 final dist jar's nightly build and IT, as discussed with @razajafri:
- We will build the Scala 2.13 dist jar for Spark 330+ with JDK 17
- We should run the Scala 2.13 dist jar's integration tests with a JDK 8 runtime
Also, Peixin and I discussed that we may need a larger IT matrix against the dist jar, e.g. Scala 2.13 + [JDK8/JDK11/JDK17].
@gerashegalov @jlowe @revans2 @tgravescs Please let me know if you have any concern, thanks!
cc @pxLi @GaryShen2008 @sameerz
Blossom nightly build issue: https://github.com/NVIDIA/spark-rapids/issues/11114
I could not reproduce the CDH compile issue https://github.com/NVIDIA/spark-rapids/pull/10994#issuecomment-2198819837 with scala-maven-plugin 4.7.1; it compiled fine for me.
Moreover, CDH cannot be built with the PR in its current shape using JDK17:
$ JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64 mvn -version
Apache Maven 3.9.3 (21122926829f1ead511c958d89bd2f672198ae9f)
Maven home: /home/gshegalov/dist/apache-maven-3.9.3
Java version: 17.0.11, vendor: Ubuntu, runtime: /usr/lib/jvm/java-17-openjdk-amd64
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "6.5.0-41-generic", arch: "amd64", family: "unix"
$ JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64 mvn clean install -DskipTests -Dmaven.scaladoc.skip -Dbuildver=321cdh -pl aggregator -am
12:13:12,271 [INFO] Compiling 12 Scala sources and 2 Java sources to /home/gshegalov/gits/NVIDIA/spark-rapids.worktrees/spark400/sql-plugin-api/target/spark321cdh/classes ...
12:13:13,890 [WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8
12:13:13,985 [ERROR] [Error] : warnings found and -Werror specified
Talking to @razajafri, it turns out that the CDH failure comment https://github.com/NVIDIA/spark-rapids/pull/10994#issuecomment-2198819837 refers to this very PR's workflow failure https://github.com/NVIDIA/spark-rapids/actions/runs/9732795299/job/26860542937. I suspect the reason is this PR's change to make cache-dependencies a matrix job; the caches may then clobber each other somehow.
I can reproduce the issue locally with the commands below:
openjdk version "1.8.0_292"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_292-b10)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.292-b10, mixed mode)
Apache Maven 3.9.7 (8b094c9513efc1b9ce2d952b3b9c8eaedaf8cbf0)
Maven home: /opt/homebrew/Cellar/maven/3.9.7/libexec
Java version: 1.8.0_292, vendor: AdoptOpenJDK
mvn package -am -pl integration_tests,tests,tools -P 'individual,pre-merge' -Dbuildver=321cdh -Dmaven.scalastyle.skip=true -Drat.skip=true
# also
mvn clean install -DskipTests -Dmaven.scaladoc.skip -Dbuildver=321cdh -pl aggregator -am
@razajafri @gerashegalov can you share your repro command and environment info? Have you tried JDK8, which the cdh shim should be built with? And which commit are you building against? I am building against a merge of this change (https://github.com/razajafri/spark-rapids/tree/SP-9259-POM-changes) into upstream branch-24.08, with the hardcoded 430 simply reverted.
UPDATE: ~~I also tried JDK17, but it failed the same way. Can you try purging your local .m2? There might be some old transitive dependencies, and a fresh run could introduce new versions which caused the issue.~~ Use the commands below for a local repro.
This also reproduces within our nightly docker image (mvn 3.6.3 + JDK8):
docker run --rm -it urm.nvidia.com/sw-spark-docker-local/plugin:dev-ubuntu20-cuda11.0.3-blossom-dev bash
git clone -b test-pom-changes https://github.com/razajafri/spark-rapids.git
cd spark-rapids
mvn -version
mvn package -am -pl integration_tests,tests,tools -P 'individual,pre-merge' -Dbuildver=321cdh -Dmaven.scalastyle.skip=true -Drat.skip=true
> can you share your repro command and environment info?
I am able to repro on branch-24.08 with no change other than upgrading the scala.maven.plugin version to 4.7.1. I can spend more time on this tomorrow.
Upgrading the scala-maven-plugin version to 4.9.1, which is what Spark is using now, fixes the problem.
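That is, roughly (the plugin may also be versioned via a shared property in the real pom):

```xml
<plugin>
  <groupId>net.alchim31.maven</groupId>
  <artifactId>scala-maven-plugin</artifactId>
  <version>4.9.1</version>
</plugin>
```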