POM Changes for Spark 4.0.0 [databricks]
This change adds the profiles needed to build the plugin for Spark 4.0.0.
This PR depends on #10993.
Fixes #9259.
This is in draft until https://github.com/NVIDIA/spark-rapids/pull/10993 and https://github.com/NVIDIA/spark-rapids/pull/10992 are merged.
The pre-commit task is failing for Spark 400 because the private repo changes haven't been merged.
@gerashegalov do you have any more questions on this PR?
The 400 PR check is failing. We need to wait until the spark-rapids-private artifact is published: https://github.com/NVIDIA/spark-rapids/actions/runs/9519648358/job/26243644463?pr=10994
I was wondering if this could be checked in before #11066. It doesn't need to wait, does it?
@razajafri @mythrocks there are two pending issues with this PR indicated by the failing check
:x: mvn[compile,RAT,scalastyle,docgen] / package-tests-scala213 (400, true) (pull_request)
- buildver=400 cannot be built due to the corresponding spark-rapids-private dependency being missing
- buildver=400 check should also be moved under JDK17
It currently fails with:
Error: Failed to execute goal org.apache.maven.plugins:maven-enforcer-plugin:3.5.0:enforce (enforce-java-version) on project rapids-4-spark-parent_2.13:
Error: Rule 0: org.apache.maven.enforcer.rules.version.RequireJavaVersion failed with message:
Error: Support for Spark 4.0.0 is only available with Java 17+
Update https://github.com/NVIDIA/spark-rapids/blob/4b449034f2a0105c687646176590b349f9901ea7/pom.xml#L871-L873
We need to exclude 400 from PR checks based on JDK 8 and 11.
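For reference, a minimal sketch of what such an enforcer rule looks like (the exact configuration at those pom lines may differ; the message text is taken from the failure above):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-enforcer-plugin</artifactId>
  <version>3.5.0</version>
  <executions>
    <execution>
      <id>enforce-java-version</id>
      <goals>
        <goal>enforce</goal>
      </goals>
      <configuration>
        <rules>
          <!-- Require JDK 17+ when buildver=400 is selected -->
          <requireJavaVersion>
            <version>[17,)</version>
            <message>Support for Spark 4.0.0 is only available with Java 17+</message>
          </requireJavaVersion>
        </rules>
      </configuration>
    </execution>
  </executions>
</plugin>
```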
Depends on #11092
> We need to exclude 400 from PR checks based on JDK 8 and 11.
It's not that simple, as our rules will pick up everything from the pom in snapshots and nosnapshots (I believe) to build against JDK 8. The jdk17.buildvers tag will build using JDK 17 but with Scala 2.12. I am working with @pxLi to come up with changes to the mvn-verify GitHub Actions workflow and possibly version-defs.sh.
~~The spark400 build would fail the jdk8+scala213 GitHub Actions check. As it's optional, we could add a fix later.~~
Q: are we sure about the new rule for Spark 400+, i.e. that all versions from 400 onward will only build against JDK17+Scala 2.13?
Also, do we want to release two dist jars, since a single jar should not include both JDK8- and JDK17-compiled objects? @sameerz @gerashegalov @jlowe @revans2 @tgravescs
If spark400 is only for debug purposes, then:
- for the nightly build CI, we can add a separate pipeline with a JDK17+Scala 2.13 environment to build the spark400 shim;
- for the pre-merge GitHub workflow, we can add a new profile like <snapshot400+> in pom.xml, and a new job/step in .github/workflows/mvn-verify-check.yml specific to the 400 shims profile (see the sketch below)
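A rough sketch of what such a profile could look like (the profile id and activation here are hypothetical; the real wiring also needs a matching job in mvn-verify-check.yml):

```xml
<profile>
  <!-- Hypothetical sketch: a profile for the 400 shim that only activates
       on JDK 17+, keeping it out of the JDK 8/11 pre-merge jobs -->
  <id>snapshot400</id>
  <activation>
    <jdk>[17,)</jdk>
  </activation>
  <!-- 400-specific modules/properties would go here -->
</profile>
```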
> Q: are we sure about the new rule for Spark 400+, i.e. that all versions from 400 onward will only build against JDK17+Scala 2.13?
Yes, Spark 400+ requires JDK17+Scala 2.13:
> Spark 4.0.0 is only supported for Scala 2.13
> Support for Spark 4.0.0 is only available with Java 17+
Regarding JDK17+JDK8 in the same jar, I don't think that's technically impossible. Jars are just zip files, and each class file separately contains metadata recording which JDK it was compiled with/for. If you couldn't combine class files compiled with different JDK versions, uber/shaded jars would have all sorts of problems, since one would need to worry about which JDK was used to build every dependency. The main issue would be trying to access JDK17-compiled files from JDK8, but if we limit the JDK17 class files to being accessed only by the Spark 4 shim, this should be OK in practice, since Spark 4 requires JDK17 to run and would be the only platform accessing those JDK17-compiled classes in the jar.
There's a bigger issue here though: I'm guessing most of the unshimmed class files may need to be shimmed, not because of APIs changing in Spark 4 but solely because they cannot be compiled with JDK8 and thus cannot generate a common class file across Spark versions. For example, I tried compiling the plugin against Spark 4 with JDK8. It complained that ColumnarCopyHelper, an unshimmed class, cannot be compiled because Spark's org.apache.spark.sql.execution.vectorized.WritableColumnVector class is compiled with version 61 and JDK8 only supports up to 52. When I hacked around that error, I hit GpuColumnVector, another unshimmed class, which cannot be compiled because Spark's org.apache.spark.sql.catalyst.expressions.Attribute class was compiled with 62 instead of 51. All of this implies that any plugin class that touches a Spark class would need to be shimmed, which would be almost the entire plugin jar.
JDK 9+ has a backwards-compatible solution to the problem of mixing code depending on different JDK releases: https://docs.oracle.com/en/java/javase/21/docs/specs/jar/jar.html#multi-release-jar-files
So a naive but robust solution is to group the existing build process by JDK and then combine the outputs in a multi-release jar. The file utility offers an easy way to script multi-release jar creation that is oblivious to the upstream pipeline:
$ file dist/target/parallel-world/com/nvidia/spark/SQLPlugin.class
dist/target/parallel-world/com/nvidia/spark/SQLPlugin.class: compiled Java class data, version 52.0 (Java 1.8)
$ file scala2.13/dist/target/parallel-world/com/nvidia/spark/SQLPlugin.class
scala2.13/dist/target/parallel-world/com/nvidia/spark/SQLPlugin.class: compiled Java class data, version 61.0
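A minimal sketch of combining the two builds with the JDK 9+ jar tool (the directory names are placeholders for the per-JDK build outputs; classes listed after --release 17 land under META-INF/versions/17/):
$ jar --create --file rapids-4-spark-multirelease.jar \
    -C jdk8-classes . \
    --release 17 -C jdk17-classes .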
Later we can optimize away the duplication with a methodology like #6555 and refactor to reduce the number of classes that depend on Spark.
Most likely we will not need changes for the versions of Spark before 4.0.0, even if the user has a custom build with JDK9+, because by default Spark cross-compiles down to JDK8: https://github.com/apache/spark/blob/0db5bdecfa6cbfff1be7690bb783a858989776b9/pom.xml#L115-L117
If cross-compilation to JDK8 is not used, we need to test whether it works via runtime backwards-compatibility, or whether there is some metadata remembering the compile-time class file format version.
I was able to manufacture a working multi-release jar with spark400 and spark330.
I am exploring cross-compilation from JDK17 to JDK8. It should require less work on the CI side.
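A minimal sketch of the idea (assuming the standard -release mechanism; the actual pom wiring may differ, and scalac 2.13 has an analogous -release option for the Scala sources):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-compiler-plugin</artifactId>
  <configuration>
    <!-- emit JDK 8 compatible byte code even when building on JDK 17 -->
    <release>8</release>
  </configuration>
</plugin>
```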
Because 400 was not previously tested in combination with other shims, there is outstanding coding work. Most notably, some unshimmed classes extend internal.Logging, which pulls in different methods between 3.x and 4.x.
verify-all-modules (311, 8) is in a state of 'Expected' because we haven't updated the parameter list in the GitHub Actions settings. We will do that once this PR is ready to merge, as it will affect all PRs.
Got the second prototype working using cross-compile https://github.com/razajafri/spark-rapids/pull/3
So the proposal is to simply switch at least the Scala 2.13 pipeline to JDK17 for building. Test runs for shims other than 4.0.0 can be executed with either JDK 8 or JDK 17; Spark 4.0 tests can only be run on 17+.
@pxLi All the cdh shims are failing because they are missing the flatbuffers-java dependency. Can you help me narrow this down?
> All the cdh shims are failing because they are missing the flatbuffers-java dependency.
I saw a compilation error but not a missing-dependency issue. If the incompatibility was not introduced in this PR, please file an issue for a developer to check, thanks.
[INFO] compiling 459 Scala sources and 58 Java sources to /home/runner/work/spark-rapids/spark-rapids/sql-plugin/target/spark332cdh/classes ...
Error: /home/runner/work/spark-rapids/spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/MetaUtils.scala:112: value createUnintializedVector is not a member of com.google.flatbuffers.FlatBufferBuilder
Error: /home/runner/work/spark-rapids/spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/MetaUtils.scala:235: value createUnintializedVector is not a member of com.google.flatbuffers.FlatBufferBuilder
Error: /home/runner/work/spark-rapids/spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/MetaUtils.scala:252: type ByteBufferFactory is not a member of object com.google.flatbuffers.FlatBufferBuilder
Error: /home/runner/work/spark-rapids/spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/MetaUtils.scala:303: value createUnintializedVector is not a member of com.google.flatbuffers.FlatBufferBuilder
Error: /home/runner/work/spark-rapids/spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/MetaUtils.scala:319: overloaded method constructor FlatBufferBuilder with alternatives:
(x$1: java.nio.ByteBuffer)com.google.flatbuffers.FlatBufferBuilder <and>
()com.google.flatbuffers.FlatBufferBuilder <and>
(x$1: Int)com.google.flatbuffers.FlatBufferBuilder
cannot be applied to (Int, com.nvidia.spark.rapids.DirectByteBufferFactory)
Error: /home/runner/work/spark-rapids/spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/MetaUtils.scala:328: overloaded method constructor FlatBufferBuilder with alternatives:
(x$1: java.nio.ByteBuffer)com.google.flatbuffers.FlatBufferBuilder <and>
()com.google.flatbuffers.FlatBufferBuilder <and>
(x$1: Int)com.google.flatbuffers.FlatBufferBuilder
cannot be applied to (Int, com.nvidia.spark.rapids.DirectByteBufferFactory)
Error: /home/runner/work/spark-rapids/spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/MetaUtils.scala:342: overloaded method constructor FlatBufferBuilder with alternatives:
(x$1: java.nio.ByteBuffer)com.google.flatbuffers.FlatBufferBuilder <and>
()com.google.flatbuffers.FlatBufferBuilder <and>
(x$1: Int)com.google.flatbuffers.FlatBufferBuilder
cannot be applied to (Int, com.nvidia.spark.rapids.DirectByteBufferFactory)
Error: /home/runner/work/spark-rapids/spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/MetaUtils.scala:368: overloaded method constructor FlatBufferBuilder with alternatives:
(x$1: java.nio.ByteBuffer)com.google.flatbuffers.FlatBufferBuilder <and>
()com.google.flatbuffers.FlatBufferBuilder <and>
(x$1: Int)com.google.flatbuffers.FlatBufferBuilder
cannot be applied to (Int, com.nvidia.spark.rapids.DirectByteBufferFactory)
Error: /home/runner/work/spark-rapids/spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/MetaUtils.scala:322: local val finIndex in method buildMetaResponse is never used
Error: /home/runner/work/spark-rapids/spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/MetaUtils.scala:337: local val finIndex in method buildShuffleMetadataRequest is never used
Error: /home/runner/work/spark-rapids/spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/MetaUtils.scala:375: local val root in method buildBufferTransferResponse is never used
Error: /home/runner/work/spark-rapids/spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/MetaUtils.scala:390: local val transferRequestOffset in method buildTransferRequest is never used
Error: [ERROR] 12 errors found
I cannot repro the cdh compilation issue with the latest upstream. I suspect it was introduced by the plugin version updates in this change.
Using your change but keeping scala-maven-plugin at 4.3.0 (not the new default 4.7.1) for the cdh profiles, compilation passed.
I am not too familiar with the CDH build environment; cc @tgravescs to help check whether there is any known issue or whether we must hardcode some compiler tooling versions for the CDH shims, thanks.
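For illustration, the pin could look roughly like this (the profile id and property name are hypothetical; the real pom may wire the plugin version differently per profile):

```xml
<profile>
  <id>release321cdh</id>
  <properties>
    <!-- hypothetical: keep the older scala-maven-plugin for cdh shims -->
    <scala.plugin.version>4.3.0</scala.plugin.version>
  </properties>
</profile>
```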
Filed an issue for Cloudera folks to add support for their shims against JDK 17 here
build
Blossom build is dependent on https://github.com/NVIDIA/spark-rapids/issues/11113
The pre-merge CI change for the Scala 2.13 + JDK17 build/test has been done: https://github.com/NVIDIA/spark-rapids/pull/11117
For the Scala 2.13 final dist jar's nightly build and IT, as discussed with @razajafri:
- We will build the Scala 2.13 dist jar for Spark 330+ with JDK 17
- We should run the Scala 2.13 dist jar's integration tests with a JDK 8 runtime
Also, Peixin and I discussed that we may need a larger IT matrix against the dist jar, e.g. Scala 2.13 + [JDK8/JDK11/JDK17].
@gerashegalov @jlowe @revans2 @tgravescs Please let me know if you have any concern, thanks!
cc @pxLi @GaryShen2008 @sameerz
Blossom nightly build issue: https://github.com/NVIDIA/spark-rapids/issues/11114
I could not reproduce the CDH compile issue https://github.com/NVIDIA/spark-rapids/pull/10994#issuecomment-2198819837 with scala-maven-plugin 4.7.1; it compiled fine for me.
Moreover, CDH cannot be built with the PR in its current shape using JDK17:
$ JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64 mvn -version
Apache Maven 3.9.3 (21122926829f1ead511c958d89bd2f672198ae9f)
Maven home: /home/gshegalov/dist/apache-maven-3.9.3
Java version: 17.0.11, vendor: Ubuntu, runtime: /usr/lib/jvm/java-17-openjdk-amd64
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "6.5.0-41-generic", arch: "amd64", family: "unix"
$ JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64 mvn clean install -DskipTests -Dmaven.scaladoc.skip -Dbuildver=321cdh -pl aggregator -am
12:13:12,271 [INFO] Compiling 12 Scala sources and 2 Java sources to /home/gshegalov/gits/NVIDIA/spark-rapids.worktrees/spark400/sql-plugin-api/target/spark321cdh/classes ...
12:13:13,890 [WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8
12:13:13,985 [ERROR] [Error] : warnings found and -Werror specified
Talking to @razajafri, it turns out that the CDH failure comment https://github.com/NVIDIA/spark-rapids/pull/10994#issuecomment-2198819837 refers to this very PR's workflow failure https://github.com/NVIDIA/spark-rapids/actions/runs/9732795299/job/26860542937. I suspect the reason is this PR's change to make cache-dependencies a matrix job; the caches may then clobber each other somehow.
I can reproduce the issue locally with the commands below:
openjdk version "1.8.0_292"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_292-b10)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.292-b10, mixed mode)
Apache Maven 3.9.7 (8b094c9513efc1b9ce2d952b3b9c8eaedaf8cbf0)
Maven home: /opt/homebrew/Cellar/maven/3.9.7/libexec
Java version: 1.8.0_292, vendor: AdoptOpenJDK
mvn package -am -pl integration_tests,tests,tools -P 'individual,pre-merge' -Dbuildver=321cdh -Dmaven.scalastyle.skip=true -Drat.skip=true
# also
mvn clean install -DskipTests -Dmaven.scaladoc.skip -Dbuildver=321cdh -pl aggregator -am
@razajafri @gerashegalov can you share your repro command and environment info? Have you tried JDK8, which the cdh shim should be built with? And which commit are you building against? I am building against a merge of this change (https://github.com/razajafri/spark-rapids/tree/SP-9259-POM-changes) into upstream branch-24.08, with the hardcoded 430 simply reverted.
UPDATE: ~~I also tried JDK17, but it failed the same way. Can you try purging your local .m2? There might be some old transitive dependencies, and a fresh run could introduce new versions which caused the issue.~~ Use the commands below for a local repro.
This also reproduces within our nightly docker image (mvn 3.6.3 + JDK8):
docker run --rm -it urm.nvidia.com/sw-spark-docker-local/plugin:dev-ubuntu20-cuda11.0.3-blossom-dev bash
git clone -b test-pom-changes https://github.com/razajafri/spark-rapids.git
cd spark-rapids
mvn -version
mvn package -am -pl integration_tests,tests,tools -P 'individual,pre-merge' -Dbuildver=321cdh -Dmaven.scalastyle.skip=true -Drat.skip=true
> can you share your repro command and environment info?
I am able to repro on branch-24.08 with no change other than upgrading the scala.maven.plugin version to 4.7.1. I can spend more time on this tomorrow.
Upgrading the scala-maven-plugin version to 4.9.1, which is what Spark is using now, fixes the problem.
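That is, roughly (the plugin may also be versioned via a shared property in the real pom):

```xml
<plugin>
  <groupId>net.alchim31.maven</groupId>
  <artifactId>scala-maven-plugin</artifactId>
  <version>4.9.1</version>
</plugin>
```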