
Add spark350emr shim layer [EMR]

mimaomao opened this issue 1 year ago

Summary

This PR adds a new shim layer, spark350emr, which supports running Spark RAPIDS on AWS EMR Spark 3.5.0.

Testing

Unit Testing

I ran the full suite of unit tests with the following command:

mvn clean install

and got the following results:

Run completed in 30 minutes, 23 seconds.
Total number of tests run: 1214
Suites: completed 123, aborted 0
Tests: succeeded 1205, failed 9, canceled 54, ignored 16, pending 0

After some investigation and analysis, I grouped the nine failures as follows:

AdaptiveQueryExecSuite
- Join partitioned tables DPP fallback *** FAILED ***
- Exchange reuse *** FAILED ***

NOTE: The following exception appears in the log.
Part of the plan is not columnar class org.apache.spark.sql.execution.exchange.ShuffleExchangeExec
Part of the plan is not columnar class org.apache.spark.sql.execution.FilterExec
SortExecSuite
- IGNORE ORDER: join longs *** FAILED ***
- IGNORE ORDER: join longs multiple batches *** FAILED ***

NOTE: The following exception appears in the log.
Part of the plan is not columnar class org.apache.spark.sql.execution.exchange.ShuffleExchangeExec
CostBasedOptimizerSuite
- Force section of plan back onto CPU, AQE off *** FAILED ***
- keep CustomShuffleReaderExec on GPU *** FAILED ***
- Compute estimated row count nested joins no broadcast *** FAILED ***

NOTE: The following exception appears in the log.
Part of the plan is not columnar class org.apache.spark.sql.execution.FilterExec
JoinsSuite
- IGNORE ORDER: IGNORE ORDER: Test hash join *** FAILED ***
- INCOMPAT, IGNORE ORDER: Test replace sort merge join with hash join *** FAILED ***

NOTE: The following exception appears in the log.
Part of the plan is not columnar class org.apache.spark.sql.execution.exchange.ShuffleExchangeExec
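
To rerun a single failing suite for debugging, something like the following should work with the scalatest-maven-plugin; the module path and the fully qualified suite name are assumptions and may need adjusting to the local checkout:

mvn test -pl tests -am -DwildcardSuites=com.nvidia.spark.rapids.AdaptiveQueryExecSuite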

We previously started an email thread to discuss these unit test failures, but we have not fixed them yet and need some help with them.

Integration Testing

I ran the full suite of integration tests and found the test failures we've discussed before:

FAILED src/main/python/arithmetic_ops_test.py::test_multiplication[Decimal(38,10)][DATAGEN_SEED=0, INJECT_OOM]
FAILED src/main/python/arithmetic_ops_test.py::test_multiplication_mixed[Integer-Decimal(30,10)][DATAGEN_SEED=0]
FAILED src/main/python/arithmetic_ops_test.py::test_multiplication_mixed[Decimal(10,-2)-Decimal(30,10)][DATAGEN_SEED=0]
FAILED src/main/python/arithmetic_ops_test.py::test_multiplication_mixed[Decimal(15,3)-Decimal(30,10)][DATAGEN_SEED=0, INJECT_OOM]
FAILED src/main/python/arithmetic_ops_test.py::test_multiplication_mixed[Decimal(30,12)-Integer][DATAGEN_SEED=0, INJECT_OOM]
FAILED src/main/python/arithmetic_ops_test.py::test_multiplication_mixed[Decimal(30,12)-Decimal(16,7)][DATAGEN_SEED=0]
FAILED src/main/python/arithmetic_ops_test.py::test_multiplication_mixed[Decimal(27,7)-Decimal(16,7)][DATAGEN_SEED=0, INJECT_OOM]

These tests fail because AWS EMR back-ported the fix for this issue into its Spark 3.5.0. Thus, I marked them as xfail in this change, roughly as sketched below.
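
For illustration, a minimal sketch of such a marking, assuming a hypothetical is_emr_runtime() helper (the actual change may detect the EMR runtime differently):

import os
import pytest

# Hypothetical helper: treat a Spark version string containing 'amzn'
# (e.g. 3.5.0-amzn-0) as an EMR build.
def is_emr_runtime():
    return 'amzn' in os.environ.get('SPARK_VER', '')

@pytest.mark.xfail(condition=is_emr_runtime(),
                   reason='EMR Spark 3.5.0 back-ported the upstream fix, '
                          'so results differ from Apache Spark 3.5.0')
def test_multiplication_mixed():
    ...  # the real test body lives in arithmetic_ops_test.py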

Reproduction

To reproduce my test results, you can use the following environment:

  1. This PR applied as a patch on top of branch-24.02 (see the build sketch after this list).
  2. AWS EMR Spark: 3.5.0-amzn-0. You can get it from AWS EMR cluster with release emr-7.0.0.
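
For example, a build against the new shim might look like the following; the patch file name and the -Dbuildver=350emr profile name are assumptions based on the shim naming convention:

git clone -b branch-24.02 https://github.com/NVIDIA/spark-rapids.git
cd spark-rapids
git apply spark350emr.patch          # hypothetical patch file name
mvn clean install -Dbuildver=350emr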

Also, for your convenience, I've attached this patch and scala-test-detailed-output.log here: unit_tests.zip

Others

For this change to build successfully, we also need the following change:

<!-- before -->
<dependency>
  <groupId>com.nvidia</groupId>
  <artifactId>rapids-4-spark-private_${scala.binary.version}</artifactId>
  <version>${spark-rapids-private.version}</version>
  <classifier>${spark.version.classifier}</classifier>
</dependency>

<!-- after -->
<dependency>
  <groupId>com.nvidia</groupId>
  <artifactId>rapids-4-spark-private_${scala.binary.version}</artifactId>
  <version>${spark-rapids-private.version}</version>
  <classifier>spark350</classifier>
</dependency>

I think this is because there is no such private artifact for spark350emr. Do we need an additional change for this, or will NVIDIA take care of it?

mimaomao avatar Jan 16 '24 09:01 mimaomao

This PR will add a new 350emr shim into branch-24.02; will this be included in our 24.02.0 release?

I suppose compiling and integration tests will need to run on an EMR cluster.

This would be a risk, as burndown is only one week away (2024-01-29) according to our release plan, and we may not have time to set up compile/integration-test CI jobs on an EMR cluster.

NvTimLiu avatar Jan 18 '24 14:01 NvTimLiu

For the failing PR checks, you need to run ./build/make-scala-version-build-files.sh 2.13 and merge the modified poms into the PR branch (a sketch follows): https://github.com/NVIDIA/spark-rapids/blob/645daa166a781b47088de4c971af1da933caac23/CONTRIBUTING.md?plain=1#L72
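
A minimal sketch of that workflow (the commit message is illustrative):

./build/make-scala-version-build-files.sh 2.13
git add -u                           # pick up the regenerated Scala 2.13 poms
git commit -m "Regenerate Scala 2.13 poms"
git push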

gerashegalov avatar Jan 20 '24 05:01 gerashegalov

Closing until we can retarget to the latest branch

sameerz avatar Jul 30 '24 20:07 sameerz