# Add spark350emr shim layer [EMR]
## Summary

This PR adds a new shim layer, spark350emr, which supports running Spark RAPIDS on AWS EMR Spark 3.5.0.
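For context, under the repo's shimplify convention, adding a shim id typically means listing it in the `spark-rapids-shim-json-lines` comment of each shared source file. The snippet below is only an illustration of that pattern; I'm assuming the EMR id follows the same naming scheme as the existing Databricks ids (e.g. `330db`):

```scala
// Illustrative only: a shimmed source file declares which shim builds it
// belongs to. Adding the new EMR shim means appending its id alongside spark350.
/*** spark-rapids-shim-json-lines
{"spark": "350"}
{"spark": "350emr"}
spark-rapids-shim-json-lines ***/
```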
## Testing

### Unit Testing

I ran the full suite of unit tests with the following command:

```
mvn clean install
```
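As a side note for reviewers, building only the new shim should also work with the repo's per-shim `buildver` property; this is a sketch that assumes the property accepts the new id once this PR is in:

```
mvn clean install -Dbuildver=350emr
```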
With the full `mvn clean install`, I got the following results:

```
Run completed in 30 minutes, 23 seconds.
Total number of tests run: 1214
Suites: completed 123, aborted 0
Tests: succeeded 1205, failed 9, canceled 54, ignored 16, pending 0
```
After some investigation and analysis, I traced the nine failures to the following suites:
**AdaptiveQueryExecSuite**

```
- Join partitioned tables DPP fallback *** FAILED ***
- Exchange reuse *** FAILED ***
```

NOTE: the following exception appears in the log:

```
Part of the plan is not columnar class org.apache.spark.sql.execution.exchange.ShuffleExchangeExec
Part of the plan is not columnar class org.apache.spark.sql.execution.FilterExec
```
**SortExecSuite**

```
- IGNORE ORDER: join longs *** FAILED ***
- IGNORE ORDER: join longs multiple batches *** FAILED ***
```

NOTE: the following exception appears in the log:

```
Part of the plan is not columnar class org.apache.spark.sql.execution.exchange.ShuffleExchangeExec
```
**CostBasedOptimizerSuite**

```
- Force section of plan back onto CPU, AQE off *** FAILED ***
- keep CustomShuffleReaderExec on GPU *** FAILED ***
- Compute estimated row count nested joins no broadcast *** FAILED ***
```

NOTE: the following exception appears in the log:

```
Part of the plan is not columnar class org.apache.spark.sql.execution.FilterExec
```
**JoinsSuite**

```
- IGNORE ORDER: IGNORE ORDER: Test hash join *** FAILED ***
- INCOMPAT, IGNORE ORDER: Test replace sort merge join with hash join *** FAILED ***
```

NOTE: the following exception appears in the log:

```
Part of the plan is not columnar class org.apache.spark.sql.execution.exchange.ShuffleExchangeExec
```
We previously started an email thread to discuss these unit-test failures, but we haven't been able to fix them yet and could use some help with them.
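For anyone picking this up, a single failing suite can be rerun in isolation via scalatest-maven-plugin's wildcard selector; this is a sketch in which the `tests` module path and the suite's fully qualified name are my assumptions:

```
mvn -pl tests test -Dbuildver=350emr \
    -DwildcardSuites=com.nvidia.spark.rapids.AdaptiveQueryExecSuite
```

Setting `spark.rapids.sql.explain=NOT_ON_GPU` in the test's Spark conf should additionally log why ShuffleExchangeExec and FilterExec are falling back to the CPU.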
### Integration Testing

I ran the full suite of integration tests and found the following failures, all of which we've discussed before:
```
FAILED src/main/python/arithmetic_ops_test.py::test_multiplication[Decimal(38,10)][DATAGEN_SEED=0, INJECT_OOM]
FAILED src/main/python/arithmetic_ops_test.py::test_multiplication_mixed[Integer-Decimal(30,10)][DATAGEN_SEED=0]
FAILED src/main/python/arithmetic_ops_test.py::test_multiplication_mixed[Decimal(10,-2)-Decimal(30,10)][DATAGEN_SEED=0]
FAILED src/main/python/arithmetic_ops_test.py::test_multiplication_mixed[Decimal(15,3)-Decimal(30,10)][DATAGEN_SEED=0, INJECT_OOM]
FAILED src/main/python/arithmetic_ops_test.py::test_multiplication_mixed[Decimal(30,12)-Integer][DATAGEN_SEED=0, INJECT_OOM]
FAILED src/main/python/arithmetic_ops_test.py::test_multiplication_mixed[Decimal(30,12)-Decimal(16,7)][DATAGEN_SEED=0]
FAILED src/main/python/arithmetic_ops_test.py::test_multiplication_mixed[Decimal(27,7)-Decimal(16,7)][DATAGEN_SEED=0, INJECT_OOM]
```
These tests fail because AWS EMR back-ported the fix for this issue into its Spark 3.5.0 build, so I have marked them as xfail in this change.
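For illustration, the marking looks roughly like the sketch below. The `is_emr_runtime()` helper is hypothetical, named by analogy with the existing `is_databricks_runtime()` helper; the actual change may detect the EMR build differently:

```python
import os
import pytest

def is_emr_runtime():
    # Hypothetical check: EMR Spark builds carry an 'amzn' suffix in the
    # version string, e.g. 3.5.0-amzn-0. The real detection may differ.
    return 'amzn' in os.environ.get('SPARK_VER', '')

@pytest.mark.xfail(condition=is_emr_runtime(),
                   reason='EMR Spark 3.5.0 back-ported the upstream fix')
def test_multiplication_sketch():
    assert 2 * 3 == 6
```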
### Reproduction

To reproduce my test results, you can use the following environment:

- This PR applied as a patch on top of branch-24.02.
- AWS EMR Spark 3.5.0-amzn-0, which you can get from an AWS EMR cluster with release emr-7.0.0.
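Concretely, the setup amounts to something like this; the patch filename is a placeholder for the attachment below:

```
git clone https://github.com/NVIDIA/spark-rapids.git
cd spark-rapids
git checkout branch-24.02
git apply spark350emr.patch   # placeholder name; use the patch from unit_tests.zip
```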
Also, for your convenience, I've attached the patch and `scala-test-detailed-output.log` here: unit_tests.zip
## Others

For this change to build successfully, we also need the following change to the private-artifact dependency:
```diff
 <dependency>
   <groupId>com.nvidia</groupId>
   <artifactId>rapids-4-spark-private_${scala.binary.version}</artifactId>
   <version>${spark-rapids-private.version}</version>
-  <classifier>${spark.version.classifier}</classifier>
+  <classifier>spark350</classifier>
 </dependency>
```
I think this is because there is no such private artifact published for spark350emr. Do we need an additional change for this, or will NVIDIA take care of it?
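For reference, whether a given classifier of the private artifact exists can be checked with `mvn dependency:get`; the version string below is a placeholder:

```
mvn dependency:get \
    -DgroupId=com.nvidia \
    -DartifactId=rapids-4-spark-private_2.12 \
    -Dversion=24.02.0-SNAPSHOT \
    -Dclassifier=spark350emr
```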
This PR adds a new 350emr shim to branch-24.02; will it be included in our 24.02.0 release? I assume compilation and integration tests will have to run on an EMR cluster. That is a risk, since burndown is only one week away (2024-01-29) according to our release plan, and we may not have time to set up compile/integration-test CI jobs on an EMR cluster.
For the failing PR checks, you need to run `./build/make-scala-version-build-files.sh 2.13` and merge the modified POMs into the PR branch; see https://github.com/NVIDIA/spark-rapids/blob/645daa166a781b47088de4c971af1da933caac23/CONTRIBUTING.md?plain=1#L72
Closing until we can retarget this to the latest branch.