[BUG] ZSTD version mismatch in integration tests
Multiple integration tests (test_parquet_append_with_downcast, test_parquet_write_column_name_with_dots, etc.) failed with the following error:
```
[2024-03-13T19:07:42.384Z] E : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2428.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2428.0 (TID 2996) (rapids-it-dataproc-20-ubuntu18-430-<url> executor 1): java.io.IOException: Decompression error: Version not supported
[2024-03-13T19:07:42.384Z] E at com.github.luben.zstd.ZstdInputStream.readInternal(ZstdInputStream.java:185)
[2024-03-13T19:07:42.384Z] E at com.github.luben.zstd.ZstdInputStream.read(ZstdInputStream.java:137)
```
I dug into this a bit and, unexpectedly, found that the RAPIDS Accelerator is not using ZSTD during these tests. Dataproc 2.0 is running Spark 3.1.x, so the tests avoid the ZSTD codec in that case. However, Spark itself is trying to use ZSTD for the map statistics during shuffle, and that is what fails during decode. The RAPIDS Accelerator shouldn't be involved in that code path at all, especially since the RAPIDS shuffle is not configured for these tests.
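For context, Spark 3.x compresses serialized MapStatus data with ZSTD by default (spark.shuffle.mapStatus.compression.codec defaults to zstd), so a zstd-jni mismatch on a node can break shuffle even when the tests never configure ZSTD themselves. A quick way to look for stray zstd-jni jars on a node (the directories below are assumptions about the Dataproc image layout, not verified paths):

```bash
# Look for zstd-jni jars that could shadow the one Spark ships with.
# /usr/lib/spark/jars and /usr/lib/hadoop/lib are assumed Dataproc locations.
find /usr/lib/spark/jars /usr/lib/hadoop/lib -name 'zstd-jni-*.jar' 2>/dev/null
```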
I tried rolling back to a couple of plugin snapshot versions that were known to pass (one each from 3/10 and 2/28), and they both fail in the same way. I ssh'd to the worker nodes to manually verify that the classpath was using the intended jar version and not the new one.
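For reference, the manual check was along these lines (a sketch; the worker hostnames and jar directory are placeholders, not the actual cluster values):

```bash
# Confirm each worker has only the intended plugin jar on the classpath.
# Hostnames and the jar directory are hypothetical.
for host in rapids-it-worker-0 rapids-it-worker-1; do
  ssh "$host" 'ls -l /usr/lib/spark/jars/rapids-4-spark_*.jar'
done
```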
Looks like this is related to SPARK-35199. The workaround provided is to set spark.shuffle.mapStatus.compression.codec to lz4. I believe these issues are fixed starting with Spark 3.2; could you please confirm whether you are observing this issue on Dataproc 2.1?
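In spark-submit form the workaround is a single extra --conf (a minimal sketch; the application jar and any remaining flags are placeholders):

```bash
# SPARK-35199 workaround: compress MapStatus with lz4 instead of zstd.
# app.jar is a placeholder for the actual workload.
spark-submit \
  --master yarn \
  --conf spark.shuffle.mapStatus.compression.codec=lz4 \
  app.jar
```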
Tried on the 2.0 version; with --conf spark.shuffle.mapStatus.compression.codec=lz4 all test cases also PASS:
```
[2024-03-27T07:27:28.304Z] ++ SPARK_SUBMIT_FLAGS='--master yarn --num-executors 1
--executor-memory 10G --conf spark.yarn.tags=jenkins-tim-rapids-it-dataproc-2.0-ubuntu18-5
--conf spark.yarn.maxAppAttempts=1 --conf spark.yarn.appMasterEnv.PYSP_TEST_spark_eventLog_enabled=true
--conf spark.sql.adaptive.enabled=true --conf spark.task.cpus=1 --conf spark.task.resource.gpu.amount=0.25
--conf spark.executor.cores=4 --conf spark.locality.wait=0 --conf spark.shuffle.mapStatus.compression.codec=lz4 '
[2024-03-27T07:27:28.304Z] ++ cd integration_tests
[2024-03-27T07:40:44.009Z] -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
[2024-03-27T07:40:44.009Z] - generated xml file: integration_tests/target/run_dir-20240327072734-OhFS/TEST-pytest-1711524454084534245.xml -
[2024-03-27T07:40:44.009Z] 428 passed, 18 skipped, 26631 deselected, 22 xfailed, 2 xpassed, 400 warnings in 770.38s (0:12:50)
```
@NvTimLiu, as discussed, let's update the Dataproc 2.0 integration tests only to use --conf spark.shuffle.mapStatus.compression.codec=lz4. Once that is done, let's close this issue.
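A minimal sketch of how the CI script could scope the flag to Dataproc 2.0 only (the DATAPROC_VERSION variable and the surrounding script structure are assumptions, not the actual Jenkins code):

```bash
# Append the workaround only for Dataproc 2.0 (Spark 3.1.x) runs.
# DATAPROC_VERSION is a hypothetical variable name.
if [[ "$DATAPROC_VERSION" == 2.0* ]]; then
  SPARK_SUBMIT_FLAGS="$SPARK_SUBMIT_FLAGS --conf spark.shuffle.mapStatus.compression.codec=lz4"
fi
```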
Our fix, adding --conf spark.shuffle.mapStatus.compression.codec=lz4 to the Dataproc 2.0 integration tests only, has been applied in our Jenkins CI script, and the dataproc-2.0-ubuntu18 CI jobs have been passing for days.
Let's close the issue, thanks.