
[BUG] ZSTD version mismatch in integration tests

Open parthosa opened this issue 1 year ago • 3 comments

Multiple integration tests (test_parquet_append_with_downcast, test_parquet_write_column_name_with_dots, etc.) failed with the following error:

[2024-03-13T19:07:42.384Z] E  : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2428.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2428.0 (TID 2996) (rapids-it-dataproc-20-ubuntu18-430-<url> executor 1): java.io.IOException: Decompression error: Version not supported
[2024-03-13T19:07:42.384Z] E  at com.github.luben.zstd.ZstdInputStream.readInternal(ZstdInputStream.java:185)
[2024-03-13T19:07:42.384Z] E  at com.github.luben.zstd.ZstdInputStream.read(ZstdInputStream.java:137)

parthosa avatar Mar 13 '24 22:03 parthosa

I dug into this a bit, and unexpectedly found that the RAPIDS Accelerator is not using ZSTD during these tests. Dataproc 2.0 is running Spark 3.1.x, so the tests avoid trying to use the ZSTD codec in that case. However, Spark itself uses ZSTD to compress the map status statistics during shuffle, and that is what fails during decode. The RAPIDS Accelerator shouldn't be involved in that code path at all, especially since the RAPIDS shuffle is not configured for these tests.

I tried rolling back to a couple of plugin snapshot versions that were known to pass (one each from 3/10 and 2/28), and both failed in the same way. I ssh'd to the worker nodes and manually verified that the classpath was using the intended jar version and not the new one.
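For reference, a minimal sketch of how one could verify this on a node. The jar directory is the usual Dataproc layout, not confirmed from this CI environment, so adjust paths as needed; since Spark 3.0, spark.shuffle.mapStatus.compression.codec defaults to zstd, so this path is exercised even without the RAPIDS shuffle:

```bash
# Check which zstd-jni jar is actually on the Spark classpath
# (directory is illustrative; adjust for your Dataproc image).
ls /usr/lib/spark/jars | grep -i zstd

# Print the codec Spark will use for MapStatus compression;
# the second argument is the documented default for Spark 3.x.
spark-shell <<'EOF'
println(spark.conf.get("spark.shuffle.mapStatus.compression.codec", "zstd"))
EOF
```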

jlowe avatar Mar 18 '24 14:03 jlowe

Looks like this is related to SPARK-35199. The workaround provided is to set spark.shuffle.mapStatus.compression.codec to lz4. I believe these issues were fixed starting with Spark 3.2, so could you please confirm whether you are observing this issue on Dataproc 2.1?
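As a sketch of the workaround (the application file below is a placeholder, not from this thread):

```bash
# Force LZ4 for MapStatus compression so the zstd-jni code path that raises
# "Decompression error: Version not supported" is never taken (see SPARK-35199).
spark-submit \
  --conf spark.shuffle.mapStatus.compression.codec=lz4 \
  your_app.py   # placeholder application
```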

jayadeep-jayaraman avatar Mar 20 '24 03:03 jayadeep-jayaraman

> Looks like this is related to SPARK-35199. The workaround provided is to set spark.shuffle.mapStatus.compression.codec to lz4. I believe these issues were fixed starting with Spark 3.2, so could you please confirm whether you are observing this issue on Dataproc 2.1?

Tried on the 2.0 version: with --conf spark.shuffle.mapStatus.compression.codec=lz4, all test cases also PASS.

[2024-03-27T07:27:28.304Z] ++ SPARK_SUBMIT_FLAGS='--master yarn --num-executors 1
  --executor-memory 10G --conf spark.yarn.tags=jenkins-tim-rapids-it-dataproc-2.0-ubuntu18-5
  --conf spark.yarn.maxAppAttempts=1 --conf spark.yarn.appMasterEnv.PYSP_TEST_spark_eventLog_enabled=true
  --conf spark.sql.adaptive.enabled=true --conf spark.task.cpus=1 --conf spark.task.resource.gpu.amount=0.25
  --conf spark.executor.cores=4 --conf spark.locality.wait=0 --conf spark.shuffle.mapStatus.compression.codec=lz4'

[2024-03-27T07:27:28.304Z] ++ cd integration_tests


[2024-03-27T07:40:44.009Z] -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
[2024-03-27T07:40:44.009Z] - generated xml file: integration_tests/target/run_dir-20240327072734-OhFS/TEST-pytest-1711524454084534245.xml -
[2024-03-27T07:40:44.009Z]  428 passed, 18 skipped, 26631 deselected, 22 xfailed, 2 xpassed, 400 warnings in 770.38s (0:12:50)

NvTimLiu avatar Mar 27 '24 16:03 NvTimLiu

@NvTimLiu as discussed, let's update only the Dataproc 2.0 integration tests to use --conf spark.shuffle.mapStatus.compression.codec=lz4. Once that is done, let's close this issue.
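A minimal sketch of what that CI change could look like. The variable names below are assumptions for illustration, not taken from the actual Jenkins script:

```bash
# Hypothetical Jenkins CI snippet: apply the workaround only on Dataproc 2.0
# (Spark 3.1.x), where SPARK-35199 is still present.
# DATAPROC_IMAGE_VERSION and SPARK_SUBMIT_FLAGS are assumed variable names.
if [[ "$DATAPROC_IMAGE_VERSION" == 2.0* ]]; then
  SPARK_SUBMIT_FLAGS+=" --conf spark.shuffle.mapStatus.compression.codec=lz4"
fi
```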

sameerz avatar Apr 01 '24 16:04 sameerz

Our fix, adding --conf spark.shuffle.mapStatus.compression.codec=lz4 to the Dataproc 2.0 integration tests only, has been applied in our Jenkins CI script, and the dataproc-2.0-ubuntu18 CI jobs have been passing for days.

[screenshot: passing dataproc-2.0-ubuntu18 CI job results]

Let's close the issue. Thanks!

NvTimLiu avatar Apr 07 '24 13:04 NvTimLiu