[WIP][SPARK-54571][CORE] Use LZ4 safeDecompressor
What changes were proposed in this pull request?
In recent lz4-java versions, safeDecompressor has become highly optimized and can be as fast as, or sometimes even faster than, fastDecompressor. So it makes sense to switch to safeDecompressor.
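For illustration only (not the exact Spark change), here is a minimal sketch of the two decompressor APIs in lz4-java; the key difference is that the safe variant bounds-checks against the destination buffer:

```java
import java.nio.charset.StandardCharsets;

import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;
import net.jpountz.lz4.LZ4FastDecompressor;
import net.jpountz.lz4.LZ4SafeDecompressor;

public class SafeVsFastDecompressor {
    public static void main(String[] args) {
        LZ4Factory factory = LZ4Factory.fastestInstance();
        byte[] data = "hello lz4".getBytes(StandardCharsets.UTF_8);

        LZ4Compressor compressor = factory.fastCompressor();
        byte[] compressed = new byte[compressor.maxCompressedLength(data.length)];
        int compressedLen = compressor.compress(data, 0, data.length, compressed, 0);

        // fastDecompressor(): trusts the caller-supplied decompressed length
        // and does not validate the compressed input as strictly.
        LZ4FastDecompressor fast = factory.fastDecompressor();
        byte[] out1 = new byte[data.length];
        fast.decompress(compressed, 0, out1, 0, data.length);

        // safeDecompressor(): validates the input and never writes past the
        // destination buffer, throwing LZ4Exception on malformed data.
        LZ4SafeDecompressor safe = factory.safeDecompressor();
        byte[] out2 = new byte[data.length];
        int written = safe.decompress(compressed, 0, compressedLen, out2, 0);
        System.out.println(written + " bytes decompressed");
    }
}
```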
Why are the changes needed?
Per https://sites.google.com/sonatype.com/vulnerabilities/cve-2025-12183, it is recommended to switch to .safeDecompressor(), which is not vulnerable and provides better performance.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests
Was this patch authored or co-authored using generative AI tooling?
No
It's more involved than I thought, as LZ4BlockInputStream doesn't take a safeDecompressor. I will take a deeper look tomorrow.
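For context, a minimal sketch of the sticking point (the file name is a placeholder): in lz4-java 1.8.x, LZ4BlockInputStream is typed against LZ4FastDecompressor only, so safeDecompressor() cannot be plugged into the stream path.

```java
import java.io.FileInputStream;
import java.io.InputStream;

import net.jpountz.lz4.LZ4BlockInputStream;
import net.jpountz.lz4.LZ4Factory;

public class BlockStreamLimitation {
    public static void main(String[] args) throws Exception {
        InputStream in = new FileInputStream("shuffle-block.lz4"); // placeholder path
        // The constructor accepts only an LZ4FastDecompressor; there is no
        // overload taking an LZ4SafeDecompressor in 1.8.x.
        try (LZ4BlockInputStream stream = new LZ4BlockInputStream(
                in, LZ4Factory.fastestInstance().fastDecompressor())) {
            stream.read();
        }
    }
}
```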
It seems that the fix for CVE-2025-12183 wasn't implemented until version 1.8.1, but Spark is still using version 1.8.0.
https://github.com/yawkat/lz4-java/releases
Note that LZ4BlockInputStream does not support safeDecompressor in lz4-java 1.8.1. If you upgrade to that version, it will still work and be secure, but performance will be much worse than in 1.8.0.
lz4-java 1.10.0 introduces a new builder for LZ4BlockInputStream that accepts a safeDecompressor.
It is published, but only under the new group id.
@yawkat, which group id is 1.10.0 published under?
We need to change the dependency to use:
```xml
<dependency>
  <groupId>at.yawk.lz4</groupId>
  <artifactId>lz4-java</artifactId>
  <version>1.10.0</version>
</dependency>
```
https://repo.maven.apache.org/maven2/at/yawk/lz4/lz4-java/
Thank you for the updated info, @yawkat, @LuciferYang.
Could you update this PR, @dbtsai?
To all, in order to help this PR, I made an independent PR for the dependency upgrade.
- https://github.com/apache/spark/pull/53327
We upgraded to 1.10.0 a few minutes ago. Could you rebase this PR onto the master branch once more, @dbtsai?
- https://github.com/apache/spark/pull/53327
I recommend you wait a few hours before releasing this. Another (smaller, unrelated) CVE has been found in lz4-java.
CVE-2025-66566 has been published and fixed in 1.10.1. I suggest you move to that version. Though Cloudflare seems to be having some trouble that is breaking Maven Central at the moment.
> I recommend you wait a few hours before releasing this. Another (smaller, unrelated) CVE has been found in lz4-java.
Just FYI, we have no intention to hurry this, @yawkat. To be safe, this will be tested on the master branch for the next few months, in the Apache Spark 4.2.0 timeframe. We are currently voting on Apache Spark 4.1.0, and it's too late for this kind of risky dependency change there.
That's the main reason why the LZ4 1.10.0 PR is only in the master branch. The dependency and code changes only lay the groundwork for further testing across various workloads.
- #53327
Gentle ping once more, @dbtsai .
Gentle ping, @dbtsai .
The PR description mentions that safeDecompressor is fast enough.
While the CVE is a good enough reason to switch, can we link benchmark results in the PR description to support this?
I am not aware of Spark benchmarks, but as of 1.8.1, safeDecompressor is substantially faster than fastDecompressor. In earlier versions, the difference was minor.
@yawkat, how about the performance of 1.10.0? @dongjoon-hyun has already bumped to 1.10.0.
Performance between 1.8.1 and 1.10.1 has not changed substantially.
@yawkat I am not questioning whether it is faster or not; I am sure it is, else @dbtsai won't be raising this PR :) Given the PR description, adding a reference that makes this clear would help.
@mridulm I found a perf report at https://github.com/yawkat/lz4-java/pull/3#issuecomment-3599886694, but it does not include the underlying data.
@mridulm @yawkat @dbtsai @dongjoon-hyun @SteNicholas I created #53453 to add an lz4 benchmark based on the TPCDS catalog_sales.dat (SF1) file, which is about 283 MB. My test results show that lz4-java 1.10.1 is about 10-15% slower than 1.8.0 on lz4 compression, and about 5% slower on lz4 decompression, even after migrating to the suggested safeDecompressor (https://github.com/apache/spark/pull/53454)
The underlying lz4 library was updated in 1.9.0 so a performance difference is possible.
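For a quick local sanity check outside the Spark benchmark suite, a minimal standalone timing harness might look like the sketch below (this is not the #53453 benchmark; the input path and iteration counts are arbitrary assumptions, and a real comparison should use JMH):

```java
import java.nio.file.Files;
import java.nio.file.Paths;

import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;
import net.jpountz.lz4.LZ4FastDecompressor;
import net.jpountz.lz4.LZ4SafeDecompressor;

public class Lz4DecompressTiming {
    public static void main(String[] args) throws Exception {
        // Any large input works, e.g. a TPCDS .dat file; the path is a placeholder.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));

        LZ4Factory factory = LZ4Factory.fastestInstance();
        LZ4Compressor compressor = factory.fastCompressor();
        byte[] compressed = new byte[compressor.maxCompressedLength(data.length)];
        int compressedLen = compressor.compress(data, 0, data.length, compressed, 0);

        LZ4FastDecompressor fast = factory.fastDecompressor();
        LZ4SafeDecompressor safe = factory.safeDecompressor();
        byte[] out = new byte[data.length];

        // Warm-up so the JIT compiles both paths before timing.
        for (int i = 0; i < 5; i++) {
            fast.decompress(compressed, 0, out, 0, data.length);
            safe.decompress(compressed, 0, compressedLen, out, 0);
        }

        long t0 = System.nanoTime();
        for (int i = 0; i < 50; i++) fast.decompress(compressed, 0, out, 0, data.length);
        long t1 = System.nanoTime();
        for (int i = 0; i < 50; i++) safe.decompress(compressed, 0, compressedLen, out, 0);
        long t2 = System.nanoTime();
        System.out.printf("fast: %d ms, safe: %d ms%n",
            (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
```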
While this is being merged, would you consider "setting spark.io.compression.codec to something other than lz4" as good advice for users potentially affected by the vulnerabilities?
As far as I can tell, only this config key has lz4 as the default. Also, it seems like spark.eventLog.compression.codec is its own setting, even though the former config key mentions it being used for writing event logs. I guess this is historic and an error in the docs?
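If a user does want to switch codecs as a mitigation, it is a one-line configuration change; a minimal sketch (zstd is chosen here purely as an illustration, not a recommendation):

```java
import org.apache.spark.SparkConf;

public class CodecConfig {
    public static void main(String[] args) {
        // spark.io.compression.codec defaults to lz4; any of the other
        // built-in codecs (lzf, snappy, zstd) can be set instead.
        SparkConf conf = new SparkConf()
            .setAppName("codec-example")
            .set("spark.io.compression.codec", "zstd");
        System.out.println(conf.get("spark.io.compression.codec"));
    }
}
```

The same can be done without code changes via `--conf spark.io.compression.codec=zstd` on spark-submit.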