
[WIP][SPARK-54571][CORE] Use LZ4 safeDecompressor

Open dbtsai opened this issue 3 weeks ago • 16 comments

What changes were proposed in this pull request?

In recent lz4-java versions, safeDecompressor has become highly optimized and can be as fast as, or sometimes even faster than, fastDecompressor, so it makes sense to switch to safeDecompressor.
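To illustrate the API difference being discussed, here is a minimal sketch against lz4-java's public net.jpountz API (the demo class, method names, and sample data are made up for illustration; this is not Spark code):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;
import net.jpountz.lz4.LZ4FastDecompressor;
import net.jpountz.lz4.LZ4SafeDecompressor;

// Hypothetical demo class, not part of Spark or lz4-java.
public class SafeVsFastDemo {

    // Decompress with the unchecked decompressor: the caller must supply the
    // exact decompressed length, and malformed input is not fully validated.
    static byte[] viaFast(byte[] compressed, int originalLength) {
        LZ4FastDecompressor fast = LZ4Factory.fastestInstance().fastDecompressor();
        return fast.decompress(compressed, originalLength);
    }

    // Decompress with the bounds-checked decompressor: it takes the compressed
    // length plus a destination buffer, and validates the input as it goes.
    static byte[] viaSafe(byte[] compressed, int maxDecompressedLength) {
        LZ4SafeDecompressor safe = LZ4Factory.fastestInstance().safeDecompressor();
        byte[] dest = new byte[maxDecompressedLength];
        int written = safe.decompress(compressed, 0, compressed.length, dest, 0);
        return Arrays.copyOf(dest, written);
    }

    public static void main(String[] args) {
        byte[] original = "safe vs fast decompressor round trip".getBytes(StandardCharsets.UTF_8);
        LZ4Compressor compressor = LZ4Factory.fastestInstance().fastCompressor();
        byte[] compressed = compressor.compress(original);
        System.out.println(Arrays.equals(viaFast(compressed, original.length), original)
                && Arrays.equals(viaSafe(compressed, original.length), original));
    }
}
```

The key behavioral difference is that the safe variant is told how many compressed bytes exist and how large the destination may be, so a corrupted or malicious frame cannot drive reads or writes out of bounds.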

Why are the changes needed?

Switching to .safeDecompressor() is recommended: it is not vulnerable and provides better performance, per https://sites.google.com/sonatype.com/vulnerabilities/cve-2025-12183

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests

Was this patch authored or co-authored using generative AI tooling?

No

dbtsai avatar Dec 02 '25 22:12 dbtsai

It's more involved than I thought, as LZ4BlockInputStream doesn't take a safeDecompressor. I will take a deeper look tomorrow.

dbtsai avatar Dec 03 '25 00:12 dbtsai

It seems that the fix for CVE-2025-12183 wasn't implemented until version 1.8.1, but Spark is still using version 1.8.0.

https://github.com/yawkat/lz4-java/releases

LuciferYang avatar Dec 03 '25 08:12 LuciferYang

Note that LZ4BlockInputStream does not support safeDecompressor in lz4-java 1.8.1. If you upgrade to that version, it will still work and be secure, but performance will be much worse than in 1.8.0.

lz4-java 1.10.0 introduces a new builder for LZ4BlockInputStream that accepts a safeDecompressor.
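For context, a sketch of the pre-1.10 constraint (assuming lz4-java's long-standing public constructors; the 1.10.0 builder itself is not shown because its exact signature isn't quoted in this thread, and the demo class is hypothetical):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import net.jpountz.lz4.LZ4BlockInputStream;
import net.jpountz.lz4.LZ4BlockOutputStream;
import net.jpountz.lz4.LZ4Factory;

// Hypothetical demo class illustrating the pre-1.10 constructor surface.
public class BlockStreamDemo {

    // Round-trips a payload through the block streams. The decompressor
    // parameter is typed LZ4FastDecompressor; before 1.10 there is no overload
    // accepting an LZ4SafeDecompressor, which is the gap the new builder fills.
    static byte[] roundTrip(byte[] payload) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (LZ4BlockOutputStream out = new LZ4BlockOutputStream(sink)) {
            out.write(payload);
        }
        try (LZ4BlockInputStream in = new LZ4BlockInputStream(
                new ByteArrayInputStream(sink.toByteArray()),
                LZ4Factory.fastestInstance().fastDecompressor())) {
            return in.readAllBytes();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = "block stream round trip".getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(roundTrip(payload), StandardCharsets.UTF_8));
    }
}
```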

yawkat avatar Dec 03 '25 14:12 yawkat

It is published, but only under the new group id.

yawkat avatar Dec 03 '25 19:12 yawkat

@yawkat, under which group id is 1.10.0 published?

SteNicholas avatar Dec 04 '25 07:12 SteNicholas

We need to change the dependency to:

<dependency>
    <groupId>at.yawk.lz4</groupId>
    <artifactId>lz4-java</artifactId>
    <version>1.10.0</version>
</dependency>

https://repo.maven.apache.org/maven2/at/yawk/lz4/lz4-java/


LuciferYang avatar Dec 04 '25 07:12 LuciferYang

Thank you for the updated info, @yawkat , @LuciferYang .

Could you update this PR, @dbtsai ?

dongjoon-hyun avatar Dec 04 '25 16:12 dongjoon-hyun

To all, in order to help this PR, I made an independent PR for the dependency upgrade.

  • https://github.com/apache/spark/pull/53327

dongjoon-hyun avatar Dec 04 '25 22:12 dongjoon-hyun

We upgraded to 1.10.0 a few minutes ago. Could you rebase this PR onto the master branch once more, @dbtsai ?

  • https://github.com/apache/spark/pull/53327

dongjoon-hyun avatar Dec 05 '25 03:12 dongjoon-hyun

I recommend you wait a few hours before releasing this. Another (smaller, unrelated) CVE has been found in lz4-java.

yawkat avatar Dec 05 '25 06:12 yawkat

CVE-2025-66566 has been published and fixed in 1.10.1. I suggest you move to that version. Cloudflare seems to be having some trouble that is breaking Maven Central at the moment, though.

yawkat avatar Dec 05 '25 09:12 yawkat

I recommend you wait a few hours before releasing this. Another (smaller, unrelated) CVE has been found in lz4-java.

Just FYI, we have no intention of hurrying this, @yawkat . To be safe, this will be tested in the master branch for the next few months, in the Apache Spark 4.2.0 timeframe. We are currently voting on Apache Spark 4.1.0, and it's too late for this kind of risky dependency change there.

That's the main reason why the LZ4 1.10.0 PR is only in the master branch. The dependency and code changes merely lay the groundwork for further testing across various workloads.

  • #53327

dongjoon-hyun avatar Dec 05 '25 17:12 dongjoon-hyun

Gentle ping once more, @dbtsai .

dongjoon-hyun avatar Dec 07 '25 18:12 dongjoon-hyun

Gentle ping, @dbtsai .

dongjoon-hyun avatar Dec 09 '25 22:12 dongjoon-hyun

The PR description mentions that safeDecompressor is fast enough. While the CVE is reason enough to switch, can we link benchmark results in the PR description to back this up?

mridulm avatar Dec 10 '25 05:12 mridulm

I am not aware of Spark benchmarks, but as of 1.8.1, safeDecompressor is substantially faster than fastDecompressor. In earlier versions, the difference was minor.

yawkat avatar Dec 10 '25 05:12 yawkat

@yawkat, how about the performance for 1.10.0? @dongjoon-hyun has already bumped to 1.10.0.

SteNicholas avatar Dec 11 '25 05:12 SteNicholas

Performance between 1.8.1 and 1.10.1 has not changed substantially.

yawkat avatar Dec 11 '25 05:12 yawkat

@yawkat I am not questioning whether it is faster; I am sure it is, else @dbtsai wouldn't be raising this PR :) Given the PR description, adding a reference that makes this clear would help.

mridulm avatar Dec 11 '25 18:12 mridulm

@mridulm I found a perf report at https://github.com/yawkat/lz4-java/pull/3#issuecomment-3599886694, though it does not include the underlying data.

pan3793 avatar Dec 12 '25 07:12 pan3793

@mridulm @yawkat @dbtsai @dongjoon-hyun @SteNicholas I created #53453 to add an lz4 benchmark based on TPCDS catalog_sales.dat (SF1), which is about 283 MB. My test results show lz4-java 1.10.1 is about 10~15% slower than 1.8.0 on lz4 compression, and about 5% slower on lz4 decompression even after migrating to the suggested safeDecompressor (https://github.com/apache/spark/pull/53454)
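For anyone wanting a quick local sanity check, a rough micro-sketch of comparing the two decompressors follows. This is not the #53453 benchmark: the class is hypothetical, the input is synthetic, and absolute numbers from a loop like this are unreliable without a proper harness such as JMH.

```java
import java.util.Arrays;
import java.util.Random;

import net.jpountz.lz4.LZ4Factory;
import net.jpountz.lz4.LZ4FastDecompressor;
import net.jpountz.lz4.LZ4SafeDecompressor;

// Hypothetical micro-benchmark sketch, not the Spark benchmark from #53453.
public class DecompressMicrobench {

    // Times `iters` decompressions with each decompressor over the same input,
    // verifying correctness, and returns {fastNanos, safeNanos}.
    static long[] time(byte[] data, int iters) {
        LZ4Factory factory = LZ4Factory.fastestInstance();
        byte[] compressed = factory.fastCompressor().compress(data);
        LZ4FastDecompressor fast = factory.fastDecompressor();
        LZ4SafeDecompressor safe = factory.safeDecompressor();
        byte[] dest = new byte[data.length];

        long t0 = System.nanoTime();
        for (int i = 0; i < iters; i++) fast.decompress(compressed, 0, dest, 0, data.length);
        long fastNanos = System.nanoTime() - t0;
        if (!Arrays.equals(dest, data)) throw new IllegalStateException("fast mismatch");

        Arrays.fill(dest, (byte) 0);
        t0 = System.nanoTime();
        for (int i = 0; i < iters; i++) safe.decompress(compressed, 0, compressed.length, dest, 0);
        long safeNanos = System.nanoTime() - t0;
        if (!Arrays.equals(dest, data)) throw new IllegalStateException("safe mismatch");

        return new long[] { fastNanos, safeNanos };
    }

    public static void main(String[] args) {
        // Synthetic, mildly compressible input; a stand-in for real data files.
        byte[] data = new byte[8 << 20];
        Random rnd = new Random(42);
        for (int i = 0; i < data.length; i++) data[i] = (byte) (i % 61 + rnd.nextInt(3));
        time(data, 5); // warm-up pass
        long[] t = time(data, 20);
        System.out.printf("fast: %d ms, safe: %d ms%n", t[0] / 1_000_000, t[1] / 1_000_000);
    }
}
```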

pan3793 avatar Dec 12 '25 08:12 pan3793

The underlying lz4 library was updated in 1.9.0, so a performance difference is possible.

yawkat avatar Dec 12 '25 09:12 yawkat

While this is being merged, would you consider setting spark.io.compression.codec to something other than lz4 to be good advice for users potentially affected by these vulnerabilities?

As far as I can tell, this is the only config key that defaults to LZ4. Also, spark.eventLog.compression.codec seems to be its own setting, even though the former key's documentation mentions it being used for writing event logs. I guess this is historic and an error in the docs?
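For reference, a sketch of what that mitigation would look like in spark-defaults.conf (zstd is chosen arbitrarily here as the example replacement; Spark's built-in codec values are lz4, lzf, snappy, and zstd):

```properties
# spark-defaults.conf (sketch): route block compression and event-log
# compression away from lz4 until the safeDecompressor change lands.
spark.io.compression.codec        zstd
spark.eventLog.compression.codec  zstd
```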

Dzeri96 avatar Dec 19 '25 11:12 Dzeri96