
[SPARK-48359][SQL] Built-in functions for Zstd compression and decompression

Open xi-db opened this issue 1 year ago • 8 comments

What changes were proposed in this pull request?

Some users rely on UDFs for Zstd compression and decompression, which results in poor performance. Providing native built-in functions improves performance by doing the compression and decompression entirely within the JVM.

Now, we are introducing three new built-in functions:

zstd_compress(input: binary [, level: int [, streaming_mode: bool]])

zstd_decompress(input: binary)

try_zstd_decompress(input: binary)

where

  • input: The binary value to compress or decompress.
  • level: Optional integer argument that represents the compression level. The compression level controls the trade-off between compression speed and compression ratio. The default level is 3. Valid values are between 1 and 22, inclusive.
  • streaming_mode: Optional boolean argument that represents whether to compress in streaming mode.

Examples:

> SELECT base64(zstd_compress(repeat("Apache Spark ", 10)));
  KLUv/SCCpQAAaEFwYWNoZSBTcGFyayABABLS+QU=
> SELECT base64(zstd_compress(repeat("Apache Spark ", 10), 3, true));
  KLUv/QBYpAAAaEFwYWNoZSBTcGFyayABABLS+QUBAAA=
> SELECT string(zstd_decompress(unbase64("KLUv/SCCpQAAaEFwYWNoZSBTcGFyayABABLS+QU=")));
  Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark
> SELECT zstd_decompress(zstd_compress("Apache Spark"));
  Apache Spark
> SELECT try_zstd_decompress("invalid input");
  NULL

These three built-in functions are also available in Python and Scala.
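To make the intended semantics concrete, here is a self-contained stand-in sketch in Python. It uses the stdlib zlib codec in place of Zstd (the real implementation would use a Zstd binding on the JVM, not zlib); the function names mirror the proposed SQL functions but are illustrative only, not the actual implementation from this PR.

```python
# Illustrative stand-in only: zlib substitutes for Zstd, since a Zstd
# binding may not be installed. Semantics mirror the proposed functions.
import zlib
from typing import Optional

def zstd_compress(data: bytes, level: int = 3, streaming_mode: bool = False) -> bytes:
    if not 1 <= level <= 22:
        raise ValueError("level must be between 1 and 22 inclusive")
    z_level = min(level, 9)  # zlib levels top out at 9; clamp for the stand-in
    if streaming_mode:
        # Streaming mode: feed the input in chunks through a compressor
        # object, as a streaming compressor would when the total input
        # size is not known up front.
        c = zlib.compressobj(z_level)
        out = bytearray()
        for i in range(0, len(data), 8192):
            out += c.compress(data[i:i + 8192])
        out += c.flush()
        return bytes(out)
    return zlib.compress(data, z_level)

def zstd_decompress(data: bytes) -> bytes:
    # Raises on malformed input, failing the query like the strict variant.
    return zlib.decompress(data)

def try_zstd_decompress(data: bytes) -> Optional[bytes]:
    # try_ variant: return NULL (None) on malformed input instead of failing.
    try:
        return zlib.decompress(data)
    except zlib.error:
        return None
```

Note in particular the try_ variant, which returns NULL (None here) on malformed input instead of failing the whole query, matching the try_zstd_decompress example above.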

Why are the changes needed?

Users no longer need to use UDFs for Zstd compression and decompression; they can directly use built-in SQL functions that run within the JVM.

Does this PR introduce any user-facing change?

Yes, three SQL functions are introduced: zstd_compress, zstd_decompress, and try_zstd_decompress.

How was this patch tested?

Added new UT and E2E tests.

Was this patch authored or co-authored using generative AI tooling?

No.

xi-db avatar May 20 '24 15:05 xi-db

Instead of adding (de)compression functions for different codecs, how about adding the compression and decompression directly, like,

  • https://dev.mysql.com/doc/refman/8.0/en/encryption-functions.html#function_compress
  • https://learn.microsoft.com/en-us/sql/t-sql/functions/compress-transact-sql?view=sql-server-ver16

yaooqinn avatar May 21 '24 04:05 yaooqinn

Instead of adding (de)compression functions for different codecs, how about adding the compression and decompression directly, like,

  • https://dev.mysql.com/doc/refman/8.0/en/encryption-functions.html#function_compress
  • https://learn.microsoft.com/en-us/sql/t-sql/functions/compress-transact-sql?view=sql-server-ver16

Hi @yaooqinn, yes, that can be one way of implementing them. However, based on the following,

  • The compress functions in MySQL and SQL Server accept only one argument, so users can't specify the compression algorithm or the compression level. Besides, MySQL does not document which algorithm its compress uses, and SQL Server's compress only uses gzip, which differs from our use case. Reusing the same name for a compress function in Apache Spark could therefore confuse users who are familiar with those databases.
  • Looking at our SQL Function Reference, there is no precedent for integrating multiple algorithms into one SQL function, and doing so might make the functions more complicated to use. Following the naming convention of aes_encrypt, url_encode and regexp_replace, this function is named zstd_compress, with the algorithm name included.

Thus, the functions are named zstd_compress, zstd_decompress, and try_zstd_decompress in this PR, explicitly showing the algorithm they use, to make them simple to understand and use.

xi-db avatar May 21 '24 08:05 xi-db

The compress functions in MySQL and SQL Server accept only one argument, so users can't specify the compression algorithm or the compression level. Besides, MySQL does not document which algorithm its compress uses, and SQL Server's compress only uses gzip, which differs from our use case. Reusing the same name for a compress function in Apache Spark could therefore confuse users who are familiar with those databases.

A parameter with a default value can achieve this. The default value can be either hard-coded or made configurable via a session conf.

If zstd is ever replaced or dropped, we'd have to remove these functions first, causing a breaking change. I understand that's unlikely to happen for 'zstd'. But if we add compression functions in the same naming pattern for other existing compression codecs, does the possibility increase? And when we add a new codec, do we need to add similar functions for self-consistency? Will that increase the maintenance cost?

Looking at our SQL Function Reference, there is no precedent for integrating multiple algorithms into one SQL function, and doing so might make the functions more complicated to use. Following the naming convention of aes_encrypt, url_encode and regexp_replace, this function is named zstd_compress, with the algorithm name included.

Most of the existing SQL functions are derived from other systems: Apache Hive, Postgres, MySQL, etc. AFAIK, Spark does not currently have such a naming convention of its own; 'supported by many other modern platforms' or 'defined in ANSI' are the rules we have mostly used for adding new SQL functions.

yaooqinn avatar May 21 '24 09:05 yaooqinn
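For comparison, the generic single-function design suggested above (one compress/decompress pair with a codec parameter, defaulting via configuration) might look like the following illustrative sketch. The codec registry, function names, and default are assumptions for the purpose of the comparison, with Python stdlib codecs standing in for the codecs Spark actually ships.

```python
# Illustrative sketch of a generic compress(input, codec, level) design.
# Stdlib codecs stand in for the codecs Spark actually ships (zstd, lz4,
# snappy, gzip, ...); names and defaults here are assumptions.
import bz2
import lzma
import zlib
from typing import Callable, Dict, Tuple

_CODECS: Dict[str, Tuple[Callable, Callable]] = {
    "zlib": (lambda d, lvl: zlib.compress(d, lvl), zlib.decompress),
    "bz2":  (lambda d, lvl: bz2.compress(d, lvl), bz2.decompress),
    "lzma": (lambda d, lvl: lzma.compress(d, preset=lvl), lzma.decompress),
}

DEFAULT_CODEC = "zlib"  # in Spark this could come from a session conf

def compress(data: bytes, codec: str = DEFAULT_CODEC, level: int = 6) -> bytes:
    try:
        compressor, _ = _CODECS[codec]
    except KeyError:
        raise ValueError(f"unknown codec: {codec!r}") from None
    return compressor(data, level)

def decompress(data: bytes, codec: str = DEFAULT_CODEC) -> bytes:
    _, decompressor = _CODECS[codec]
    return decompressor(data)
```

A default codec read from session configuration keeps the one-argument call compress(data) working, while still letting users pick a codec explicitly; the trade-off against per-codec functions (readability, auto-completion) is exactly what the thread below debates.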

cc @HyukjinKwon Could you take a look? Thanks.

xi-db avatar Aug 27 '24 16:08 xi-db

@yaooqinn has been reviewing this closely so I defer to him

HyukjinKwon avatar Aug 30 '24 01:08 HyukjinKwon

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions[bot] avatar Dec 09 '24 00:12 github-actions[bot]

I like specific functions for specific codecs, as it's easier to use. We can define an extra parameter for specifying the codec, but it's less readable and less friendly to editors with auto-completion. Plus, it's unlikely to support dynamic codec names, so an extra parameter won't bring much value.

cloud-fan avatar Jun 12 '25 22:06 cloud-fan

@xi-db can you rebase this PR?

cloud-fan avatar Jun 12 '25 22:06 cloud-fan

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions[bot] avatar Oct 17 '25 00:10 github-actions[bot]