kyuubi icon indicating copy to clipboard operation
kyuubi copied to clipboard

[KYUUBI #3865] Arrow-based result compression support

Open cfmcgrady opened this issue 2 years ago • 5 comments

Why are the changes needed?

to close #3865

Arrow-based result compression support

Local Test

Input dataset

spark.sql(
  """
    |select * from tpcds.sf10.catalog_sales limit 2000000
    |""".stripMargin)
  .write
  .mode("overwrite")
  .save("/tmp/parquet/tpcds/sf10/catalog_sales_2000000")

Query

cat /tmp/b.sql

select * from parquet.`/tmp/parquet/tpcds/sf10/catalog_sales_2000000`;

Beeline command

bin/beeline -u 'jdbc:hive2://${remote-ip}:10009/;' --hiveconf kyuubi.operation.result.codec=arrow --hiveconf kyuubi.operation.incremental.collect=true --hiveconf kyuubi.operation.result.compression.codec=gzip/lz4/zstd/snappy  -f /tmp/b.sql > /dev/null

Result

kyuubi.operation.result.compression.codec time spent (round 1) time spent (round 2)
lz4 116.563 seconds 114.017 seconds
snappy 117.992 seconds 122.061 seconds
zstd 107.774 seconds 108.832 seconds
gzip 143.361 seconds 141.129 seconds
none 183.002 seconds 169.164 seconds

cfmcgrady avatar Nov 29 '22 11:11 cfmcgrady

I suggest removing the snappy support, comparing to other formats, I don't see advantage of snappy.

  • gzip: high compression ratio, used widely, many languages and platforms have the built-in support.
  • lz4: the fastest one, also has pure java implementation.
  • zstd: the future-proof one, excellence in both speed and compression ratio.

pan3793 avatar Nov 30 '22 02:11 pan3793

Codecov Report

Merging #3877 (713d529) into master (599fb3c) will decrease coverage by 0.03%. The diff coverage is 40.44%.

@@             Coverage Diff              @@
##             master    #3877      +/-   ##
============================================
- Coverage     51.70%   51.66%   -0.04%     
  Complexity       13       13              
============================================
  Files           504      506       +2     
  Lines         28857    28943      +86     
  Branches       3975     3982       +7     
============================================
+ Hits          14920    14953      +33     
- Misses        12521    12572      +51     
- Partials       1416     1418       +2     
Impacted Files Coverage Δ
...he/kyuubi/jdbc/hive/KyuubiArrowQueryResultSet.java 0.00% <0.00%> (ø)
.../org/apache/kyuubi/jdbc/hive/KyuubiConnection.java 3.85% <0.00%> (-0.06%) :arrow_down:
...a/org/apache/kyuubi/jdbc/hive/KyuubiStatement.java 26.18% <0.00%> (-0.17%) :arrow_down:
...pache/kyuubi/jdbc/hive/arrow/CompressionCodec.java 0.00% <0.00%> (ø)
...yuubi/jdbc/hive/arrow/CompressionCodecFactory.java 0.00% <0.00%> (ø)
...uubi/engine/spark/operation/ExecuteStatement.scala 76.36% <33.33%> (-1.21%) :arrow_down:
...g/apache/spark/sql/kyuubi/SparkDatasetHelper.scala 83.67% <92.30%> (+7.67%) :arrow_up:
...kyuubi/engine/spark/operation/SparkOperation.scala 84.53% <100.00%> (+0.49%) :arrow_up:
...in/scala/org/apache/kyuubi/config/KyuubiConf.scala 97.44% <100.00%> (-0.06%) :arrow_down:
...apache/kyuubi/engine/JpsApplicationOperation.scala 77.41% <0.00%> (-3.23%) :arrow_down:
... and 6 more

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

codecov-commenter avatar Nov 30 '22 06:11 codecov-commenter

I suggest removing the snappy support, comparing to other formats, I don't see advantage of snappy.

  • gzip: high compression ratio, used widely, many languages and platforms have the built-in support.
  • lz4: the fastest one, also has pure java implementation.
  • zstd: the future-proof one, excellence in both speed and compression ratio.

SGTM, and removed snappy from the build-in support. End users can easily add snappy support as needed.

cfmcgrady avatar Nov 30 '22 06:11 cfmcgrady

I suggest removing the snappy support, comparing to other formats, I don't see advantage of snappy.

  • gzip: high compression ratio, used widely, many languages and platforms have the built-in support.
  • lz4: the fastest one, also has pure java implementation.
  • zstd: the future-proof one, excellence in both speed and compression ratio.

+1 for these 3 supported compression list, and lz4 is the best to be the default compression type.

bowenliang123 avatar Dec 02 '22 16:12 bowenliang123

Forwarding informations from the offline discussion:

  1. we plan to defer this feature after Spark Connect implemented arrow compression, because there are some different ways(e.g. arrow native, stream compression, block compression) to implement compression, we'd better to keep consistent w/ upstream.
  2. aircompressor is good option for client side, the roughly testing shows it has same decompression performance as lz4-java and zstd-jni, but there are some memory allocation issue on lz4 stream decompression.

pan3793 avatar Dec 03 '22 09:12 pan3793