kyuubi
kyuubi copied to clipboard
[KYUUBI #3865] Arrow-based result compression support
Why are the changes needed?
to close #3865
Arrow-based result compression support
Local Test
Input dataset
spark.sql(
"""
|select * from tpcds.sf10.catalog_sales limit 2000000
|""".stripMargin)
.write
.mode("overwrite")
.save("/tmp/parquet/tpcds/sf10/catalog_sales_2000000")
Query
cat /tmp/b.sql
select * from parquet.`/tmp/parquet/tpcds/sf10/catalog_sales_2000000`;
Beeline command
bin/beeline -u 'jdbc:hive2://${remote-ip}:10009/;' --hiveconf kyuubi.operation.result.codec=arrow --hiveconf kyuubi.operation.incremental.collect=true --hiveconf kyuubi.operation.result.compression.codec=gzip/lz4/zstd/snappy -f /tmp/b.sql > /dev/null
Result
kyuubi.operation.result.compression.codec | time spent (round 1) | time spent (round 2) |
---|---|---|
lz4 | 116.563 seconds | 114.017 seconds |
snappy | 117.992 seconds | 122.061 seconds |
zstd | 107.774 seconds | 108.832 seconds |
gzip | 143.361 seconds | 141.129 seconds |
none | 183.002 seconds | 169.164 seconds |
I suggest removing the snappy support, comparing to other formats, I don't see advantage of snappy.
- gzip: high compression ratio, used widely, many languages and platforms have the built-in support.
- lz4: the fastest one, also has pure java implementation.
- zstd: the future-proof one, excellence in both speed and compression ratio.
Codecov Report
Merging #3877 (713d529) into master (599fb3c) will decrease coverage by
0.03%
. The diff coverage is40.44%
.
@@ Coverage Diff @@
## master #3877 +/- ##
============================================
- Coverage 51.70% 51.66% -0.04%
Complexity 13 13
============================================
Files 504 506 +2
Lines 28857 28943 +86
Branches 3975 3982 +7
============================================
+ Hits 14920 14953 +33
- Misses 12521 12572 +51
- Partials 1416 1418 +2
Impacted Files | Coverage Δ | |
---|---|---|
...he/kyuubi/jdbc/hive/KyuubiArrowQueryResultSet.java | 0.00% <0.00%> (ø) |
|
.../org/apache/kyuubi/jdbc/hive/KyuubiConnection.java | 3.85% <0.00%> (-0.06%) |
:arrow_down: |
...a/org/apache/kyuubi/jdbc/hive/KyuubiStatement.java | 26.18% <0.00%> (-0.17%) |
:arrow_down: |
...pache/kyuubi/jdbc/hive/arrow/CompressionCodec.java | 0.00% <0.00%> (ø) |
|
...yuubi/jdbc/hive/arrow/CompressionCodecFactory.java | 0.00% <0.00%> (ø) |
|
...uubi/engine/spark/operation/ExecuteStatement.scala | 76.36% <33.33%> (-1.21%) |
:arrow_down: |
...g/apache/spark/sql/kyuubi/SparkDatasetHelper.scala | 83.67% <92.30%> (+7.67%) |
:arrow_up: |
...kyuubi/engine/spark/operation/SparkOperation.scala | 84.53% <100.00%> (+0.49%) |
:arrow_up: |
...in/scala/org/apache/kyuubi/config/KyuubiConf.scala | 97.44% <100.00%> (-0.06%) |
:arrow_down: |
...apache/kyuubi/engine/JpsApplicationOperation.scala | 77.41% <0.00%> (-3.23%) |
:arrow_down: |
... and 6 more |
:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more
I suggest removing the snappy support, comparing to other formats, I don't see advantage of snappy.
- gzip: high compression ratio, used widely, many languages and platforms have the built-in support.
- lz4: the fastest one, also has pure java implementation.
- zstd: the future-proof one, excellence in both speed and compression ratio.
SGTM, and removed snappy
from the build-in support. End users can easily add snappy
support as needed.
I suggest removing the snappy support, comparing to other formats, I don't see advantage of snappy.
- gzip: high compression ratio, used widely, many languages and platforms have the built-in support.
- lz4: the fastest one, also has pure java implementation.
- zstd: the future-proof one, excellence in both speed and compression ratio.
+1 for these 3 supported compression list, and lz4
is the best to be the default compression type.
Forwarding informations from the offline discussion:
- we plan to defer this feature after Spark Connect implemented arrow compression, because there are some different ways(e.g. arrow native, stream compression, block compression) to implement compression, we'd better to keep consistent w/ upstream.
- aircompressor is good option for client side, the roughly testing shows it has same decompression performance as lz4-java and zstd-jni, but there are some memory allocation issue on lz4 stream decompression.