ORC-XXX: Support orc.compression.zstd.workers
### What changes were proposed in this pull request?

### Why are the changes needed?

Evaluating if this has any benefits in ORC.

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
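For context, the proposed option would presumably be set like any other ORC writer option, through the Hadoop configuration. A hypothetical sketch (only the key `orc.compression.zstd.workers` comes from this PR's title; the value and surrounding keys are illustrative):

```xml
<!-- Hypothetical core-site.xml / hive-site.xml fragment: asks the zstd codec
     to use 4 worker threads when writing ORC files. A value of 0 (the zstd
     default) means single-threaded, blocking compression. -->
<property>
  <name>orc.compress</name>
  <value>ZSTD</value>
</property>
<property>
  <name>orc.compression.zstd.workers</name>
  <value>4</value>
</property>
```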
Thanks @dongjoon-hyun for doing this. I also wanted to introduce this configuration after the zstd-jni merge, since Spark and Parquet have similar configurations.
Ya, indeed.
BTW, it seems that there is no perf gain with this so far. Interesting.
> it seems that there is no perf gain with this so far
Based on verification in a production environment with this PR, I tested `orc.compression.zstd.workers` values of 0, 6, 15, and 16, and there seems to be no difference.
That said, Parquet also provides options for the number of zstd workers:
https://github.com/apache/parquet-mr/blob/c82d5b471a558124b03e67759038661a046f5938/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/codec/ZstandardCodec.java#L53-L54
https://facebook.github.io/zstd/zstd_manual.html
```c
ZSTD_c_nbWorkers=400, /* Select how many threads will be spawned to compress in parallel.
                       * When nbWorkers >= 1, triggers asynchronous mode when invoking ZSTD_compressStream*() :
                       * ZSTD_compressStream*() consumes input and flush output if possible, but immediately gives back control to caller,
                       * while compression is performed in parallel, within worker thread(s).
                       * (note : a strong exception to this rule is when first invocation of ZSTD_compressStream2() sets ZSTD_e_end :
                       *  in which case, ZSTD_compressStream2() delegates to ZSTD_compress2(), which is always a blocking call).
                       * More workers improve speed, but also increase memory usage.
                       * Default value is `0`, aka "single-threaded mode" : no worker is spawned,
                       * compression is performed inside Caller's thread, and all invocations are blocking */
```
Thank you for double-checking. Ya, it seems that our implementation has some limitations or a bug.
Apache Spark also has a ZStandardCodec implementation based on this zstd-jni, and it shows a 30% to 40% improvement in the micro-benchmark.
- https://github.com/apache/spark/blob/39f8e1a5953b5897f893151d24dc585a80c0c8a0/core/benchmarks/ZStandardBenchmark-results.txt#L27-L47
I'm still digging into this because I believe this should be a part of Apache ORC 2.0.0.