[Opt](orc) Optimize merged IO when the ORC reader reads multiple tiny stripes.
Proposed changes
When reading ORC files, we may encounter files in which each stripe is very small in bytes but the number of stripes is very large.
This PR introduces three session variables, orc_tiny_stripe_threshold, orc_once_max_read_size, and orc_max_merge_distance, to optimize IO for this scenario.
A stripe whose byte size is less than orc_tiny_stripe_threshold is considered a tiny stripe. Multiple tiny stripes are read with merged IO, governed by orc_once_max_read_size and orc_max_merge_distance. orc_once_max_read_size is the maximum size of one merged IO. You should not set orc_once_max_read_size smaller than orc_tiny_stripe_threshold, although doing so does not raise an error. Tiny stripes are not necessarily contiguous, so when the distance between two tiny stripes is greater than orc_max_merge_distance, they are not merged into one IO.
To disable this optimization, set orc_tiny_stripe_threshold = 0.
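The merge rule described above can be sketched as follows. This is a minimal illustration in Python, not the code in this PR: the function name `merge_tiny_stripes` and the `(offset, length)` representation are hypothetical, and it assumes a non-tiny stripe is always read on its own and ends the current merged range.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (offset, length) in bytes; hypothetical representation


def merge_tiny_stripes(
    stripes: List[Range],
    tiny_threshold: int = 8 * 1024 * 1024,
    once_max_read_size: int = 8 * 1024 * 1024,
    max_merge_distance: int = 1024 * 1024,
) -> List[Range]:
    """Greedily merge tiny stripes into larger IO ranges (illustrative sketch)."""
    merged: List[Range] = []
    current = None  # (start, end) of the merged range being built, or None

    def flush() -> None:
        nonlocal current
        if current is not None:
            merged.append((current[0], current[1] - current[0]))
            current = None

    for offset, length in sorted(stripes):
        end = offset + length
        if length >= tiny_threshold:
            # Not a tiny stripe: close the open range and read this stripe alone.
            # With tiny_threshold = 0 every stripe takes this path (no merging).
            flush()
            merged.append((offset, length))
            continue
        if current is not None:
            gap = offset - current[1]
            span = end - current[0]
            if gap <= max_merge_distance and span <= once_max_read_size:
                # Close enough and still within the size cap: extend the range.
                current = (current[0], end)
                continue
            flush()
        current = (offset, end)

    flush()
    return merged
```

With a threshold of 1000 bytes, a size cap of 1000 bytes, and a merge distance of 150 bytes, the stripes `(0, 100)`, `(100, 100)`, `(300, 100)` collapse into one IO of 400 bytes, while a stripe at offset 5000 stays separate because the gap exceeds the merge distance.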
Default parameters:
orc_tiny_stripe_threshold = 8388608 (8 MiB)
orc_once_max_read_size = 8388608 (8 MiB)
orc_max_merge_distance = 1048576 (1 MiB)
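As a rough back-of-the-envelope check of what the defaults mean, the arithmetic below estimates the number of merged IOs for a hypothetical file. The file shape (1000 contiguous stripes of 64 KiB each) is an assumption chosen for illustration, not a measurement from this PR.

```python
import math

MIB = 1024 * 1024

# Default from this PR.
once_max_read_size = 8 * MIB

# Hypothetical file: 1000 contiguous stripes of 64 KiB each.
# Every stripe is below the 8 MiB tiny-stripe threshold and the gap between
# neighbours is 0, so only the size cap limits merging.
stripe_size = 64 * 1024
stripe_count = 1000

stripes_per_io = once_max_read_size // stripe_size
merged_ios = math.ceil(stripe_count / stripes_per_io)

print(stripes_per_io, merged_ios)
```

Under these assumptions, 128 stripes fit into one merged IO, so the 1000 stripes need 8 merged IOs instead of 1000 separate reads.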
We also add relevant profile counters so that these parameters can be tuned to optimize reading.
RangeCacheFileReader:
CacheRefreshCount: how many IOs are merged
ReadToCacheBytes: how much data is actually read after merging
ReadToCacheTime: how long it takes to read data after merging
RequestBytes: how many bytes the apache-orc library actually needs to read from the orc file
RequestIO: how many times the apache-orc library calls this read interface
RequestTime: how long the apache-orc library spends calling this read interface
Note that RangeCacheFileReader is a wrapper around the reader that actually reads the data, such as the hdfs reader. Strictly speaking, CacheRefreshCount is therefore not equal to the number of IOs issued to hdfs, because the hdfs reader may not be able to read all the requested data in one call.
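To make the relationship between the counters concrete, here is a small Python sketch of a range-caching wrapper. It is not the RangeCacheFileReader added by this PR: the class name `RangeCacheReader`, the `read_at` callback, and the pass-through behaviour for reads outside the merged ranges are all assumptions made for illustration, and the two time counters are left out to keep the example deterministic.

```python
from typing import Callable, Dict, List, Optional, Tuple

Range = Tuple[int, int]  # (offset, length) in bytes; hypothetical representation


class RangeCacheReader:
    """Serves small reads from one cached merged range (illustrative sketch)."""

    def __init__(self, read_at: Callable[[int, int], bytes], ranges: List[Range]):
        self._read_at = read_at        # inner reader, e.g. a wrapper over hdfs
        self._ranges = sorted(ranges)  # merged IO ranges computed beforehand
        self._cache_offset = 0
        self._cache = b""
        self.counters: Dict[str, int] = {
            "CacheRefreshCount": 0,
            "ReadToCacheBytes": 0,
            "RequestBytes": 0,
            "RequestIO": 0,
        }

    def _in_cache(self, offset: int, length: int) -> bool:
        cache_end = self._cache_offset + len(self._cache)
        return self._cache_offset <= offset and offset + length <= cache_end

    def _find_range(self, offset: int, length: int) -> Optional[Range]:
        for start, size in self._ranges:
            if start <= offset and offset + length <= start + size:
                return (start, size)
        return None

    def read(self, offset: int, length: int) -> bytes:
        # Every call from the ORC library counts as one request.
        self.counters["RequestIO"] += 1
        self.counters["RequestBytes"] += length
        if not self._in_cache(offset, length):
            rng = self._find_range(offset, length)
            if rng is None:
                # Assumption: reads outside any merged range bypass the cache.
                return self._read_at(offset, length)
            # Refresh the cache with one merged read from the inner reader.
            self._cache_offset = rng[0]
            self._cache = self._read_at(rng[0], rng[1])
            self.counters["CacheRefreshCount"] += 1
            self.counters["ReadToCacheBytes"] += len(self._cache)
        start = offset - self._cache_offset
        return self._cache[start:start + length]
```

In this sketch, a large gap between RequestIO and CacheRefreshCount means many small requests were served per merged read, while ReadToCacheBytes much larger than RequestBytes suggests the merge distance is pulling in a lot of unneeded data.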
This PR also involves changes to the apache-orc third-party library: https://github.com/apache/doris-thirdparty/pull/244. Reference implementation: https://github.com/trinodb/trino/blob/master/lib/trino-orc/src/main/java/io/trino/orc/OrcDataSourceUtils.java#L36
Summary:
set orc_tiny_stripe_threshold = xxx;
set orc_once_max_read_size = xxx;
set orc_max_merge_distance = xxx;
# xxx is the size in bytes
clang-tidy review says "All clean, LGTM! :+1:"
run buildall
TeamCity be ut coverage result: Function Coverage: 37.44% (9711/25935) Line Coverage: 28.72% (80617/280718) Region Coverage: 28.16% (41717/148147) Branch Coverage: 24.75% (21211/85708) Coverage Report: http://coverage.selectdb-in.cc/coverage/a2b2845ee2287aa55ea3ebe16723c9816e6b93ae_a2b2845ee2287aa55ea3ebe16723c9816e6b93ae/report/index.html
TeamCity be ut coverage result: Function Coverage: 37.44% (9710/25935) Line Coverage: 28.72% (80619/280723) Region Coverage: 28.16% (41717/148149) Branch Coverage: 24.75% (21213/85710) Coverage Report: http://coverage.selectdb-in.cc/coverage/0b64a8f60b8c440fa201b7d439bb879ede6e9970_0b64a8f60b8c440fa201b7d439bb879ede6e9970/report/index.html
TeamCity be ut coverage result: Function Coverage: 37.48% (9712/25912) Line Coverage: 28.71% (80583/280655) Region Coverage: 28.17% (41711/148095) Branch Coverage: 24.72% (21187/85692) Coverage Report: http://coverage.selectdb-in.cc/coverage/e7e92353292e859cd3cbd992ca01afc34be54bb2_e7e92353292e859cd3cbd992ca01afc34be54bb2/report/index.html
TeamCity be ut coverage result: Function Coverage: 37.47% (9715/25928) Line Coverage: 28.72% (80629/280702) Region Coverage: 28.16% (41715/148119) Branch Coverage: 24.72% (21187/85698) Coverage Report: http://coverage.selectdb-in.cc/coverage/ecdec85325f788c59162f24fb0f5f292bd15b931_ecdec85325f788c59162f24fb0f5f292bd15b931/report/index.html
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
TeamCity be ut coverage result: Function Coverage: 37.47% (9716/25928) Line Coverage: 28.72% (80616/280710) Region Coverage: 28.15% (41702/148118) Branch Coverage: 24.72% (21181/85698) Coverage Report: http://coverage.selectdb-in.cc/coverage/56c7be57881dd36b492cae96971f9cdc8af60706_56c7be57881dd36b492cae96971f9cdc8af60706/report/index.html
TeamCity be ut coverage result: Function Coverage: 37.48% (9718/25931) Line Coverage: 28.74% (80609/280453) Region Coverage: 28.17% (41695/148016) Branch Coverage: 24.73% (21174/85638) Coverage Report: http://coverage.selectdb-in.cc/coverage/e801f98bba27e3d790647995f0a624fd049fb168_e801f98bba27e3d790647995f0a624fd049fb168/report/index.html