
[Opt](orc)Optimize the merge io when orc reader read multiple tiny stripes.

Open hubgeter opened this issue 1 year ago • 11 comments

Proposed changes

When reading ORC files, we may encounter files that contain a very large number of stripes, each of which is very small in bytes.

This PR introduces three session variables, orc_tiny_stripe_threshold, orc_once_max_read_size, and orc_max_merge_distance, to optimize IO for this scenario.

A stripe whose byte size is less than orc_tiny_stripe_threshold is considered a tiny stripe. Reads of multiple tiny stripes are merged into larger IOs according to orc_once_max_read_size and orc_max_merge_distance. orc_once_max_read_size is the maximum size of one merged IO. You should not set orc_once_max_read_size lower than orc_tiny_stripe_threshold, although doing so does not raise an error. Tiny stripes are not necessarily contiguous, so when the distance between two tiny stripes is greater than orc_max_merge_distance, they are not merged into one IO.

If you don't want to use this optimization, you can set orc_tiny_stripe_threshold = 0.
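The merge rule described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the names `Range` and `merge_tiny_stripes` are hypothetical, and the sketch assumes stripes do not overlap. How non-tiny stripes are read is not described here, so the sketch simply leaves them out of the result.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Range:
    """A byte range in the file (hypothetical helper type)."""
    offset: int
    length: int

    @property
    def end(self) -> int:
        return self.offset + self.length


def merge_tiny_stripes(stripes: List[Range],
                       tiny_stripe_threshold: int,
                       once_max_read_size: int,
                       max_merge_distance: int) -> List[Range]:
    """Group tiny stripes into merged read ranges.

    A stripe is tiny when its length is below tiny_stripe_threshold, so a
    threshold of 0 selects nothing and effectively disables merging.
    """
    tiny = sorted((s for s in stripes if s.length < tiny_stripe_threshold),
                  key=lambda s: s.offset)
    merged: List[Range] = []
    for stripe in tiny:
        if merged:
            last = merged[-1]
            gap = stripe.offset - last.end
            new_length = stripe.end - last.offset
            # Merge only if the stripes are close enough and the merged IO
            # stays within the size limit.
            if gap <= max_merge_distance and new_length <= once_max_read_size:
                merged[-1] = Range(last.offset, new_length)
                continue
        merged.append(stripe)
    return merged
```

For example, with a threshold of 1000 bytes, a maximum read size of 250 bytes, and a merge distance of 50 bytes, the stripes at offsets 0 and 100 (100 bytes each) become one 200-byte read, while a 100-byte stripe at offset 5000 is too far away and gets its own read.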

Default parameters:

orc_tiny_stripe_threshold = 8388608 (8 MiB)
orc_once_max_read_size = 8388608 (8 MiB)
orc_max_merge_distance = 1048576 (1 MiB)

We also add profile counters for this optimization so that the parameters can be tuned. They are reported under RangeCacheFileReader:

  1. CacheRefreshCount: the number of merged IOs performed
  2. ReadToCacheBytes: the number of bytes actually read after merging
  3. ReadToCacheTime: the time spent reading data after merging
  4. RequestBytes: the number of bytes the apache-orc library actually needs in order to read the ORC file
  5. RequestIO: the number of times the apache-orc library calls this read interface
  6. RequestTime: the time the apache-orc library spends in calls to this read interface
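One way to use these counters when tuning is to compare what was read with what was requested. The sketch below is only a suggestion: the function `summarize_profile` and the three derived ratios are not part of this PR, and the counter values in the usage note are made up for illustration.

```python
from typing import Dict


def summarize_profile(counters: Dict[str, int]) -> Dict[str, float]:
    """Derive tuning ratios from RangeCacheFileReader counters (hypothetical helper)."""
    refreshes = counters["CacheRefreshCount"]
    requested = counters["RequestBytes"]
    read = counters["ReadToCacheBytes"]
    return {
        # Bytes read per byte the ORC library asked for; values well above 1
        # suggest that merging pulls in a lot of unneeded data.
        "read_amplification": read / requested if requested else 0.0,
        # How many library read calls each merged IO served on average.
        "requests_per_merged_io": counters["RequestIO"] / refreshes if refreshes else 0.0,
        # Average size of one merged IO in bytes.
        "avg_merged_io_bytes": read / refreshes if refreshes else 0.0,
    }
```

With made-up values of CacheRefreshCount = 4, ReadToCacheBytes = 32 MiB, RequestBytes = 16 MiB, and RequestIO = 400, each merged IO serves 100 library requests on average, but twice as many bytes are read as requested, which may be a hint to lower orc_max_merge_distance.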

Note that RangeCacheFileReader wraps the reader that actually reads the data, such as the HDFS reader. Strictly speaking, CacheRefreshCount is therefore not the number of IOs issued to HDFS: a single request to the HDFS reader may not return all the data at once, so one cache refresh can take several underlying reads.
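The difference between cache refreshes and underlying IOs can be illustrated with a toy wrapper. This is a sketch under stated assumptions, not the Doris implementation: `ShortReadBackend` and `RangeCacheReaderSketch` are hypothetical names, the backend is an in-memory stand-in for something like an HDFS reader that returns short reads, and requests outside every merged range simply raise an error here.

```python
from typing import List, Tuple


class ShortReadBackend:
    """Hypothetical backend that returns at most max_chunk bytes per call."""

    def __init__(self, data: bytes, max_chunk: int):
        self.data = data
        self.max_chunk = max_chunk
        self.calls = 0  # underlying IOs actually issued

    def read_at(self, offset: int, length: int) -> bytes:
        self.calls += 1
        return self.data[offset:offset + min(length, self.max_chunk)]


class RangeCacheReaderSketch:
    """Serves reads from a cache that holds one merged range at a time."""

    def __init__(self, inner: ShortReadBackend, ranges: List[Tuple[int, int]]):
        self.inner = inner
        self.ranges = ranges  # merged (offset, length) ranges
        self.cache_offset = 0
        self.cache = b""
        self.cache_refresh_count = 0
        self.read_to_cache_bytes = 0
        self.request_io = 0
        self.request_bytes = 0

    def _refresh(self, offset: int, length: int) -> None:
        # One refresh may need several backend calls because of short reads.
        buf = bytearray()
        while len(buf) < length:
            chunk = self.inner.read_at(offset + len(buf), length - len(buf))
            if not chunk:
                raise EOFError("unexpected end of file")
            buf += chunk
        self.cache_offset, self.cache = offset, bytes(buf)
        self.cache_refresh_count += 1
        self.read_to_cache_bytes += length

    def read_at(self, offset: int, length: int) -> bytes:
        self.request_io += 1
        self.request_bytes += length
        end = offset + length
        cached = self.cache_offset <= offset and end <= self.cache_offset + len(self.cache)
        if not cached:
            for r_off, r_len in self.ranges:
                if r_off <= offset and end <= r_off + r_len:
                    self._refresh(r_off, r_len)
                    break
            else:
                raise ValueError("request is not covered by any merged range")
        start = offset - self.cache_offset
        return self.cache[start:start + length]
```

Reading two 100-byte pieces from a single 256-byte merged range over a backend that returns at most 64 bytes per call gives one cache refresh but four backend calls, which is the gap between CacheRefreshCount and the real number of IOs described above.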

This PR also involves changes to the apache-orc third-party library: https://github.com/apache/doris-thirdparty/pull/244. Reference implementation: https://github.com/trinodb/trino/blob/master/lib/trino-orc/src/main/java/io/trino/orc/OrcDataSourceUtils.java#L36

Summary:

set orc_tiny_stripe_threshold = xxx;
set orc_once_max_read_size = xxx;
set orc_max_merge_distance = xxx;

# xxx is the size in bytes

hubgeter avatar Oct 17 '24 03:10 hubgeter

Thank you for your contribution to Apache Doris. Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website. See Doris Document.

doris-robot avatar Oct 17 '24 03:10 doris-robot

clang-tidy review says "All clean, LGTM! :+1:"

github-actions[bot] avatar Oct 17 '24 03:10 github-actions[bot]

clang-tidy review says "All clean, LGTM! :+1:"

github-actions[bot] avatar Oct 18 '24 13:10 github-actions[bot]

clang-tidy review says "All clean, LGTM! :+1:"

github-actions[bot] avatar Oct 18 '24 15:10 github-actions[bot]

run buildall

hubgeter avatar Oct 18 '24 15:10 hubgeter

clang-tidy review says "All clean, LGTM! :+1:"

github-actions[bot] avatar Oct 18 '24 15:10 github-actions[bot]

TeamCity be ut coverage result: Function Coverage: 37.44% (9711/25935) Line Coverage: 28.72% (80617/280718) Region Coverage: 28.16% (41717/148147) Branch Coverage: 24.75% (21211/85708) Coverage Report: http://coverage.selectdb-in.cc/coverage/a2b2845ee2287aa55ea3ebe16723c9816e6b93ae_a2b2845ee2287aa55ea3ebe16723c9816e6b93ae/report/index.html

doris-robot avatar Oct 18 '24 17:10 doris-robot

clang-tidy review says "All clean, LGTM! :+1:"

github-actions[bot] avatar Oct 18 '24 18:10 github-actions[bot]

run buildall

hubgeter avatar Oct 18 '24 18:10 hubgeter

clang-tidy review says "All clean, LGTM! :+1:"

github-actions[bot] avatar Oct 18 '24 18:10 github-actions[bot]

TeamCity be ut coverage result: Function Coverage: 37.44% (9710/25935) Line Coverage: 28.72% (80619/280723) Region Coverage: 28.16% (41717/148149) Branch Coverage: 24.75% (21213/85710) Coverage Report: http://coverage.selectdb-in.cc/coverage/0b64a8f60b8c440fa201b7d439bb879ede6e9970_0b64a8f60b8c440fa201b7d439bb879ede6e9970/report/index.html

doris-robot avatar Oct 18 '24 20:10 doris-robot

clang-tidy review says "All clean, LGTM! :+1:"

github-actions[bot] avatar Oct 21 '24 13:10 github-actions[bot]

clang-tidy review says "All clean, LGTM! :+1:"

github-actions[bot] avatar Oct 21 '24 13:10 github-actions[bot]

run buildall

hubgeter avatar Oct 21 '24 13:10 hubgeter

TeamCity be ut coverage result: Function Coverage: 37.48% (9712/25912) Line Coverage: 28.71% (80583/280655) Region Coverage: 28.17% (41711/148095) Branch Coverage: 24.72% (21187/85692) Coverage Report: http://coverage.selectdb-in.cc/coverage/e7e92353292e859cd3cbd992ca01afc34be54bb2_e7e92353292e859cd3cbd992ca01afc34be54bb2/report/index.html

doris-robot avatar Oct 21 '24 15:10 doris-robot

run buildall

hubgeter avatar Oct 22 '24 01:10 hubgeter

clang-tidy review says "All clean, LGTM! :+1:"

github-actions[bot] avatar Oct 22 '24 02:10 github-actions[bot]

run buildall

hubgeter avatar Oct 22 '24 03:10 hubgeter

clang-tidy review says "All clean, LGTM! :+1:"

github-actions[bot] avatar Oct 22 '24 03:10 github-actions[bot]

TeamCity be ut coverage result: Function Coverage: 37.47% (9715/25928) Line Coverage: 28.72% (80629/280702) Region Coverage: 28.16% (41715/148119) Branch Coverage: 24.72% (21187/85698) Coverage Report: http://coverage.selectdb-in.cc/coverage/ecdec85325f788c59162f24fb0f5f292bd15b931_ecdec85325f788c59162f24fb0f5f292bd15b931/report/index.html

doris-robot avatar Oct 22 '24 03:10 doris-robot

run buildall

hubgeter avatar Oct 22 '24 06:10 hubgeter

clang-tidy review says "All clean, LGTM! :+1:"

github-actions[bot] avatar Oct 22 '24 06:10 github-actions[bot]

PR approved by at least one committer and no changes requested.

github-actions[bot] avatar Oct 22 '24 08:10 github-actions[bot]

PR approved by anyone and no changes requested.

github-actions[bot] avatar Oct 22 '24 08:10 github-actions[bot]

TeamCity be ut coverage result: Function Coverage: 37.47% (9716/25928) Line Coverage: 28.72% (80616/280710) Region Coverage: 28.15% (41702/148118) Branch Coverage: 24.72% (21181/85698) Coverage Report: http://coverage.selectdb-in.cc/coverage/56c7be57881dd36b492cae96971f9cdc8af60706_56c7be57881dd36b492cae96971f9cdc8af60706/report/index.html

doris-robot avatar Oct 22 '24 08:10 doris-robot

run buildall

hubgeter avatar Oct 22 '24 10:10 hubgeter

clang-tidy review says "All clean, LGTM! :+1:"

github-actions[bot] avatar Oct 22 '24 10:10 github-actions[bot]

TeamCity be ut coverage result: Function Coverage: 37.48% (9718/25931) Line Coverage: 28.74% (80609/280453) Region Coverage: 28.17% (41695/148016) Branch Coverage: 24.73% (21174/85638) Coverage Report: http://coverage.selectdb-in.cc/coverage/e801f98bba27e3d790647995f0a624fd049fb168_e801f98bba27e3d790647995f0a624fd049fb168/report/index.html

doris-robot avatar Oct 22 '24 12:10 doris-robot

run buildall

hubgeter avatar Oct 28 '24 03:10 hubgeter

clang-tidy review says "All clean, LGTM! :+1:"

github-actions[bot] avatar Oct 28 '24 03:10 github-actions[bot]