
storage: better way to determine `fetch_size`

Open skyzh opened this issue 2 years ago • 5 comments

Currently, the logic is like:

https://github.com/risinglightdb/risinglight/blob/fea5e0b3de264e3f2275d1ca6c04c2c97ef8d639/src/storage/secondary/rowset/rowset_iterator.rs#L108-L122

If any column iterator provides a hint, we take the minimum of the remaining item counts, N, so that this round issues as few I/O requests as possible. This number N can be:

  • very large (e.g., >= 200k)
  • inaccurate (e.g., some columns may have remaining_items == 0 and still have to issue I/O for this round, so we cannot determine the best fetch_size for now)
  • not limited by ROWSET_MAX_OUTPUT. This value is only used when none of the columns gives a hint, so the name of this const is somewhat misleading 🤪

We should find a better way to determine fetch_size, e.g., by removing the fetch_hint interface and relying solely on the column index information.

skyzh, Feb 19 '22 14:02

My idea is to keep a fetch-size vec that holds the fetch size for each column. If a column's size is 0, it is skipped directly and no I/O is needed. 🥺

Is that okay with you? @skyzh

eliasyaoyc, Oct 14 '22 09:10

Could you please elaborate in more detail? Note that we must fetch the same number of rows for all columns.

skyzh, Oct 14 '22 15:10

Old behavior: [col1_fetch_size = 15, col2_fetch_size = 10, col3_fetch_size = 0] uses ROWSET_MAX_OUTPUT directly. My opinion is: [col1_fetch_size = 15, col2_fetch_size = 10, col3_fetch_size = 0] -> [col1_fetch_size = 10, col2_fetch_size = 10, col3_fetch_size = 10], so the minimum value is 10, which we then compare with ROWSET_MAX_OUTPUT.

eliasyaoyc, Oct 15 '22 02:10

my opinion is: [col1_fetch_size= 15, col2_fetch_size= 10, col3_fetch_size= 0] -> [col1_fetch_size= 10, col2_fetch_size= 10, col3_fetch_size= 10]

That's exactly what I want! Feel free to send a PR. If all of them are 0 (or None), we can use ROWSET_MAX_OUTPUT directly.
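For illustration, the rule agreed on above could be sketched roughly as follows. This is only a minimal sketch, not the actual risinglight code: the function name `choose_fetch_size`, the `Option<usize>` hint representation, and the value given to `ROWSET_MAX_OUTPUT` are all assumptions made for the example. It also reads "compare ROWSET_MAX_OUTPUT" as "cap the result at ROWSET_MAX_OUTPUT", which is one possible interpretation.

```rust
/// Placeholder value for illustration only; the real constant in
/// risinglight may be different.
const ROWSET_MAX_OUTPUT: usize = 2048;

/// Hypothetical helper: choose a single fetch size shared by all columns.
///
/// Each entry in `hints` is one column's hint. `None` or `Some(0)` means
/// the column has no useful hint, so it is ignored when taking the minimum.
fn choose_fetch_size(hints: &[Option<usize>]) -> usize {
    hints
        .iter()
        .flatten() // drop `None` hints
        .copied()
        .filter(|&n| n > 0) // drop zero hints
        .min() // smallest remaining hint, if any
        // No usable hint: fall back to ROWSET_MAX_OUTPUT.
        // Otherwise cap the minimum at ROWSET_MAX_OUTPUT (assumed reading).
        .map_or(ROWSET_MAX_OUTPUT, |n| n.min(ROWSET_MAX_OUTPUT))
}
```

With the hints from the example, `choose_fetch_size(&[Some(15), Some(10), Some(0)])` yields 10, so every column fetches the same number of rows. If all hints are 0 or None, the result is ROWSET_MAX_OUTPUT.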

skyzh, Oct 15 '22 02:10

/assignme

eliasyaoyc, Oct 15 '22 02:10