risinglight
storage: better way to determine `fetch_size`
Currently, the logic is like:
https://github.com/risinglightdb/risinglight/blob/fea5e0b3de264e3f2275d1ca6c04c2c97ef8d639/src/storage/secondary/rowset/rowset_iterator.rs#L108-L122
If there is any hint from the column iterators, we choose the minimum of the remaining items, N, so that we issue as few I/Os as possible in this round. This number N can be:
- very large (e.g., >= 200k)
- not accurate (e.g., there could be columns with remaining_items == 0, and they have to issue I/O for this round. We cannot determine the best fetch_size for now.)
- not limited by `ROWSET_MAX_OUTPUT`. This value is only used when there's no hint from any of the columns. The naming of this const is somewhat misleading 🤪
We should find a better way to determine `fetch_size`, e.g., by removing the `fetch_hint` interface and relying solely on the column index information.
My idea is to keep a vec of fetch sizes, one for each column. If a column's size is 0, it is skipped directly and no I/O is needed. 🥺
Is that okay with you? @skyzh
Could you please elaborate in more detail? Note that we must fetch the same number of rows for all columns.
old -> `[col1_fetch_size = 15, col2_fetch_size = 10, col3_fetch_size = 0]` will use `ROWSET_MAX_OUTPUT` directly.
My opinion is: `[col1_fetch_size = 15, col2_fetch_size = 10, col3_fetch_size = 0]` -> `[col1_fetch_size = 10, col2_fetch_size = 10, col3_fetch_size = 10]`, so the minimum value is 10, which we then compare with `ROWSET_MAX_OUTPUT`.
> my opinion is: `[col1_fetch_size = 15, col2_fetch_size = 10, col3_fetch_size = 0]` -> `[col1_fetch_size = 10, col2_fetch_size = 10, col3_fetch_size = 10]`
That's exactly what I want! Feel free to send a PR. If all of them are 0 (or None), we can use `ROWSET_MAX_OUTPUT` directly.
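For reference, the rule agreed on above could be sketched roughly as follows. This is only an illustration, not the actual `rowset_iterator.rs` code: `choose_fetch_size` is a hypothetical helper, the hints are modeled as `Option<usize>`, and the value given to `ROWSET_MAX_OUTPUT` here is a placeholder. Capping the minimum at `ROWSET_MAX_OUTPUT` is one reading of "compare `ROWSET_MAX_OUTPUT`" and is an assumption.

```rust
/// Placeholder standing in for the real constant; the value is an assumption.
const ROWSET_MAX_OUTPUT: usize = 2048;

/// Hypothetical helper: pick a single fetch size shared by all columns.
///
/// Hints that are `None` or `0` carry no information and are ignored.
/// The smallest remaining hint is used, capped at `ROWSET_MAX_OUTPUT`
/// (the cap is an assumption). If no usable hint exists, fall back to
/// `ROWSET_MAX_OUTPUT` directly.
fn choose_fetch_size(hints: &[Option<usize>]) -> usize {
    hints
        .iter()
        .filter_map(|hint| *hint) // drop `None`
        .filter(|&hint| hint != 0) // drop zero hints
        .min() // smallest usable hint, if any
        .map(|min_hint| min_hint.min(ROWSET_MAX_OUTPUT))
        .unwrap_or(ROWSET_MAX_OUTPUT)
}
```

With the example from this thread, `choose_fetch_size(&[Some(15), Some(10), Some(0)])` yields 10, and every column then fetches 10 rows. When every hint is `0` or `None`, the result is `ROWSET_MAX_OUTPUT`.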
/assignme