risingwave icon indicating copy to clipboard operation
risingwave copied to clipboard

storage: maintain L0 SST index in compute nodes

Open hzxa21 opened this issue 3 years ago • 2 comments

Background

In our system, the reader of the internal states is always the writer of the state. Currently, the storage LSM read path of internal states is unaware of this assumption. Due to the fact that L0 SSTs can overlap with each other and we use key range to filter SSTs, CN may end up fetching unused L0 SSTs on reads:

  1. State of the same table_id (i.e. key prefix) can be in different L0 SSTs since we can place different shards of an operator onto different CNs. In this case, a shard will hit many overlapping L0 SSTs on reads but in fact it only needs to read its own portion of L0 SSTs.
  2. State written by the same CN is batched into the same L0 SSTs, causing L0 SSTs to have wide key range. Therefore, only using key range to filter L0 SSTs can generate many false positives.

To solve this problem, we need to maintain some kind of L0 SST index to leverage the "reader of the internal states is always the writer of the state" assumption to improve L0 SST filtering.

Solution

After discussion with @skyzh, @st1page , @wenym1 and @Li0k, we have come up with some ideas:

  • CN stores the sst_id of the L0 SSTs it has uploaded to S3. We call it LocalL0SSTs.
  • On reads, we use the "intersection of LocalL0SSTs and L0 SSTs in HummockVersion" + L1-N SSTs in HummockVersion to construct a local version for reads. Existing SST filtering logic still applies.
  • On HummockVersion update, we update LocalL0SSTs by removing the deleted SSTs (due to compaction). Thanks to incremental version fetch, we can easily identify the deleted SSTs.

Under this solution, we will only use the full list of L0 SSTs in HummockVersion when LocalL0SSTs is absent: 1) failover, 2) batch query (if batch reader is not co-scheduled with streaming executors)

Further optimizations

  • L0->L0 compaction: we may prefer to constrain the input SSTs of L0->L0 compaction to be LocalL0SSTs to reduce read amplification on L0 SSTs. That means CN needs to trigger L0->L0 compaction providing LocalL0SSTs information to compaction manager. Also, I am not sure how this can be integrated with the sub-level design. Need further investigation.
  • LocalL0SSTs data structure: flat or map?

hzxa21 avatar Jul 21 '22 06:07 hzxa21

Note that wait_epoch will now need to download LocalL0SSTs. Need to take special care of the batch read process.

skyzh avatar Jul 21 '22 06:07 skyzh

This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.

github-actions[bot] avatar Sep 20 '22 02:09 github-actions[bot]

@Li0k Can you help port this to the new state store implementation?

hzxa21 avatar Dec 01 '22 06:12 hzxa21

Hi, folks, any updates on this?

fuyufjh avatar Aug 08 '23 10:08 fuyufjh

Hi, folks, any updates on this?

cc @Little-Wallace

Li0k avatar Aug 09 '23 05:08 Li0k

We have a new RFC proposed. I will remove this issue from milestone and re-add it when the RFC is finialized.

hzxa21 avatar Nov 08 '23 09:11 hzxa21

We have a new RFC proposed. I will remove this issue from milestone and re-add it when the RFC is finialized.

ok while I still think flat is better than map for LocalL0SSTs data structure

soundOfDestiny avatar Nov 09 '23 12:11 soundOfDestiny