Splitstore race condition due to caching and reverts

Open Stebalien opened this issue 1 year ago • 5 comments

The splitstore may remove important state given the following sequence of events:

  1. Client syncs to tipset A at height X.
  2. Client switches to tipset B at height X.
  3. Splitstore starts garbage collecting.
  4. Client switches back to tipset A at height X.

In step 4, the client will not re-execute tipset A because its state will already be in the cache, so the state for tipset A will not get rewritten. The splitstore will fail to keep the state from tipset A because (a) it was not reachable from tipset B and (b) it was not written after garbage collection started.

This can lead to corrupted datastores with missing blocks, leading to state mismatches and sync failures when the splitstore is enabled.
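
The four steps can be replayed in a toy model. This is only a sketch under simplifying assumptions: `store`, `node`, `startGC`, `finishGC`, and `runRace` are hypothetical names that do not exist in lotus, and garbage collection is reduced to "keep what was reachable at the start or written since".

```go
package main

import "fmt"

// Toy model only: none of these types exist in lotus.
type store struct {
	blocks map[string]bool // state roots currently on disk
	gcLive map[string]bool // roots protected by the running GC; nil when idle
}

// put writes a state root; writes made during GC are protected from deletion.
func (s *store) put(root string) {
	s.blocks[root] = true
	if s.gcLive != nil {
		s.gcLive[root] = true
	}
}

// startGC marks everything reachable from the current head as live.
func (s *store) startGC(reachable ...string) {
	s.gcLive = map[string]bool{}
	for _, r := range reachable {
		s.gcLive[r] = true
	}
}

// finishGC deletes every root that was neither reachable nor written during GC.
func (s *store) finishGC() {
	for r := range s.blocks {
		if !s.gcLive[r] {
			delete(s.blocks, r)
		}
	}
	s.gcLive = nil
}

type node struct {
	st    *store
	cache map[string]string // tipset key -> state root
}

// apply switches to a tipset, executing it only on a cache miss.
func (n *node) apply(tipset string) string {
	if root, ok := n.cache[tipset]; ok {
		return root // cache hit: nothing is written to the store
	}
	root := "state-of-" + tipset
	n.st.put(root)
	n.cache[tipset] = root
	return root
}

// runRace replays steps 1-4 and reports whether A's state survived GC.
func runRace() bool {
	n := &node{st: &store{blocks: map[string]bool{}}, cache: map[string]string{}}
	n.apply("A")          // 1. sync to tipset A
	rootB := n.apply("B") // 2. switch to tipset B
	n.st.startGC(rootB)   // 3. GC starts; only B's state is reachable
	rootA := n.apply("A") // 4. switch back to A: cache hit, no write
	n.st.finishGC()
	return n.st.blocks[rootA]
}

func main() {
	fmt.Println("state for A survived GC:", runRace()) // prints: state for A survived GC: false
}
```

In this model the cache still maps tipset A to its state root after GC, but the root itself is gone from the store, which matches the missing-block failure described above.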

Stebalien avatar Aug 28 '24 18:08 Stebalien

2024-09-03

During the triage we discussed if we could drop the cache (maybe once a day). But we need to investigate whether this is feasible. @ZenGround0, you have a lot of knowledge about the splitstore; do you know if this would be okay? Also, once we schedule some more time to tackle this issue, maybe pair up with someone else so we can share some knowledge about the splitstore.

rjan90 avatar Sep 03 '24 21:09 rjan90

> During the triage we discussed if we could drop the cache (maybe once a day).

Specifically, drop the state cache for all tipsets not on the canonical chain at the start of compaction. That way we have to recompute their state when switching to them, ensuring the splitstore sees that their state is live.
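
A minimal sketch of that pruning step, assuming the state cache can be treated as a map from tipset key to state root and the canonical chain as a set of tipset keys. `pruneNonCanonical` and both parameters are hypothetical, not lotus APIs.

```go
package main

import "fmt"

// pruneNonCanonical removes every cache entry whose tipset key is not on the
// canonical chain and returns how many entries were dropped.
// Hypothetical helper: the real cache and chain types in lotus differ.
func pruneNonCanonical(cache map[string]string, canonical map[string]bool) int {
	dropped := 0
	for key := range cache {
		if !canonical[key] {
			delete(cache, key) // deleting during range is safe in Go
			dropped++
		}
	}
	return dropped
}

func main() {
	cache := map[string]string{"A": "root-A", "B": "root-B"}
	canonical := map[string]bool{"B": true} // B is the current chain at height X
	dropped := pruneNonCanonical(cache, canonical)
	fmt.Println(dropped, len(cache)) // prints: 1 1
}
```

After pruning, a later switch back to tipset A misses the cache and has to re-execute, so its state is written again while compaction is running.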

Stebalien avatar Sep 03 '24 21:09 Stebalien

@rjan90 I could probably help speed this up by pairing, and I would like to do that. I am rusty, though, so I will need time to understand what's going on and the proposed solution.

ZenGround0 avatar Sep 10 '24 14:09 ZenGround0

From discussion: the st cache maps from tipset keys to states.

Possible simple solution -- always do a Has check on the root CID of the state when we access the cache. This will force a splitstore read, which should lock the subtree.
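
A rough sketch of that idea, assuming a store whose `Has` records the access while compaction is running. `store`, `touched`, and `stateCache` are hypothetical stand-ins; how the real splitstore protects a subtree on read is not modelled here.

```go
package main

import "fmt"

// Hypothetical stand-in for a splitstore-like blockstore.
type store struct {
	blocks  map[string]bool // roots currently stored
	touched map[string]bool // roots accessed during a running compaction; nil when idle
}

// Has reports whether root is stored and, while compaction is running,
// records the access so the root counts as live.
func (s *store) Has(root string) bool {
	ok := s.blocks[root]
	if ok && s.touched != nil {
		s.touched[root] = true
	}
	return ok
}

// stateCache maps tipset keys to state roots.
type stateCache struct {
	st      *store
	entries map[string]string
}

// Get returns a cached root only if the store confirms it still has it.
func (c *stateCache) Get(tipset string) (string, bool) {
	root, ok := c.entries[tipset]
	if !ok || !c.st.Has(root) {
		return "", false // miss, or stale entry whose root is gone
	}
	return root, true
}

func main() {
	st := &store{blocks: map[string]bool{"root-A": true}, touched: map[string]bool{}}
	c := &stateCache{st: st, entries: map[string]string{"A": "root-A", "C": "root-C"}}
	_, okA := c.Get("A") // root present: hit, and the access is recorded
	_, okC := c.Get("C") // root missing: treated as a miss
	fmt.Println(okA, st.touched["root-A"], okC) // prints: true true false
}
```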

ZenGround0 avatar Mar 31 '25 15:03 ZenGround0

Really, we can just:

  1. Not return the cached value.
  2. Set recompute = true.
  3. Finish out the rest of the function.
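
One possible reading of these steps, sketched below. It assumes they apply when the Has check on the cached root fails; that condition is an assumption, as are `loadState` and its `has` and `execute` parameters, none of which are the real lotus function.

```go
package main

import "fmt"

// loadState returns the state root for a tipset. The second result reports
// whether a stale cache entry forced a recompute. Hypothetical sketch only.
func loadState(
	tipset string,
	cache map[string]string,
	has func(root string) bool,
	execute func(tipset string) string,
) (string, bool) {
	recompute := false
	if cached, ok := cache[tipset]; ok {
		if has(cached) {
			return cached, recompute
		}
		// Steps 1 and 2: do not return the cached value; force a recompute.
		recompute = true
	}
	// Step 3: finish the rest of the function, which executes the tipset
	// and writes its state back to the store and the cache.
	root := execute(tipset)
	cache[tipset] = root
	return root, recompute
}

func main() {
	stored := map[string]bool{} // root-A was already garbage collected
	cache := map[string]string{"A": "root-A"}
	has := func(root string) bool { return stored[root] }
	execute := func(tipset string) string {
		root := "root-" + tipset
		stored[root] = true // re-execution rewrites the state
		return root
	}
	_, first := loadState("A", cache, has, execute)  // stale entry: recompute
	_, second := loadState("A", cache, has, execute) // state is back: cache hit
	fmt.Println(first, second) // prints: true false
}
```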

Stebalien avatar Mar 31 '25 15:03 Stebalien