materialize icon indicating copy to clipboard operation
materialize copied to clipboard

storage: relax since check in read-only mode

Open aljoscha opened this issue 1 year ago • 2 comments
trafficstars

Fixes #28885

With this change, we might now get panics in other places where we make assumptions about sinces/uppers that don't hold in read-only mode or when there are concurrent coordinator-like processes.

cc @benesch & @def-: As mentioned in #28885, this fixes the panic but we might discover new ones.

Motivation

Tips for reviewer

Checklist

  • [ ] This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • [ ] This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • [ ] If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • [ ] If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • [ ] If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

aljoscha avatar Aug 19 '24 10:08 aljoscha

Triggering nightly: https://buildkite.com/materialize/nightly/builds/9168 Edit: Second try: https://buildkite.com/materialize/nightly/builds/9174

def- avatar Aug 19 '24 10:08 def-

How do we know this patch is consistent with everything else in the controller? This now admits creating a collection with a weird state (read frontier == [] and also without a dataflow installed). Ideally we shouldn't need to do this calculus. Strong preference that we return a "CollectionDeleted" and handle the fact that a collection was deleted in the adapter by either also applying the deletion or otherwise handle the fact that it is missing state updates there.

In a way, that is what we are currently doing with the panic in place:

  1. the envd process panics
  2. it's restarted
  3. on restart it reads latest catalog state, which doesn't contain the collection in question anymore
  4. bootstrap etc. proceeds succesfully

It would be a bigger task to change in-memory catalog state based on what the controller reports back, and in normal operations we don't want to do that: the catalog is the source of truth and then the controller doesn't concur with something that indicates a bug.

I still think since=[] is a valid state. It's weird, yes. But someone (some coordinator process) has allowed the since to go to empty, so here we are, and the controller can restart from that. It's just that it won't let anyone read from the collection, which is also correct. Plus, I think this scenario (empty since) can also happen once we allow concurrent adapter-like/coordinator-like processes, which we will in the future for use-case isolation.

aljoscha avatar Aug 19 '24 11:08 aljoscha