
environmentd: gracefully terminate when fenced out by new gen

Open benesch opened this issue 1 year ago • 3 comments

@jkosh44 I wanted to help out with the boring parts here: futzing with the error types enough that we can exit gracefully no matter where the catalog discovers it's been fenced—i.e., this part of your proposal in https://github.com/MaterializeInc/materialize/issues/29199#issuecomment-2307873349:

  • Whenever the adapter/Coordinator/envd sees a DeployGenerationFence error, exit with exit code 0.

These parts are still TODO:

  • The durable catalog will store in memory the deploy generation it was started with.
  • In the durable catalog, instead of immediately returning a fence error, always drain all updates from the persist shard. If there is a deploy generation greater than the deploy generation stored in memory return a DeployGenerationFence error, otherwise return the original fence error.
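The second TODO could look something like this sketch. All names here are hypothetical (this is not Materialize's actual error type): after a fence, drain the persist shard's updates and upgrade the error to a deploy-generation fence if a newer generation shows up.

```rust
// Hypothetical sketch of the proposed fence classification. Not the real
// Materialize types; just illustrates the drain-then-classify idea.
#[derive(Debug, PartialEq)]
enum FenceError {
    /// Fenced by a process with a newer epoch.
    Epoch { current_epoch: u64, fence_epoch: u64 },
    /// Fenced by a process with a newer deploy generation.
    DeployGeneration { current_generation: u64, fence_generation: u64 },
}

/// After draining all updates from the persist shard, check whether any of
/// them carries a deploy generation newer than the one we started with. If
/// so, return a deploy-generation fence (so the caller can exit(0));
/// otherwise return the original fence error.
fn classify_fence(
    started_with_generation: u64,
    drained_generations: &[u64],
    original: FenceError,
) -> FenceError {
    match drained_generations.iter().copied().max() {
        Some(max) if max > started_with_generation => FenceError::DeployGeneration {
            current_generation: started_with_generation,
            fence_generation: max,
        },
        _ => original,
    }
}
```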

I figured you'd be more excited (and much faster than I) at filling in those bits.


If an environmentd process discovers that it's been fenced out by an environmentd process with a newer generation, it should exit with code zero (0). The entrypoint.sh script in the environmentd Docker image will then sleep until the orchestration layer gets around to tearing down the container.

Previously in this situation, the old environmentd would halt with exit code two (2), then restart and panic upon discovering that the catalog had been written to by a newer version. It would then remain in a panic loop until the orchestration layer got around to tearing down the container. This did not cause incorrectness, but led to log spam in tests and noisy alerts during release windows in staging and production.
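The exit-code decision might be sketched like this (hypothetical enum and function names, not Materialize's actual error types):

```rust
// Minimal sketch of the new exit-code policy described above.
#[derive(Debug)]
enum FatalCatalogError {
    /// Fenced out by an environmentd with a newer deploy generation:
    /// an expected handoff, so exit 0 and let entrypoint.sh sleep until
    /// the orchestrator tears down the container.
    DeployGenerationFence,
    /// Any other fatal catalog error keeps the old halt behavior.
    Other,
}

/// Map a fatal catalog error to a process exit code.
fn exit_code_for(err: &FatalCatalogError) -> i32 {
    match err {
        FatalCatalogError::DeployGenerationFence => 0,
        FatalCatalogError::Other => 2,
    }
}
```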

Fix https://github.com/MaterializeInc/materialize/issues/29199.

Motivation

  • This PR fixes a recognized bug.

Checklist

  • [x] This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • [x] This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • [x] If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • [x] If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • [x] If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

benesch avatar Aug 24 '24 02:08 benesch

Unfortunately, it doesn't look like the error is resolved. I'll continue to look into it.

jkosh44 avatar Aug 27 '24 20:08 jkosh44

@def- The bumped version upgrade test is passing (https://buildkite.com/materialize/nightly/builds/9278#019194c6-cda0-4361-941e-1b4fd070249d) but the legacy upgrade test and others are failing with the usual error.

legacy-upgrade-materialized2-1    | environmentd: fatal: incompatible persist version 0.115.0-dev.0, current: 0.114.1, make sure to upgrade the catalog one version forward at a time

Is it possible that those tests need to be upgraded too, or will they not pass until the next version is cut?

jkosh44 avatar Aug 27 '24 21:08 jkosh44

Is it possible that those tests need to be upgraded too, or will they not pass until the next version is cut?

Correct, old versions will not get fixed instantly by this. But in Nightly I'm also seeing another failure, in Txn-wal fencing:

txn-wal-fencing-mz_first-1   | thread 'coordinator' panicked at src/adapter/src/util.rs:286:23: unable to confirm leadership: Catalog(Error { kind: Durable(Fence(Epoch { current_epoch: 2, fence_epoch: 3 })) })

Which seems to be new with this PR.

Ah thanks, I'll look into that.

jkosh44 avatar Aug 27 '24 22:08 jkosh44

I'm still looking into this but my gut take is that we need to update the test. The test is supposed to do fencing and the test is still passing, it's just an unexpected error line.

jkosh44 avatar Aug 28 '24 14:08 jkosh44

Ok, I'll push an ignore for that and retrigger the test: https://buildkite.com/materialize/nightly/builds/9299

Edit: there were already ignores for other fencing errors in this test, so this actually totally makes sense. Sorry that I missed that before.

def- avatar Aug 28 '24 16:08 def-

Ok, my gut take was mostly right. This is where we see the error in the logs:

0dt-mz_new-1     | environmentd: 2024-08-27T22:17:44.863044Z  INFO mz_adapter::coord: checked caught-up status of collections compute_caught_up=true compute_hydrated=true
0dt-mz_new-1     | environmentd: 2024-08-27T22:17:44.863120Z  INFO mz_environmentd::deployment::preflight: all clusters hydrated
0dt-mz_new-1     | environmentd: 2024-08-27T22:17:44.863135Z  INFO mz_environmentd::deployment::preflight: announced as ready to promote; waiting for promotion
0dt-mz_new-1     | environmentd: 2024-08-27T22:17:45.852189Z  INFO mz_adapter::coord: not advancing timelines in read-only mode
0dt-mz_new-1     | environmentd: 2024-08-27T22:17:46.037946Z  INFO mz_environmentd::deployment::preflight: promoted; attempting takeover
0dt-mz_old-1     | thread 'coordinator' panicked at src/adapter/src/util.rs:286:23:
0dt-mz_old-1     | unable to confirm leadership: Catalog(Error { kind: Durable(Fence(Epoch { current_epoch: 2, fence_epoch: 3 })) })
0dt-mz_old-1     | stack backtrace:
0dt-mz_new-1     | environmentd: 2024-08-27T22:17:46.264725Z  WARN mz_environmentd::deployment::preflight: halting process: fenced out old deployment; rebooting as leader

So the new instance is purposely fencing out the old instance, and the error we see is a fencing error. This PR modified that error from a halt to a panic, which is likely why we're now seeing it trigger test failures. I'm still not 100% convinced that we should have done that. I think we may want to consider keeping halt and adding a third variant that causes an exit(0).

jkosh44 avatar Aug 28 '24 16:08 jkosh44

Why do we see a FenceError::Epoch instead of a FenceError::DeployGeneration? If we saw the latter instead of the former, we'd exit(0) instead of panicking.

benesch avatar Aug 28 '24 16:08 benesch

Opening the catalog looks like this:

  1. Write new epoch to persist shard.
  2. Write any upgrades to persist shard.
  3. Write catalog version and deploy generation to persist shard.

So if the fencing happens after step (1) but before step (3), we'll see an epoch error instead of a deploy generation error. I've been meaning to move the deploy generation write into step (1), but it's not as simple as it sounds.
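One way to close that window, as the follow-up PR proposes, is to combine the epoch and deploy generation into a single fence token written in step (1). A hypothetical sketch (not the actual type from #29269): ordering compares the deploy generation first, then the epoch, so a newer generation always fences and can be reported as a generation fence rather than a bare epoch fence.

```rust
/// Hypothetical combined fence token. The derived Ord compares fields in
/// declaration order, so deploy_generation takes precedence over epoch.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct FenceToken {
    deploy_generation: u64,
    epoch: u64,
}
```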

jkosh44 avatar Aug 28 '24 17:08 jkosh44

Here are all the errors that return true for should_halt:

  • StorageError::ResourceExhausted(_)
  • StorageError::CollectionMetadataAlreadyExists(_)
  • StorageError::PersistShardAlreadyInUse(_)
  • StorageError::TxnWalShardAlreadyExists
  • DurableCatalogError::Fence(_)
  • DurableCatalogError::IncompatibleDataVersion { .. }
  • DurableCatalogError::IncompatiblePersistVersion { .. }
  • DurableCatalogError::Proto(_)
  • DurableCatalogError::UpperMismatch { .. }

After I make the change to atomically write the epoch and deploy generation at the same time, I'm OK with panicking on all catalog errors except for DurableCatalogError::Fence(FenceError::DeployGeneration{ .. }) (which would exit(0)). However, it's not really clear to me why those storage errors lead to a halt, or whether it's safe to convert them into panics. @aljoscha any thoughts there?
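The proposed policy boils down to a predicate like the following sketch (hypothetical, cut-down types; the real enums have more variants and carry payloads):

```rust
// Sketch of the proposed policy: only a deploy-generation fence terminates
// gracefully; every other durable catalog error becomes a panic.
#[derive(Debug)]
enum FenceError {
    DeployGeneration { current: u64, fence: u64 },
    Epoch { current: u64, fence: u64 },
}

#[derive(Debug)]
enum DurableCatalogError {
    Fence(FenceError),
    IncompatibleDataVersion,
    IncompatiblePersistVersion,
    Proto,
    UpperMismatch,
}

/// True only for the expected zero-downtime handoff; everything else panics.
fn should_gracefully_terminate(err: &DurableCatalogError) -> bool {
    matches!(
        err,
        DurableCatalogError::Fence(FenceError::DeployGeneration { .. })
    )
}
```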

jkosh44 avatar Aug 28 '24 19:08 jkosh44

However, it's not really clear to me why those storage errors lead to a halt and if it's safe to convert those into a panic. @aljoscha any thoughts there?

I was wondering that too. My hunch is that those errors causing a halt is a vestige of when the storage stash was separate from the catalog, so transacting against the storage stash could be what informed you that you'd been fenced out. Now that everything lives in the one catalog, I think it's plausibly always safe to panic on those error variants, because we should never see them; we should always see a fencing error instead.

@aljoscha should absolutely double check that take though.

benesch avatar Aug 28 '24 20:08 benesch

  • StorageError::ResourceExhausted(_)

This one isn't explained by the separate storage stash theory. It's been marked as should_halt since @ParkMyCar added it in the initial webhook source PR (https://github.com/MaterializeInc/materialize/pull/20171). Maybe just an oversight?

benesch avatar Aug 28 '24 20:08 benesch

  • StorageError::ResourceExhausted

This definitely does not need to cause a halt, just an oversight that I originally had it return true.

ParkMyCar avatar Aug 28 '24 22:08 ParkMyCar

Here's my plan for this PR assuming @aljoscha confirms the above.

  1. Merge https://github.com/MaterializeInc/materialize/pull/29249, which adds some retry logic for fencing internal to the catalog.
  2. Complete and merge https://github.com/MaterializeInc/materialize/pull/29269, which builds on the above PR and combines the epoch and deploy generation into a single fence token.
  3. Rebase this PR on main.
  4. Convert all the StorageErrors to return false for should_gracefully_terminate.
  5. Merge this PR (which should be passing CI).

jkosh44 avatar Aug 28 '24 23:08 jkosh44

@jkosh44 your latest set of changes here LGTM.

benesch avatar Aug 29 '24 20:08 benesch

I went through all the storage errors for which we halt. The three catalog-related ones (catalog metadata, persist shard, txn shard) should definitely be panics, because those represent usage errors.

Re ResourceExhausted, I think we can panic. AFAICT, this only exists so we can return a specific http error code here: https://github.com/MaterializeInc/materialize/blob/0ac4a9a3f1f5ea4dec5ced810976875bac66955b/src/environmentd/src/http/webhook.rs#L389.

But ResourceExhausted shouldn't come up in the other contexts where StorageError appears, so it should be fine to panic there. @ParkMyCar, above you mentioned that it "does not need to cause a halt". That sounded like you thought the alternative would be for it to be ignored gracefully. What Joe is proposing, though, is that we instead panic. That should still be 100% OK per my comment above, because we still don't panic in the code paths relevant to webhooks.

aljoscha avatar Aug 30 '24 13:08 aljoscha

I'm convinced—thanks, Aljoscha! Merging and backporting into v0.115.1.

benesch avatar Aug 30 '24 15:08 benesch