sui [Checkpoint] wait for checkpoint service to stop during reconfig

Description

Currently during reconfig, CheckpointService tasks, including CheckpointBuilder and CheckpointAggregator, are notified to shut down. But reconfig does not wait for them to finish shutting down. There can be a race between the reconfig loop proceeding to drop the epoch db handle, while CheckpointBuilder tries to read from epoch db when creating a new checkpoint. The race can result in panics.

Test plan

CI Simulation

Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.

For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.

[ ] Protocol:
[ ] Nodes (Validators and Full nodes):
[ ] Indexer:
[ ] JSON-RPC:
[ ] GraphQL:
[ ] CLI:
[ ] Rust SDK:

May 07 '24 22:05 mwtian

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
sui-docs	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	May 7, 2024 10:26pm

3 Ignored Deployments

Name	Status	Preview	Updated (UTC)
multisig-toolkit	⬜️ Ignored (Inspect)	Visit Preview	May 7, 2024 10:26pm
sui-kiosk	⬜️ Ignored (Inspect)	Visit Preview	May 7, 2024 10:26pm
sui-typescript-docs	⬜️ Ignored (Inspect)	Visit Preview	May 7, 2024 10:26pm

May 07 '24 22:05 vercel[bot]

Discussed this with @mwtian in the office the other day. Can we make sure it is actually safe to cancel make_checkpoint and aggregator's run_and_notify, and won't leave inconsistent state in the memory? Maybe consider using fail_points to cancel these tasks while it is running, and see if it causes any errors.

May 10 '24 18:05 halfprice

Discussed this with @mwtian in the office the other day. Can we make sure it is actually safe to cancel make_checkpoint and aggregator's run_and_notify, and won't leave inconsistent state in the memory? Maybe consider using fail_points to cancel these tasks while it is running, and see if it causes any errors.

I had the same question. It might be better to keep the pre-existing graceful shutdown logic, and do a while join_set.join_next().is_some() to wait for shutdown to complete

May 10 '24 21:05 mystenmark

I was thinking that if canceling checkpoint creation is problematic, panic would cause inconsistency as well which I have not observed. The in-memory data seems per epoch and they do not seem to be concurrently accessed. I will add a fail point and if there is a strong preference, I will add the wait for processing to finish.

May 10 '24 21:05 mwtian

I was thinking that if canceling checkpoint creation is problematic, panic would cause inconsistency as well which I have not observed. The in-memory data seems per epoch and they do not seem to be concurrently accessed. I will add a fail point and if there is a strong preference, I will add the wait for processing to finish.

The difference is that the panic wipes all in memory state so we would only be dealing with crash-recovery issues. Still, the fact that we only do this at the end of the epoch is convincing to me. Between that and the fact that simtest would be able to find problems with this, I think its okay as is.

May 22 '24 03:05 mystenmark

This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days.

Jul 23 '24 01:07 github-actions[bot]

sui sui copied to clipboard

[Checkpoint] wait for checkpoint service to stop during reconfig

Description

Test plan

Release notes

sui
sui copied to clipboard