sui icon indicating copy to clipboard operation
sui copied to clipboard

[Checkpoint] wait for checkpoint service to stop during reconfig

Open mwtian opened this issue 1 year ago • 5 comments

Description

Currently during reconfig, CheckpointService tasks, including CheckpointBuilder and CheckpointAggregator, are notified to shut down. But reconfig does not wait for them to finish shutting down. There can be a race between the reconfig loop proceeding to drop the epoch db handle, while CheckpointBuilder tries to read from epoch db when creating a new checkpoint. The race can result in panics.

Test plan

CI Simulation


Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.

For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.

  • [ ] Protocol:
  • [ ] Nodes (Validators and Full nodes):
  • [ ] Indexer:
  • [ ] JSON-RPC:
  • [ ] GraphQL:
  • [ ] CLI:
  • [ ] Rust SDK:

mwtian avatar May 07 '24 22:05 mwtian

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
sui-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback May 7, 2024 10:26pm
3 Ignored Deployments
Name Status Preview Comments Updated (UTC)
multisig-toolkit ⬜️ Ignored (Inspect) Visit Preview May 7, 2024 10:26pm
sui-kiosk ⬜️ Ignored (Inspect) Visit Preview May 7, 2024 10:26pm
sui-typescript-docs ⬜️ Ignored (Inspect) Visit Preview May 7, 2024 10:26pm

vercel[bot] avatar May 07 '24 22:05 vercel[bot]

Discussed this with @mwtian in the office the other day. Can we make sure it is actually safe to cancel make_checkpoint and aggregator's run_and_notify, and won't leave inconsistent state in the memory? Maybe consider using fail_points to cancel these tasks while it is running, and see if it causes any errors.

halfprice avatar May 10 '24 18:05 halfprice

Discussed this with @mwtian in the office the other day. Can we make sure it is actually safe to cancel make_checkpoint and aggregator's run_and_notify, and won't leave inconsistent state in the memory? Maybe consider using fail_points to cancel these tasks while it is running, and see if it causes any errors.

I had the same question. It might be better to keep the pre-existing graceful shutdown logic, and do a while join_set.join_next().is_some() to wait for shutdown to complete

mystenmark avatar May 10 '24 21:05 mystenmark

I was thinking that if canceling checkpoint creation is problematic, panic would cause inconsistency as well which I have not observed. The in-memory data seems per epoch and they do not seem to be concurrently accessed. I will add a fail point and if there is a strong preference, I will add the wait for processing to finish.

mwtian avatar May 10 '24 21:05 mwtian

I was thinking that if canceling checkpoint creation is problematic, panic would cause inconsistency as well which I have not observed. The in-memory data seems per epoch and they do not seem to be concurrently accessed. I will add a fail point and if there is a strong preference, I will add the wait for processing to finish.

The difference is that the panic wipes all in memory state so we would only be dealing with crash-recovery issues. Still, the fact that we only do this at the end of the epoch is convincing to me. Between that and the fact that simtest would be able to find problems with this, I think its okay as is.

mystenmark avatar May 22 '24 03:05 mystenmark

This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Jul 23 '24 01:07 github-actions[bot]