# [Checkpoint] wait for checkpoint service to stop during reconfig

## Description
Currently during reconfig, the CheckpointService tasks, including CheckpointBuilder and CheckpointAggregator, are notified to shut down, but reconfig does not wait for them to finish shutting down. This creates a race where the reconfig loop proceeds to drop the epoch db handle while CheckpointBuilder is still reading from the epoch db to create a new checkpoint. The race can result in panics.
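For illustration, a minimal sketch of the ordering described above; the names here are hypothetical stand-ins, not the actual sui-core types:

```rust
use std::sync::Arc;
use tokio::sync::watch;

// Illustrative stand-in for the per-epoch database handle; not the real sui-core type.
struct EpochStore;

// Hypothetical sketch of the problematic ordering during reconfig.
async fn reconfig(shutdown: watch::Sender<bool>, epoch_store: Arc<EpochStore>) {
    // CheckpointBuilder and CheckpointAggregator are notified to shut down...
    let _ = shutdown.send(true);
    // ...but reconfig does not wait for them to exit, so dropping the epoch db
    // handle here can race with an in-flight CheckpointBuilder read and panic.
    drop(epoch_store);
}
```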
## Test plan
CI Simulation
## Release notes
Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.
For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.
- [ ] Protocol:
- [ ] Nodes (Validators and Full nodes):
- [ ] Indexer:
- [ ] JSON-RPC:
- [ ] GraphQL:
- [ ] CLI:
- [ ] Rust SDK:
 
Discussed this with @mwtian in the office the other day. Can we make sure it is actually safe to cancel `make_checkpoint` and the aggregator's `run_and_notify`, and that cancellation won't leave inconsistent state in memory? Maybe consider using fail_points to cancel these tasks while they are running, and see if that causes any errors.
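A rough sketch of that kind of cancellation test, with a plain `abort()` standing in for a fail point; the spawned loop and the final assertion are placeholders, not the actual CheckpointBuilder code:

```rust
use std::time::Duration;
use tokio::time::sleep;

// Hypothetical cancellation-safety test; the spawned loop stands in for
// CheckpointBuilder::make_checkpoint doing multi-step work.
#[tokio::test]
async fn cancel_mid_checkpoint_leaves_state_consistent() {
    let handle = tokio::spawn(async {
        loop {
            // Placeholder for reading/writing checkpoint state in steps.
            sleep(Duration::from_millis(10)).await;
        }
    });
    // Cancel the task while it is running, as a fail point would.
    sleep(Duration::from_millis(25)).await;
    handle.abort();
    assert!(handle.await.unwrap_err().is_cancelled());
    // A real test would now assert that shared in-memory/db state is still consistent.
}
```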
I had the same question. It might be better to keep the pre-existing graceful shutdown logic and do a `while join_set.join_next().await.is_some() {}` to wait for shutdown to complete.
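As a sketch of that option, assuming the checkpoint tasks are tracked in a tokio `JoinSet` (names here are illustrative, not the actual code):

```rust
use tokio::sync::watch;
use tokio::task::JoinSet;

// Hypothetical graceful-shutdown helper: keep the existing exit notification,
// then wait for every checkpoint task to finish before reconfig proceeds.
async fn shutdown_checkpoint_tasks(exit: watch::Sender<bool>, mut tasks: JoinSet<()>) {
    // Notify CheckpointBuilder / CheckpointAggregator to stop.
    let _ = exit.send(true);
    // Drain the JoinSet so the epoch db handle is only dropped after all tasks exit.
    while tasks.join_next().await.is_some() {}
}
```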
I was thinking that if canceling checkpoint creation were problematic, a panic would cause the same inconsistency, which I have not observed. The in-memory data appears to be per-epoch and does not seem to be concurrently accessed. I will add a fail point, and if there is a strong preference, I will add the wait for processing to finish.
The difference is that a panic wipes all in-memory state, so we would only be dealing with crash-recovery issues. Still, the fact that we only do this at the end of the epoch is convincing to me. Between that and the fact that simtest would be able to find problems with this, I think it's okay as is.