concord-bft
Remove the CRE (possibly infinite) loop from the end of state transfer.
Problem Overview
Some of the reconfiguration commands introduce a certain complexity with state transfer: they may break a replica's ability to complete state transfer (and to get the reconfiguration update). A good example of such a command is the scaling command. The current solution is to have another reconfiguration mechanism within state transfer, based on the client reconfiguration engine (CRE). The rationale behind this approach is that a replica in state transfer behaves identically to a full-copy client (read-only replica) and should get the reconfiguration updates with the same async mechanism as every other client. However, this means that in order to make sure we got the reconfiguration updates via CRE (the async mechanism), we have to add a synchronization phase. In the context of CRE, this was done by waiting until we get an update newer than the latest one we got in state transfer. Unfortunately, getting a BFT update requires replies from 2f + 1 replicas (in the regular case), which means that to finish state transfer we must have 2f + 1 other live replicas.
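For context, a minimal sketch of the kind of synchronization loop this PR removes, assuming a hypothetical `getLatestCreUpdate` accessor (not the actual concord-bft API):

```cpp
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>

// Hypothetical sketch of the (possibly infinite) CRE sync loop at the end of
// state transfer. getLatestCreUpdate is assumed to return the block number of
// the newest reconfiguration update CRE has fetched. Since a BFT read needs
// replies from 2f + 1 replicas, this loop never terminates when fewer than
// 2f + 1 other replicas are available.
void waitForCreSync(uint64_t latestStUpdateBlock,
                    const std::function<uint64_t()>& getLatestCreUpdate) {
  while (getLatestCreUpdate() <= latestStUpdateBlock) {
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }
}
```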
After carefully analyzing the problem, it appears we can keep both mechanisms as long as we synchronize between them.
We have two options:
- The replica was able to get the reconfiguration update prior to the end of state transfer. In this case, the replica gets the new configuration and starts state transfer all over again (because state transfer has not finished). Then, once we actually finish state transfer in the new configuration, we won't execute the scale command again, because it was already written to the configurations file.
- The replica was not able to get the update with CRE but was able to complete state transfer. Then we simply execute the reconfiguration update like any other reconfiguration command. As before, we update the configurations file with the new configuration.
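To make the two options concrete, here is a minimal sketch of the decision a replica could take at the end of state transfer; `configAlreadyApplied`, `configFilePath`, and the rest of the plumbing are invented for illustration and are not the actual concord-bft code:

```cpp
#include <fstream>
#include <string>

// Hypothetical sketch: decide at the end of state transfer whether the scale
// command still has to be executed. configFilePath / newConfigId are invented
// names for illustration only.
bool configAlreadyApplied(const std::string& configFilePath,
                          const std::string& newConfigId) {
  std::ifstream in(configFilePath);
  std::string line;
  while (std::getline(in, line)) {
    if (line == newConfigId) return true;  // CRE already persisted it (option 1)
  }
  return false;
}

void onStateTransferComplete(const std::string& configFilePath,
                             const std::string& newConfigId) {
  if (configAlreadyApplied(configFilePath, newConfigId)) {
    // Option 1: CRE delivered the update before state transfer ended; the
    // replica already restarted state transfer in the new configuration, so
    // the scale command must not be executed again.
    return;
  }
  // Option 2: CRE never delivered the update; execute the reconfiguration
  // command like any other, then persist the new configuration.
  std::ofstream out(configFilePath, std::ios::app);
  out << newConfigId << '\n';
}
```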
Notes:
- Removing the synchronization phase (the loop) means that every new reconfiguration command must be developed with respect to the above problem (there is no generic solution).
- The proposed change does not handle some (possible) atomicity issues (such as crashing in the middle of the state transfer callbacks). These problems should be analyzed and handled in a different PR.
Testing Done
CI
Looks risky. We need to test state transfer scenarios.
There are Apollo tests that cover that, but scheduling a deterministic test for this scenario is not supported at the moment.
The proposed change does not handle some (possible) atomicity issues (such as crashing in the middle of the state transfer callbacks). These problems should be analyzed and handled in a different PR.
@yontyon , if you do not handle all the above side cases, then my understanding is that it is not a complete solution. If a replica does crash, will it be able to continue somehow, or is it "game over" for the replica? If the latter, are you planning to implement the additional PR ASAP?
From your words, some of the persistency tests will fail for sure. I suggest disabling them; replicas are killed there tens to hundreds of times per test, so they will fail for sure on CI.
@yontyon
Are the current tests good enough, or do we need to add new tests to check the new side cases introduced by these modifications?
@glevkovich , I'm referring to this piece of code:
```cpp
// Mark ST final completion to "outside world" - After this call any incoming messages are not dropped.
setExternalCollectingState(false);
// Now after external state have changed, invoke the 2nd group of registered callbacks
// TODO - This next line should be converted to an internal message. There is no need for a callback here. ST is done.
// worst case we will have some few incoming messages dropped, until key-exchange is done.
// on_fetching_state_change_cb_registry_ should be removed
auto size = on_fetching_state_change_cb_registry_.size();
LOG_INFO(
    logger_,
    "Starting to invoke all registered calls (on_fetching_state_change_cb_registry_):" << KVLOG(size, checkpointNum));
on_fetching_state_change_cb_registry_.invokeAll(checkpointNum);
LOG_INFO(logger_, "Done invoking all registered calls (on_fetching_state_change_cb_registry_)");
```
AFAIU, if the replica has crashed during the on_fetching_state_change_cb_registry_ invocation, then it won't execute the callbacks again (because state transfer has already finished). In addition, it's rather a general warning: once you rely on two different mechanisms and one of them is implemented in a state transfer callback, the developer has to keep in mind that its logic might be executed twice (in case the replica crashed during the callback execution). Until now there was a generic solution (the CRE loop), but now it is the developer's responsibility to handle all these edge cases.
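To make this concrete, here is a minimal sketch of the kind of idempotency guard a developer would now need around such callback logic; `markerExists` and `persistMarker` stand in for an assumed durable-storage API and are not actual concord-bft calls:

```cpp
#include <functional>
#include <string>

// Hypothetical sketch of a per-callback idempotency guard around a
// reconfiguration action that runs inside a state transfer callback.
void runOnce(const std::string& marker,
             const std::function<bool(const std::string&)>& markerExists,
             const std::function<void(const std::string&)>& persistMarker,
             const std::function<void()>& callback) {
  if (markerExists(marker)) return;  // already executed before an earlier crash
  callback();
  // If the replica crashes here - after the callback's side effects but before
  // the marker is persisted - the callback will run twice on recovery. This is
  // exactly the double-execution hazard described above, deferred to a
  // follow-up PR.
  persistMarker(marker);
}
```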
To test the specific scenario, we would need to be able to crash a replica exactly where the loop used to be, and before its CRE was able to fetch the update. AFAIK, we don't have this capability at the moment. So good tests (IMO) simply check that the relevant tests do not crash (this PR has already run 4 times and all were good).
Hi @yontyon, can you please try the following (currently disabled) test with your PR? https://github.com/vmware/concord-bft/blob/master/tests/apollo/test_skvbc_restart_recovery.py#L495
I'd suggest running it in a loop, as confirmation that your changes here address the underlying test instability.
Thanks!
@yontyon
Yes, ST is done at this stage. I wonder if we still need this callback here? After all, it is not part of ST (ST is already done).
In the current master code I've added a TODO:
// TODO - This next line should be converted to an internal message. There is no need for a callback here. ST is done.