SaveAndKill workload affects the clearing of healthyZoneKey key in MachineAttrition workload
In a restart test (-f ./foundationdb/tests/restarting/from_7.1.0/VersionVectorEnableRestart-1.toml -s 353535467 -b on, at commit 0d0c72cc8, built with clang), I found that SaveAndKill stops the simulator, which interrupts the MachineAttrition workload and prevents ignoreSSFailuresForDuration from finishing. So even though the invocation is marked as uncancellable (https://github.com/apple/foundationdb/blob/0d0c72cc86180037021898769b30e396e1e157a7/fdbserver/workloads/MachineAttrition.actor.cpp#L340-L342), the actor does not actually run to completion. As a result, the last transaction in the actor, which clears the healthyZoneKey key, is never executed.
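To illustrate the ordering problem, here is a minimal toy model in Python. It is not FoundationDB code: ToySimulator, run_scenario, and the stored value are hypothetical names made up for this sketch, and the only assumption is that the run loop stops dispatching queued work once it is told to stop. A cleanup step scheduled for after the stop time then never runs, no matter how the task that scheduled it was marked.

```python
import heapq


class ToySimulator:
    """Hypothetical stand-in for a simulator run loop (not FoundationDB code)."""

    def __init__(self):
        self.now = 0.0
        self.queue = []  # entries are (time, sequence number, callback)
        self.seq = 0
        self.stopped = False

    def schedule(self, delay, callback):
        # The sequence number breaks ties so callbacks are never compared.
        heapq.heappush(self.queue, (self.now + delay, self.seq, callback))
        self.seq += 1

    def stop(self):
        self.stopped = True

    def run(self):
        # Once stopped, remaining callbacks are simply never dispatched.
        while self.queue and not self.stopped:
            self.now, _, callback = heapq.heappop(self.queue)
            callback()


def run_scenario(kill_at, ignore_duration):
    """Return the toy database state after the run loop exits."""
    sim = ToySimulator()
    db = {}

    def start_ignore():
        # Models setting the key, then clearing it after the duration elapses.
        db["healthyZoneKey"] = "ignoreSSFailures"
        sim.schedule(ignore_duration, lambda: db.pop("healthyZoneKey", None))

    sim.schedule(0.0, start_ignore)
    sim.schedule(kill_at, sim.stop)  # models SaveAndKill stopping the simulator
    sim.run()
    return db
```

With kill_at earlier than ignore_duration, the key is still present in the returned state; with kill_at later, the clear runs and the state is empty.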
This is typically not a problem, but in this specific restart test it caused the second part to fail in ConsistencyCheck right after the fdbservers were started. The reason is that one TSS was lagging behind its buddy SS, so after the restart it peeked the TLog and exited with reason PeekPoppedTLogData. The test failure comes from ConsistencyCheck, which checks the storage queue of the now-exited TSS and cannot find it, because DD refused to remove the TSS while the healthyZoneKey key was set.
One fix could be to make ignoreSSFailuresForDuration truly uncancellable in the MachineAttrition workload. Unfortunately this is hard, because the simulator's runLoop() would need to wait until the ignoreSSFailuresForDuration actor has finished.
Another idea is to fix this in part 2, by clearing the key there.
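A rough sketch of why clearing the key in part 2 would help, again as a toy Python model rather than real FoundationDB logic. The functions data_distribution_step, consistency_check, and run_part2 are hypothetical, and the rule that removal of failed servers is suppressed while healthyZoneKey is set is a simplification of the behavior described above.

```python
def data_distribution_step(db, servers):
    """Toy rule: failed servers are dropped only when healthyZoneKey is unset."""
    if "healthyZoneKey" in db:
        return servers  # removal is suppressed while the key is set
    return {name: s for name, s in servers.items() if s["alive"]}


def consistency_check(servers):
    """Toy check: every server that is still tracked must be reachable."""
    return all(s["alive"] for s in servers.values())


def run_part2(db, clear_key_first):
    """Model part 2 of the restart test, where the TSS has already exited."""
    servers = {"ss": {"alive": True}, "tss": {"alive": False}}
    if clear_key_first:
        # The proposed workaround: clear the stale key before anything else.
        db.pop("healthyZoneKey", None)
    servers = data_distribution_step(db, servers)
    return consistency_check(servers)
```

In this model, run_part2 with a stale key and clear_key_first=False fails the check because the exited TSS is still tracked, while clear_key_first=True lets the toy DD drop it and the check passes.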
https://github.com/apple/foundationdb/pull/9840 fixed this issue.