
SaveAndKill workload affects the clearing of healthyZoneKey key in MachineAttrition workload

Open jzhou77 opened this issue 2 years ago • 1 comments

In a restart test (-f ./foundationdb/tests/restarting/from_7.1.0/VersionVectorEnableRestart-1.toml -s 353535467 -b on, at commit 0d0c72cc8, built with clang), I found that SaveAndKill stops the simulator, which interrupts the MachineAttrition workload and prevents ignoreSSFailuresForDuration from finishing. So even though the invocation is marked as uncancellable (https://github.com/apple/foundationdb/blob/0d0c72cc86180037021898769b30e396e1e157a7/fdbserver/workloads/MachineAttrition.actor.cpp#L340-L342), the actor never actually runs to completion. As a result, the last transaction in the actor, which clears healthyZoneKey, is never executed.
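To illustrate the mechanism, here is a toy model in Python. It is not FoundationDB code: ToySimulator, run_part_one, the "healthyZone" key name, and its value are all hypothetical stand-ins. It only shows that a step scheduled for later never runs once the run loop is stopped, no matter how the caller marked it.

```python
import heapq


class ToySimulator:
    """Minimal discrete-event loop; a stand-in for a simulator run loop."""

    def __init__(self):
        self.now = 0.0
        self._queue = []   # entries are (fire_time, sequence, callback)
        self._seq = 0      # tie-breaker so callbacks are never compared
        self.stopped = False

    def schedule(self, delay, fn):
        heapq.heappush(self._queue, (self.now + delay, self._seq, fn))
        self._seq += 1

    def stop(self):
        # Models a SaveAndKill-style shutdown: the loop exits immediately.
        self.stopped = True

    def run(self):
        while self._queue and not self.stopped:
            self.now, _, fn = heapq.heappop(self._queue)
            fn()


def run_part_one(kill_at, ignore_duration):
    """Set a key, schedule its clearing, and stop the loop at kill_at.

    Returns the toy database as it would be saved for part 2.
    """
    sim = ToySimulator()
    db = {}

    def clear_key():
        # The "last transaction" of the toy actor.
        db.pop("healthyZone", None)

    def set_key():
        db["healthyZone"] = "placeholder-value"  # hypothetical value
        sim.schedule(ignore_duration, clear_key)

    sim.schedule(1.0, set_key)
    sim.schedule(kill_at, sim.stop)
    sim.run()
    return db


if __name__ == "__main__":
    print(run_part_one(kill_at=5.0, ignore_duration=10.0))   # key left behind
    print(run_part_one(kill_at=20.0, ignore_duration=10.0))  # key cleared
```

When the kill arrives before the ignore duration has elapsed, the saved state still contains the key, which is the situation described above.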

This is typically not a problem, but in this specific restart test it caused the second part to fail in ConsistencyCheck right after the fdbservers were started. The reason is that one TSS was lagging behind its buddy SS, so after the restart it peeked the TLog and exited with reason PeekPoppedTLogData. ConsistencyCheck then failed because it checks the storage queue of the now-exited TSS and could not find it: DD had refused to remove the TSS because healthyZoneKey was still set.

One fix would be to make ignoreSSFailuresForDuration truly uncancellable, even in the MachineAttrition workload. Unfortunately that is hard, because the simulator's runLoop() would need to wait until the ignoreSSFailuresForDuration actor has finished.

jzhou77 · Sep 10 '22 18:09

Another idea is to fix this in part 2 of the restart test by clearing the key there.
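A rough sketch of that idea, again as a toy Python model and not FoundationDB code. HEALTHY_ZONE_KEY, start_part_two, and dd_can_remove_failed_server are hypothetical names, and the removal check is simplified to "the key is absent".

```python
HEALTHY_ZONE_KEY = "healthyZone"  # hypothetical stand-in for healthyZoneKey


def start_part_two(saved_state):
    """Load the state saved by part 1 and drop any leftover healthy-zone key.

    Returns the cleaned database and whether a stale key was found.
    """
    db = dict(saved_state)  # copy so the saved snapshot is not mutated
    had_stale_key = db.pop(HEALTHY_ZONE_KEY, None) is not None
    return db, had_stale_key


def dd_can_remove_failed_server(db):
    # Simplified assumption: removal is blocked only while the key is set.
    return HEALTHY_ZONE_KEY not in db


if __name__ == "__main__":
    saved = {HEALTHY_ZONE_KEY: "placeholder-value", "other": "data"}
    print(dd_can_remove_failed_server(saved))  # blocked before cleanup
    db, stale = start_part_two(saved)
    print(stale, dd_can_remove_failed_server(db))
```

Clearing at the start of part 2 does not depend on how part 1 was shut down, so it sidesteps the run loop problem entirely.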

jzhou77 · Sep 12 '22 16:09

https://github.com/apple/foundationdb/pull/9840 fixed this issue.

jzhou77 · May 24 '23 04:05