foundationdb SaveAndKill workload affects the clearing of healthyZoneKey key in MachineAttrition workload

Open jzhou77 opened this issue 2 years ago • 1 comments

In a restart test (-f ./foundationdb/tests/restarting/from_7.1.0/VersionVectorEnableRestart-1.toml -s 353535467 -b on, at commit 0d0c72cc8, built with clang), I found that SaveAndKill stops the simulator, which interrupts the MachineAttrition workload and prevents ignoreSSFailuresForDuration from finishing. So even though the invocation is marked as uncancellable (https://github.com/apple/foundationdb/blob/0d0c72cc86180037021898769b30e396e1e157a7/fdbserver/workloads/MachineAttrition.actor.cpp#L340-L342), the actor does not actually run to completion. As a result, the last transaction in the actor, which clears the healthyZoneKey key, was never executed.
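The failure mode can be illustrated with a toy event loop. This is a minimal Python sketch, not FoundationDB or Flow code: ToySim, run_scenario, and the stored value are hypothetical names chosen for illustration, and only the key name healthyZoneKey comes from the issue. It assumes the actor sets the key, waits, and then clears it, and that a stop request ends the loop immediately.

```python
import heapq


class ToySim:
    """Toy discrete-event loop standing in for the simulator (hypothetical)."""

    def __init__(self):
        self.queue = []       # heap of (time, seq, callback)
        self.seq = 0          # tie-breaker so callbacks are never compared
        self.stopped = False

    def at(self, t, fn):
        heapq.heappush(self.queue, (t, self.seq, fn))
        self.seq += 1

    def stop(self):
        self.stopped = True

    def run(self):
        # The loop exits as soon as stop() has been called, even if
        # later events are still queued.
        while self.queue and not self.stopped:
            _, _, fn = heapq.heappop(self.queue)
            fn()


def run_scenario(kill_at):
    sim = ToySim()
    db = {}
    # Stand-in for ignoreSSFailuresForDuration: set the key, clear it later.
    # The value is a placeholder, not the real FDB constant.
    sim.at(1.0, lambda: db.__setitem__("healthyZoneKey", "placeholder"))
    sim.at(5.0, lambda: db.pop("healthyZoneKey", None))
    if kill_at is not None:
        sim.at(kill_at, sim.stop)   # stand-in for SaveAndKill
    sim.run()
    return db


stale = run_scenario(kill_at=3.0)   # stop lands between set and clear
clean = run_scenario(kill_at=None)  # no stop: the clear runs
```

In this model, stale still contains healthyZoneKey because the clearing event at t=5.0 is never dequeued, while clean ends up empty.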

This is typically not a problem, but in this specific restart it caused the second part to fail in ConsistencyCheck right after the fdbservers were started. The reason is that one TSS was lagging behind its buddy SS, so after the restart it peeked the TLog and exited with reason PeekPoppedTLogData. The test failure comes from ConsistencyCheck, which checked the storage queue of the now-exited TSS and couldn't find it, because DD had refused to remove the TSS while the healthyZoneKey key was set.

One fix would be to make ignoreSSFailuresForDuration in the MachineAttrition workload truly uncancellable. Unfortunately that is hard, because the simulator's runLoop() would need to wait until the ignoreSSFailuresForDuration actor has finished.
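As a rough illustration of what "the run loop waits for the actor" could mean, here is a toy Python sketch. It is not the simulator's runLoop(): DrainingSim and the must_finish flag are hypothetical, and the sketch assumes the loop can tell must-finish work apart from ordinary work, which is exactly the part that is hard in the real simulator.

```python
import heapq


class DrainingSim:
    """Toy loop that delays stopping until must-finish work is done."""

    def __init__(self):
        self.queue = []                # heap of (time, seq, fn, must_finish)
        self.seq = 0
        self.stop_requested = False
        self.pending_must_finish = 0   # queued must-finish callbacks

    def at(self, t, fn, must_finish=False):
        if must_finish:
            self.pending_must_finish += 1
        heapq.heappush(self.queue, (t, self.seq, fn, must_finish))
        self.seq += 1

    def stop(self):
        self.stop_requested = True

    def run(self):
        while self.queue:
            # Honor the stop only once no must-finish work remains.
            if self.stop_requested and self.pending_must_finish == 0:
                break
            _, _, fn, must_finish = heapq.heappop(self.queue)
            if self.stop_requested and not must_finish:
                continue               # drop ordinary work after a stop
            fn()
            if must_finish:
                self.pending_must_finish -= 1


db, log = {}, []
sim = DrainingSim()
sim.at(1.0, lambda: db.__setitem__("healthyZoneKey", "placeholder"))
sim.at(3.0, sim.stop)                                 # stand-in for SaveAndKill
sim.at(4.0, lambda: log.append("ordinary work"))      # skipped after the stop
sim.at(5.0, lambda: db.pop("healthyZoneKey", None), must_finish=True)
sim.run()
```

Here the clearing step still runs after the stop request, so db ends up empty, while the ordinary event at t=4.0 is dropped.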

Sep 10 '22 18:09 jzhou77

Another idea is to fix this in part 2 of the restart test, by clearing the key there.
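A toy Python sketch of that idea follows. The helper names and the removal rule are assumptions made for illustration, modelled only on the behavior described in this issue; they are not FoundationDB APIs, and the sketch does not describe how any actual fix was implemented.

```python
def clear_stale_healthy_zone(db, key="healthyZoneKey"):
    """Remove a leftover key at startup; return True if one was cleared."""
    return db.pop(key, None) is not None


def dd_can_remove(db, server_failed, key="healthyZoneKey"):
    # Toy rule: a failed server is only removed while the key is not set.
    return server_failed and key not in db


# State as restored by part 2 after part 1 was stopped before the clear.
# The value is a placeholder, not the real FDB constant.
restored = {"healthyZoneKey": "placeholder"}

blocked = dd_can_remove(restored, server_failed=True)    # key still set
cleared = clear_stale_healthy_zone(restored)             # part 2 cleanup
unblocked = dd_can_remove(restored, server_failed=True)  # key gone
```

In this model the exited TSS cannot be removed until part 2 clears the stale key, after which removal is allowed.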

Sep 12 '22 16:09 jzhou77

https://github.com/apple/foundationdb/pull/9840 fixed this issue.

May 24 '23 04:05 jzhou77
