quorum
Panic randomly occurs on node shutdown, leading to unclean shutdown
Expected behaviour
Panic should not happen on normal node shutdown.
Actual behaviour
panic: sync: WaitGroup is reused before previous Wait has returned
This panic occurs intermittently on node shutdown, leading to an unclean shutdown and data loss on the node.
I stop one of the non-validator nodes once a day to safely take a disk snapshot.
I have observed this panic message about once every month or two.
Steps to reproduce the behaviour
Launch a QBFT cluster and schedule a normal shutdown once a day.
Occasionally the message panic: sync: WaitGroup is reused before previous Wait has returned appears on node shutdown, causing data loss on the node.
This might be related to https://github.com/ethereum/go-ethereum/issues/27509, and applying https://github.com/ethereum/go-ethereum/pull/27665 might help alleviate this issue.
The panic does not seem to be completely random: there appear to be situations in which it becomes more likely, so simply repeating systemctl start and systemctl stop is not enough to reproduce it.
I suspect this is a race condition that requires enough dirty caches to be present before it can occur.
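For reference, Go raises this panic when a WaitGroup's Add is called while an earlier Wait is still in the process of returning. The sketch below shows one generic way to rule that out. It is not the code from the linked PR, and taskGroup, Go and Stop are hypothetical names used only for illustration. A mutex and a closed flag ensure that no new Add can happen once shutdown has begun.

```go
package main

import "sync"

// taskGroup is a hypothetical helper (not part of quorum or go-ethereum).
// It tracks background goroutines and refuses new ones once Stop has begun,
// so wg.Add can never run concurrently with wg.Wait.
type taskGroup struct {
	mu     sync.Mutex
	closed bool
	wg     sync.WaitGroup
}

// Go starts fn in a new goroutine unless shutdown has already begun.
// It reports whether the task was started.
func (g *taskGroup) Go(fn func()) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.closed {
		return false // shutting down: do not touch the WaitGroup
	}
	g.wg.Add(1) // Add happens under the lock, before closed can be set
	go func() {
		defer g.wg.Done()
		fn()
	}()
	return true
}

// Stop marks the group as closed and then waits for all started tasks.
// Every Add has completed before closed is set, so Wait is never raced.
func (g *taskGroup) Stop() {
	g.mu.Lock()
	g.closed = true
	g.mu.Unlock()
	g.wg.Wait()
}
```

With this pattern, a task submitted after Stop has begun is rejected rather than added to the WaitGroup. Whether the shutdown path in the node can be restructured this way is an assumption that would need checking against the actual code.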