go-ethereum
go-ethereum copied to clipboard
eth/catalyst: move block commit into its own go-routine to avoid deadlock
Closes https://github.com/ethereum/go-ethereum/issues/29475 . The provided test-case captures the issue and fails without the fix.
Looking into this lockup a bit, I was wondering why the sealBlock -> forkChoiceUpdated -> pool.Sync() was happening, and I found this comment
// If the beacon chain is ran by a simulator, then transaction insertion,
// block insertion and block production will happen without any timing
// delay between them. This will cause flaky simulator executions due to
// the transaction pool running its internal reset operation on a back-
// ground thread. To avoid the racey behavior - in simulator mode - the
// pool will be explicitly blocked on its reset before continuing to the
// block production below.
if simulatorMode {
if err := api.eth.TxPool().Sync(); err != nil {
log.Error("Failed to sync transaction pool", "err", err)
return valid(nil), engine.InvalidPayloadAttributes.With(err)
}
}
Seems to me that adding the spinoff goroutines in this PR is basically a hack around this particular thing that is intentionally put there. I may be wrong, haven't fully grokked the entire situation yet.
I'm not sure there's a good way around this. It is clear that we need the call to pool.Sync when calling fcu in simulator mode because we want the tests to be deterministic. However, this will inevitably trigger a NewTxsEvent when concurrently sending transactions in zero-period mode, causing the deadlock.
Afaict, the two options we have here are:
- institute this "hack" and keep the code footprint small. or
- Remove the call to
pool.Syncinfcuand change every test case which uses the simulated beacon to manually callpool.Syncafter inserting txs.
@jwasinger please look at this again
Spent the day looking into this and trying to come up with some solution that doesn't involve manually calling txpool Sync in calls to fcu for dev mode zero-period, and realized that it seems unworkable without making large changes elsewhere in the codebase.
Also, I looked at trying to create some way to intercept the NewTxsEvent into a separate channel than the one read in api.loop, and feeding this notification back through to the channel read in api.loop (in some magical non-blocking way which involves a separate go-routine). I couldn't come up with a solution that I was sure is deadlock-free and has no possibility of leaving executable transactions/withdrawals dangling without being included.
Right now, the solution in this PR is the best I can think of.
I would like to improve it further by moving the txpool.Sync invocation outside of fcu and within dev-mode specific commit logic. I think this would help improve the readability of fcu with no apparent downside.
There is a problem with the logic (that is also present in master): We only call Commit once when responding to newTxs. This could potentially leave executable txs dangling if they couldn't all be included in a block, and no new txs subsequently become executable.
I looked into fixing this by altering the commit logic to commit in a loop, until the miner returns an empty payload. However, this is nontrivial because at the point where we are building a payload, we have already read withdrawals from their channel, and we have to somehow re-include them.
Tbh, I'm not sure why the spamming test case included in this PR doesn't trigger this... Will try to brainstorm how to fix the issue more today.
Yeah sounds like something that should be fixed.
closing in favor of #30264