go-ethereum eth/catalyst: move block commit into its own go-routine to avoid deadlock

Closes https://github.com/ethereum/go-ethereum/issues/29475 . The provided test-case captures the issue and fails without the fix.

Apr 26 '24 00:04 jwasinger

Looking into this lockup a bit, I was wondering why the sealBlock -> forkChoiceUpdated -> pool.Sync() was happening, and I found this comment

		// If the beacon chain is ran by a simulator, then transaction insertion,
		// block insertion and block production will happen without any timing
		// delay between them. This will cause flaky simulator executions due to
		// the transaction pool running its internal reset operation on a back-
		// ground thread. To avoid the racey behavior - in simulator mode - the
		// pool will be explicitly blocked on its reset before continuing to the
		// block production below.
		if simulatorMode {
			if err := api.eth.TxPool().Sync(); err != nil {
				log.Error("Failed to sync transaction pool", "err", err)
				return valid(nil), engine.InvalidPayloadAttributes.With(err)
			}
		}

Seems to me that adding the spinoff goroutines in this PR is basically a hack around this particular thing that is intentionally put there. I may be wrong, haven't fully grokked the entire situation yet.

Apr 26 '24 11:04 holiman

I'm not sure there's a good way around this. It is clear that we need the call to pool.Sync when calling fcu in simulator mode because we want the tests to be deterministic. However, this will inevitably trigger a NewTxsEvent when concurrently sending transactions in zero-period mode, causing the deadlock.

Afaict, the two options we have here are:

institute this "hack" and keep the code footprint small. or
Remove the call to pool.Sync in fcu and change every test case which uses the simulated beacon to manually call pool.Sync after inserting txs.

May 28 '24 04:05 jwasinger

@jwasinger please look at this again

Jun 25 '24 13:06 fjl

Spent the day looking into this and trying to come up with some solution that doesn't involve manually calling txpool Sync in calls to fcu for dev mode zero-period, and realized that it seems unworkable without making large changes elsewhere in the codebase.

Also, I looked at trying to create some way to intercept the NewTxsEvent into a separate channel than the one read in api.loop, and feeding this notification back through to the channel read in api.loop (in some magical non-blocking way which involves a separate go-routine). I couldn't come up with a solution that I was sure is deadlock-free and has no possibility of leaving executable transactions/withdrawals dangling without being included.

Right now, the solution in this PR is the best I can think of.

I would like to improve it further by moving the txpool.Sync invocation outside of fcu and within dev-mode specific commit logic. I think this would help improve the readability of fcu with no apparent downside.

Jun 25 '24 23:06 jwasinger

There is a problem with the logic (that is also present in master): We only call Commit once when responding to newTxs. This could potentially leave executable txs dangling if they couldn't all be included in a block, and no new txs subsequently become executable.

I looked into fixing this by altering the commit logic to commit in a loop, until the miner returns an empty payload. However, this is nontrivial because at the point where we are building a payload, we have already read withdrawals from their channel, and we have to somehow re-include them.

Tbh, I'm not sure why the spamming test case included in this PR doesn't trigger this... Will try to brainstorm how to fix the issue more today.

Jun 28 '24 12:06 jwasinger

Yeah sounds like something that should be fixed.

Jun 28 '24 12:06 fjl

closing in favor of #30264

Aug 09 '24 09:08 jwasinger

go-ethereum go-ethereum copied to clipboard

eth/catalyst: move block commit into its own go-routine to avoid deadlock

go-ethereum
go-ethereum copied to clipboard