Refactor and optimise commit interrupt

Open manav2401 opened this issue 9 months ago • 0 comments

Description

Currently, bor uses an interrupt context to notify the miner and EVM modules to stop block building when we hit the 2s mark, which prevents delayed block announcements. While the check itself is trivial, we run it before every opcode, so it can consume a good chunk of CPU when processing transactions. An internal benchmark showed that it used around 40% of the CPU (out of the total CPU used in the block building loop) on a block with an increased (~100M) gas limit.

This is because we check context.Done() before running every opcode, a cost that adds up for a heavy block with many transactions (and far more opcodes). Instead, this PR uses a simple global atomic flag that is set when the block building time is up. The EVM interpreter and the miner loop read this flag to decide whether to interrupt or continue.

This PR also refactors how the commit interrupt is handled in the worker-EVM interaction and simplifies it considerably.

Here's a minimal benchmark of both checks.

Context done approach:

// Needs imports: "context", "testing", "time".
func BenchmarkCtxDone(b *testing.B) {
	// The one-hour timeout never fires during the benchmark.
	ctx, cancel := context.WithTimeout(context.Background(), time.Hour)
	defer cancel()

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		// Non-blocking check of ctx.Done()
		select {
		case <-ctx.Done():
			// (won't happen in this benchmark)
		default:
		}
	}
}

Results:

cpu: VirtualApple @ 2.50GHz
BenchmarkCtxDone-8      100000000               10.34 ns/op            0 B/op          0 allocs/op

Atomic flag approach:

// Needs imports: "sync/atomic", "testing", "time".
func BenchmarkAtomic(b *testing.B) {
	var done atomic.Bool

	// Set the flag after one hour, which never happens during the benchmark.
	// Stopping the timer on return avoids leaving a sleeping goroutine behind.
	timer := time.AfterFunc(time.Hour, func() { done.Store(true) })
	defer timer.Stop()

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if done.Load() {
			b.Fatal("flag set unexpectedly")
		}
	}
}

Results:

cpu: VirtualApple @ 2.50GHz
BenchmarkAtomic-8       1000000000               0.4130 ns/op          0 B/op          0 allocs/op

Each operation takes 0.4130 ns with this approach versus 10.34 ns with the previous one, roughly a 25x reduction.

Changes

[x] Bugfix (non-breaking change that solves an issue)
[ ] Hotfix (change that solves an urgent issue, and requires immediate attention)
[ ] New feature (non-breaking change that adds functionality)
[ ] Breaking change (change that is not backwards-compatible and/or changes current functionality)
[ ] Changes only for a subset of nodes

Checklist

[x] I have added at least 2 reviewers or the whole pos-v1 team
[x] I have added sufficient documentation in code
[x] I will be resolving comments - if any - by pushing each fix in a separate commit and linking the commit hash in the comment reply
[ ] Created a task in Jira and informed the team for implementation in Erigon client (if applicable)
[ ] Includes RPC methods changes, and the Notion documentation has been updated

Testing

[ ] I have added unit tests
[ ] I have added tests to CI
[x] I have tested this code manually on local environment
[x] I have tested this code manually on remote devnet using express-cli
[ ] I have tested this code manually on amoy
[ ] I have created new e2e tests into express-cli

Manual tests

Tested it on shadow fork.

Jun 10 '25 13:06 manav2401