
Flaky optimism p2p::can_sync test

Open Rjected opened this issue 1 year ago • 6 comments

A recent merge queue run failed in a way that indicates we have something flaky in p2p::can_sync: https://github.com/paradigmxyz/reth/actions/runs/9007839847/job/24748566483

The important logs:

2024-05-08T20:09:21.442257Z ERROR reth_node_builder::launch::common: Failed to build global thread pool: ThreadPoolBuildError { kind: GlobalPoolAlreadyInitialized }
2024-05-08T20:09:26.441518Z ERROR reth_node_builder::launch::common: Failed to build global thread pool: ThreadPoolBuildError { kind: GlobalPoolAlreadyInitialized }
2024-05-08T20:09:44.902706Z ERROR blockchain_tree: Reverting canonical chain failed with error: OptimisticTargetRevert(88)

We might need something lazy for the thread pool, or it might have to do with launching multiple nodes? I have not debugged this in depth.

The error is thrown here: https://github.com/paradigmxyz/reth/blob/d46774411fee0802f4390e4f04b7184cdcdb3ea2/crates/node/builder/src/launch/common.rs#L110-L126
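For reference, a minimal sketch of what "something lazy" could look like, using only the standard library. This is not reth's or rayon's code: `PoolConfig`, `GLOBAL_POOL`, and `global_pool` are hypothetical names, and the config struct only stands in for a real pool. The idea is that the first caller initializes the global and every later caller reuses it instead of hitting an "already initialized" error.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for a process-wide thread pool configuration.
#[derive(Debug, PartialEq)]
pub struct PoolConfig {
    pub num_threads: usize,
}

/// Process-wide slot that can be filled at most once.
static GLOBAL_POOL: OnceLock<PoolConfig> = OnceLock::new();

/// Returns the global configuration, initializing it on first use only.
/// Later callers get the value set by the first caller; their argument is ignored.
pub fn global_pool(num_threads: usize) -> &'static PoolConfig {
    GLOBAL_POOL.get_or_init(|| PoolConfig { num_threads })
}
```

With this shape, launching a second node in the same process would silently reuse the first node's pool settings, which may or may not be what the test wants.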

Rjected avatar May 08 '24 21:05 Rjected

we can simply remove the error log here

mattsse avatar May 08 '24 22:05 mattsse

> we can simply remove the error log here

The test is timing out though, so we need to do something else to fix the test probably

Rjected avatar May 08 '24 22:05 Rjected

This is because you can't have more than one global thread pool per process. I ran into this with one of the benchmarks we hadn't kept up to date, too. Not entirely sure what the best way to circumvent this is.

Edit: Actually this is a bit different, but might be a similar error. I ran into this with Tokio (the benchmark tool spawned a tokio runtime, after which the function we called did the same)

onbjerg avatar May 08 '24 22:05 onbjerg

Actually the flake might be more related to

2024-05-08T20:09:44.902706Z ERROR blockchain_tree: Reverting canonical chain failed with error: OptimisticTargetRevert(88)

But I still need to repro

Rjected avatar May 08 '24 22:05 Rjected

Yeah, reading the `ThreadPoolBuilder` docs, `build_global` returns an error, but the docs do not mention anything about it being fatal or stalling, so it should be OK: https://docs.rs/rayon/latest/rayon/struct.ThreadPoolBuilder.html#method.build_global
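To illustrate the "error but not fatal" behaviour, here is a rough std-only model. It does not call rayon: `model_build_global`, `InitError`, and `init_pool_lenient` are hypothetical names that only mimic the shape of a once-only global initializer. The second call gets an error back, and the caller chooses to keep going with the existing pool.

```rust
use std::sync::OnceLock;

/// Hypothetical global holding the thread count of the "pool".
static POOL_THREADS: OnceLock<usize> = OnceLock::new();

/// Hypothetical error mirroring an "already initialized" failure.
#[derive(Debug, PartialEq)]
pub enum InitError {
    AlreadyInitialized,
}

/// Model of a once-only global initializer: succeeds the first time,
/// returns an error on every later call without touching the stored value.
pub fn model_build_global(num_threads: usize) -> Result<(), InitError> {
    POOL_THREADS
        .set(num_threads)
        .map_err(|_| InitError::AlreadyInitialized)
}

/// Treats "already initialized" as non-fatal.
/// Returns true if this call performed the initialization, false if a pool already existed.
pub fn init_pool_lenient(num_threads: usize) -> bool {
    match model_build_global(num_threads) {
        Ok(()) => true,
        // The existing pool stays in place; a caller could log this at debug level.
        Err(InitError::AlreadyInitialized) => false,
    }
}
```

Under this model the repeated error log is just noise, which is consistent with the timeout having a different cause.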

onbjerg avatar May 08 '24 22:05 onbjerg

That specific error log is expected and part of the e2e test @Rjected ~~so I'm thinking it's really about the long duration of the test (20s if I recall)~~ unsure now why it's timing out

joshieDo avatar May 09 '24 00:05 joshieDo

@joshieDo do you mind taking a look at this? I haven't run into this in a while though, so it may just be difficult to repro

Rjected avatar May 22 '24 21:05 Rjected

This issue is stale because it has been open for 21 days with no activity.

github-actions[bot] avatar Jun 13 '24 01:06 github-actions[bot]

I believe it's not an issue anymore, please re-open if it occurs again.

shekhirin avatar Jun 13 '24 12:06 shekhirin

Still an issue https://github.com/paradigmxyz/reth/actions/runs/9506077502/job/26202446729?pr=8808

Rjected avatar Jun 13 '24 20:06 Rjected