Possible deadlock introduced in `anvil >= 1.1.0`

Open MartinquaXD opened this issue 7 months ago • 29 comments

Component

Anvil

Have you ensured that all of these are up to date?

  • [x] Foundry
  • [ ] Foundryup

What version of Foundry are you on?

No response

What version of Foundryup are you on?

No response

What command(s) is the bug in?

No response

Operating System

None

Describe the bug

We are using anvil for e2e tests. Most of the tests use an empty testnet (i.e. default anvil) but some rely on a forked network (i.e. --fork-url <URL>). Our forked e2e tests started failing with v1.1.0 but worked fine with v1.0.0. The tests using empty testnets don't seem to be affected by the switch to the new version.

Unfortunately the errors always happen in CI but so far never when I ran the tests locally, which makes debugging this more difficult. From what I can tell it seems like anvil can get stuck (perhaps a deadlock?) after receiving an evm_mine request (presumably only for forked setups). This also looks like a race condition: when I ran anvil with RUST_LOG=debug, which introduces some overhead due to the extra logging, the issue happened less often. Please see this PR for some more context on the investigation.
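
For reference, a minimal sketch of starting such a forked node with debug logging; the fork URL is a placeholder rather than our actual endpoint:

```sh
# Minimal launch sketch: forked anvil with verbose logging.
# $FORK_RPC_URL is a placeholder, not the endpoint used in CI.
RUST_LOG=debug anvil --port 8545 --fork-url "$FORK_RPC_URL"
```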

Given that this is probably a PITA to debug because reproducing it locally hasn't been possible so far, I could use our CI to run a git bisect and hopefully find the culprit. Please let me know if this is necessary, since it will be quite time consuming.

MartinquaXD avatar May 07 '25 09:05 MartinquaXD

hmm,

could you perhaps try to repro this with https://github.com/foundry-rs/foundry/commit/dc69097e7d355c991996068bef861354dc72cc55

and if unable to repro maybe with https://github.com/foundry-rs/foundry/commit/f474801c4ead0542f7d7bf6f6499fb3ebac3c6fe

which could be the culprit

mattsse avatar May 07 '25 09:05 mattsse

@MartinquaXD the releases to pin in CI for the commits above are nightly-54af8693b8fcc7a3f65136c1188e89661155955d; if this works OK, please pin to nightly-23191fbeccfcf901f7c28590cb962d9693373c21 and see if it's reproducible. thank you!
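
For reference, a hedged sketch of pinning one of these builds locally or in a CI step (the exact foundryup flag depends on the installed foundryup version; recent releases use -i/--install, older scripts used -v/--version):

```sh
# Install a specific pinned nightly build (flag name depends on the foundryup
# version; assumed here to be the newer -i/--install form).
foundryup --install nightly-54af8693b8fcc7a3f65136c1188e89661155955d
anvil --version   # verify the pinned build is the one on PATH
```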

grandizzy avatar May 07 '25 10:05 grandizzy

nightly-54af8693b8fcc7a3f65136c1188e89661155955d and nightly-23191fbeccfcf901f7c28590cb962d9693373c21 both did not reproduce the error.

MartinquaXD avatar May 07 '25 11:05 MartinquaXD

@MartinquaXD thank you, could you also please check nightly-f0e24fb6cfc9018d986d3136ab00aa5077994994. thank you!

grandizzy avatar May 07 '25 11:05 grandizzy

nightly-f0e24fb6cfc9018d986d3136ab00aa5077994994 reproduced the issue, although it's not the culprit since the release right before (nightly-bfb1cb6b118b7e07a1de0fdea64a6cf44eb3a174) already had the problem.

MartinquaXD avatar May 07 '25 11:05 MartinquaXD

I see, would you be able to bisect / try different nightly versions to figure out which one is the culprit, or can you share a way for me to do so? thank you!
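
For reference, a hedged sketch of what such a manual bisect over nightlies could look like; the CANDIDATE_TAGS list and the test command are placeholders for the actual nightly tags and the project's own e2e runner, and each build is exercised several times because the failure is flaky:

```sh
# CANDIDATE_TAGS is a placeholder for the nightly-<sha> tags between the last
# known-good and first known-bad releases.
for tag in $CANDIDATE_TAGS; do
  foundryup --install "$tag"   # flag hedged as above (older foundryup used -v/--version)
  ok=true
  for i in $(seq 1 10); do     # repeat because the failure is flaky
    # Placeholder for the project's own forked e2e test invocation.
    cargo test -p e2e forked_node -- --nocapture || { ok=false; break; }
  done
  $ok && echo "$tag: good" || echo "$tag: bad"
done
```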

grandizzy avatar May 07 '25 12:05 grandizzy

Not sure if this makes sense but the first release that consistently fails is this. It only bumps a minor version of a dependency, though. The 2 versions with actual changes before that both work consistently (03-12, 03-14).

Seems like the change in that PR only targets forge, which I suspect is completely separate from anvil, so perhaps it also bumps some shared sub-dependencies in the process and those are the root cause here. 🤷‍♂

MartinquaXD avatar May 07 '25 13:05 MartinquaXD

@MartinquaXD if not asking too much, could you please also check the builds between the one you identified (03-19) and 03-14? Even if they don't show any changes in the release notes, there could be chores labeled to be excluded from the release notes. thanks!

grandizzy avatar May 07 '25 14:05 grandizzy

@MartinquaXD one more thing, do you happen to use wss rpc urls or https only?

grandizzy avatar May 07 '25 14:05 grandizzy

Yeah, I can check the releases in between, too.

do you happen to use wss rpc urls or https only?

Only http.

MartinquaXD avatar May 07 '25 14:05 MartinquaXD

@MartinquaXD does this happen with a specific provider / chain, or is that not a factor here? Asking because it could be a TLS-related issue with a new dependency. Thanks!

grandizzy avatar May 07 '25 14:05 grandizzy

We are using self-hosted nodes on OVH as the --fork-url. For mainnet we are using reth and for Gnosis Chain erigon, in case that matters.

MartinquaXD avatar May 07 '25 14:05 MartinquaXD

Still looks like 03-19 is the culprit. All the 03-18 builds are working.

MartinquaXD avatar May 07 '25 15:05 MartinquaXD

@MartinquaXD one more check please: if you use the latest nightly you see the same failure, correct?

grandizzy avatar May 07 '25 15:05 grandizzy

Can you share the entire anvil command? Wondering if there are other args besides the fork url.

grandizzy avatar May 07 '25 15:05 grandizzy

Noticed you start with a block time of 12 and also use some anvil utilities here (https://github.com/cowprotocol/services/blob/e85dbfc7f4ea4e6504da5033b21a0b4b64af4188/crates/e2e/src/nodes/local_node.rs#L30). Can you tell from the logs whether any snapshot or revert (or any other anvil call) happens just before the issue? It would help a lot to compare the output of a failed run and a good one; this is most likely a race condition from such a call sequence...
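
One way to check this from the CI output, assuming anvil's stdout/stderr is captured to a file (the file name and method list below are guesses, not taken from the failing run):

```sh
# Show the last RPC methods anvil logged before the hang. anvil.log is a
# placeholder for wherever the CI job captures anvil's output.
grep -E 'evm_mine|evm_snapshot|evm_revert|anvil_|eth_call' anvil.log | tail -n 50
```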

grandizzy avatar May 07 '25 15:05 grandizzy

The latest nightly still has the issue. This is a run which has the issue. To see the logs from anvil, search for `line=`. This is a successful run with v1.0.0.

The args for anvil are: --port 8545 --fork-url <SECRET_URL> --fork-block-number 20332745. The last RPC request anvil tries to serve is always evm_mine, coming from this function. It calls evm_mine in a loop until some condition is met. Currently it waits 200ms between mining new blocks. When I increase the timeout to 1000ms the test passes successfully. When I built a test that only mines blocks in a loop without any timeouts, it worked fine locally.

So I suspect the issue happens when anvil is looking up some state to respond to other requests while a new evm_mine request comes in. The last thing I tried was logging all the requests we send and the responses we get. This caused the tests to pass more consistently again, but after 7 iterations I got this run which reproduced the error while logging all requests.

Looks like the request anvil handled during the evm_mine that got stuck was a batch request that fetched symbol(), decimals() and balanceOf(9008d19f58aabd9ed0d60971565aa8510560ab41) on the GROK token.
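
For reference, a rough sketch of that request pattern outside the test suite (the token address is a placeholder since the GROK address isn't included above, the data fields are the standard ERC-20 selectors, and the holder address is the one from the batch request); no claim that this alone reproduces the hang:

```sh
ANVIL=http://localhost:8545
TOKEN=0x0000000000000000000000000000000000000000   # placeholder for the GROK token address

# Mining loop, mirroring the e2e helper's ~200ms polling interval.
while true; do
  cast rpc --rpc-url "$ANVIL" evm_mine > /dev/null
  sleep 0.2
done &

# Concurrent batched read-only calls (symbol/decimals/balanceOf) against the fork.
while true; do
  curl -s "$ANVIL" -H 'Content-Type: application/json' --data '[
    {"jsonrpc":"2.0","id":1,"method":"eth_call","params":[{"to":"'"$TOKEN"'","data":"0x95d89b41"},"latest"]},
    {"jsonrpc":"2.0","id":2,"method":"eth_call","params":[{"to":"'"$TOKEN"'","data":"0x313ce567"},"latest"]},
    {"jsonrpc":"2.0","id":3,"method":"eth_call","params":[{"to":"'"$TOKEN"'","data":"0x70a082310000000000000000000000009008d19f58aabd9ed0d60971565aa8510560ab41"},"latest"]}
  ]' > /dev/null
done
```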

MartinquaXD avatar May 08 '25 20:05 MartinquaXD

oh I have an idea what this could be

mattsse avatar May 08 '25 20:05 mattsse

I can't confirm if #10471 fixes this behaviour, but doing blocking work in request handling is always a bit iffy, and forking mode makes this worse of course, since the evm is still sync.

I remember something similar where opening MetaMask would suddenly freeze the node, which was because MM sent a bunch of chainId requests that triggered something similar.

mattsse avatar May 08 '25 20:05 mattsse

I can't confirm if https://github.com/foundry-rs/foundry/pull/10471 fixes this behaviour but ...

Will rerun the tests when the fix hits nightly and will let you know. Thanks for jumping on this so quickly BTW.

MartinquaXD avatar May 08 '25 20:05 MartinquaXD

@MartinquaXD the nightly build with the potential fix is published, please retry.

grandizzy avatar May 09 '25 07:05 grandizzy

The fix looks good. All our forked e2e tests ran 10 times without any errors. 🥳 Will this fix stay in nightly only, or will there be a v1.1.1 so we can switch back to the stable toolchain?

MartinquaXD avatar May 09 '25 08:05 MartinquaXD

we plan to cut a v1.2.0 release candidate on Monday; we're discussing whether we should do a v1.1.1 too

grandizzy avatar May 09 '25 08:05 grandizzy

fixed with #10471

grandizzy avatar May 09 '25 08:05 grandizzy

@MartinquaXD be aware that the latest nightlies are now on an updated revm version and, while anvil logic was not changed, there's a possibility of regressions due to the migration. Please let us know should you see any, thank you! Cc @zerosnacks

grandizzy avatar May 22 '25 13:05 grandizzy

Hi @MartinquaXD, I noticed you reverted to pinning to v1.0.0 here: https://github.com/cowprotocol/services/blob/f364eb1c187e5b4227e02273a627a12c79dc1548/.github/workflows/pull-request.yaml#L144-L146 - are you still experiencing issues?

zerosnacks avatar Jun 13 '25 08:06 zerosnacks

Yes, unfortunately we do. I wanted to git bisect the issue again but so far haven't gotten around to it. It definitely seems odd, since I ran the test suite repeatedly with the nightly that introduced the fix, yet with v1.2 we still see these flaky tests.

MartinquaXD avatar Jun 13 '25 08:06 MartinquaXD

@MartinquaXD Hmm, that is definitely odd. Are you experiencing the exact same issue or a different one? Just want to make sure this ticket's description is still accurate. When time permits, any additional pointers / logs / links to failed actions would be great.

zerosnacks avatar Jun 13 '25 09:06 zerosnacks

@MartinquaXD Can you try adding --no-storage-caching? I've been experiencing a similar issue (also e2e tests), and adding --no-storage-caching works for me. Obviously this is not ideal, but at least it would give the foundry team more pointers :)
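
For anyone hitting the same thing, a minimal sketch of that workaround (the fork URL is a placeholder; the block number is the one mentioned earlier in the thread):

```sh
# Workaround: disable anvil's on-disk caching of fetched fork state.
# $FORK_RPC_URL is a placeholder; the block number is the one from the report above.
anvil --port 8545 --fork-url "$FORK_RPC_URL" --fork-block-number 20332745 --no-storage-caching
```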

onbjerg avatar Jun 13 '25 11:06 onbjerg

@MartinquaXD could you please help with checking the issue with the latest nightly? We merged a couple of improvements that we expect to solve the issue. thank you!

grandizzy avatar Jul 07 '25 10:07 grandizzy