Possible deadlock introduced in `anvil >= 1.1.0`
Component
Anvil
Have you ensured that all of these are up to date?
- [x] Foundry
- [ ] Foundryup
What version of Foundry are you on?
No response
What version of Foundryup are you on?
No response
What command(s) is the bug in?
No response
Operating System
None
Describe the bug
We are using anvil for e2e tests. Most of the tests use an empty testnet (i.e. default anvil) but some rely on a forked network (i.e. --fork-url <URL>). Our forked e2e tests started failing with v1.1.0 but worked fine with v1.0.0. The tests using empty testnets don't seem to be affected by the switch to the new version.
Unfortunately the errors always happen in CI but so far never when I run the tests locally, which makes debugging this more difficult. From what I can tell, anvil can get stuck (perhaps a deadlock?) after receiving an evm_mine request (presumably only for forked setups).
This also looks like a race condition: when I ran anvil with RUST_LOG=debug, which introduces some overhead due to the excessive logging, the issue happened less often.
Please see this PR for some more context on the investigation.
Given that this is probably a PITA to debug (reproducing it locally hasn't been possible so far), I could use our CI to do a git bisect and hopefully find the culprit. Please let me know if this is necessary, since it will be quite time consuming.
Hmm, could you perhaps try to repro this with https://github.com/foundry-rs/foundry/commit/dc69097e7d355c991996068bef861354dc72cc55 and, if you can't repro it there, with https://github.com/foundry-rs/foundry/commit/f474801c4ead0542f7d7bf6f6499fb3ebac3c6fe, which could be the culprit?
@MartinquaXD the releases to pin in CI for the commits above are nightly-54af8693b8fcc7a3f65136c1188e89661155955d and nightly-23191fbeccfcf901f7c28590cb962d9693373c21: if the first one works OK, please pin to the second and see if the issue is reproducible. Thank you!
Neither nightly-54af8693b8fcc7a3f65136c1188e89661155955d nor nightly-23191fbeccfcf901f7c28590cb962d9693373c21 reproduced the error.
@MartinquaXD thank you, could you also please check nightly-f0e24fb6cfc9018d986d3136ab00aa5077994994. thank you!
nightly-f0e24fb6cfc9018d986d3136ab00aa5077994994 reproduced the issue, although it's not the culprit since the release right before (nightly-bfb1cb6b118b7e07a1de0fdea64a6cf44eb3a174) already had the problem.
I see. Would you be able to bisect / try different nightly versions to figure out which one is the culprit, or can you share a way for me to run such a check myself? Thank you!
Not sure if this makes sense, but the first release that consistently fails is this one. It only bumps a minor version of a dependency, though. The 2 versions with actual changes before that both work consistently (03-12, 03-14).
Seems like the change in that PR only targets forge, which I suspect is completely separate from anvil, so perhaps it also bumps some shared sub-dependencies in the process and those are the root cause here. 🤷‍♂️
@MartinquaXD if it's not asking too much, could you please also check the builds between the one you identified (03-19) and 03-14? Even if they don't show any changes in the release notes, there could be chores labeled to be excluded from the release notes. Thanks!
@MartinquaXD one more thing, do you happen to use wss rpc urls or https only?
Yeah, can check the releases between, too.
> do you happen to use wss rpc urls or https only?
Only http.
@MartinquaXD does this happen with a specific provider / chain, or is that not a factor here? Asking because it could be a TLS-related issue with a new dependency. Thanks!
We are using self-hosted nodes on OVH as the --fork-url. For mainnet we are using reth and for Gnosis Chain erigon, in case that matters.
Still looks like 03-19 is the culprit. All the 03-18 builds are working.
@MartinquaXD one more check please: if you use the latest nightly, you see the same failure, correct?
Can you share the entire anvil command? Wondering if there are any other args besides the fork URL.
Noticed you start with a block time of 12 and also use some anvil utilities here: https://github.com/cowprotocol/services/blob/e85dbfc7f4ea4e6504da5033b21a0b4b64af4188/crates/e2e/src/nodes/local_node.rs#L30. Can you tell from the logs whether there's any snapshot or revert (or any other anvil call) happening just before the issue? It would help a lot to compare the output of a failed run and a good one; most likely this is a race condition from such a call sequence...
The latest nightly still has the issue.
This is a run that has the issue. To see the logs from anvil, search for `line=`.
This is a successful run with v1.0.0.
The args for anvil are: `--port 8545 --fork-url <SECRET_URL> --fork-block-number 20332745`
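For completeness, the test harness spawns anvil roughly like this (a simplified sketch of the invocation above, not the actual local_node.rs code):

```rust
// Rough sketch of spawning the forked anvil instance with the args above;
// the helper name and structure are illustrative, not the real e2e harness.
use std::process::{Child, Command};

fn spawn_forked_anvil(fork_url: &str) -> std::io::Result<Child> {
    Command::new("anvil")
        .args([
            "--port", "8545",
            "--fork-url", fork_url, // <SECRET_URL> in CI
            "--fork-block-number", "20332745",
        ])
        .spawn()
}
```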
The last RPC request anvil tries to serve is always evm_mine, coming from this function. It calls evm_mine in a loop until some condition is met. Currently it waits 200ms between mining new blocks. When I increase that wait to 1000ms the test passes successfully.
When I built a test that only mines blocks in a loop without any delays, it worked fine locally.
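For reference, the loop is conceptually something like the following sketch (the RPC plumbing and the stop condition are placeholders, not our actual helper):

```rust
// Simplified sketch of the mining loop described above: call evm_mine,
// check a condition, wait, repeat. Names and the condition are placeholders.
use std::time::Duration;

async fn mine_until_done(rpc_url: &str) -> Result<(), reqwest::Error> {
    let client = reqwest::Client::new();
    loop {
        // Ask anvil to mine one block.
        client
            .post(rpc_url)
            .json(&serde_json::json!({
                "jsonrpc": "2.0",
                "id": 1,
                "method": "evm_mine",
                "params": []
            }))
            .send()
            .await?;

        if condition_is_met().await {
            return Ok(());
        }

        // The 200ms wait mentioned above; raising this to 1000ms makes the
        // test pass.
        tokio::time::sleep(Duration::from_millis(200)).await;
    }
}

// Placeholder for whatever the real test waits for.
async fn condition_is_met() -> bool {
    false
}
```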
So I suspect the issue happens when anvil is looking up some state to respond to other requests while a new evm_mine request comes in. The last thing I tried was logging all the requests we send and the responses we get. This made the tests pass more consistently again, but after 7 iterations I got this run, which reproduced the error while logging all requests.
Looks like the request anvil handled during the evm_mine that got stuck was a batch request that fetched symbol(), decimals() and balanceOf(9008d19f58aabd9ed0d60971565aa8510560ab41) on the GROK token.
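For context, such a batch is just several eth_call entries sent in one JSON-RPC payload. A rough reconstruction (illustrative only; the token address is a placeholder and the selectors are the standard ERC-20 ones):

```rust
// Illustrative reconstruction of the batch described above, not the exact payload:
// symbol() = 0x95d89b41, decimals() = 0x313ce567, balanceOf(address) = 0x70a08231.
// <GROK_TOKEN_ADDRESS> is a placeholder.
fn erc20_batch() -> serde_json::Value {
    let token = "<GROK_TOKEN_ADDRESS>";
    let holder = "9008d19f58aabd9ed0d60971565aa8510560ab41";
    serde_json::json!([
        { "jsonrpc": "2.0", "id": 1, "method": "eth_call",
          "params": [{ "to": token, "data": "0x95d89b41" }, "latest"] },
        { "jsonrpc": "2.0", "id": 2, "method": "eth_call",
          "params": [{ "to": token, "data": "0x313ce567" }, "latest"] },
        { "jsonrpc": "2.0", "id": 3, "method": "eth_call",
          "params": [{ "to": token, "data": format!("0x70a08231{:0>64}", holder) }, "latest"] },
    ])
}
```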
oh I have an idea what this could be
I can't confirm whether #10471 fixes this behaviour, but doing blocking work in request handling is always a bit iffy. Forking mode makes this worse, of course; after all, the EVM is still sync.
I remember something similar where opening MetaMask would suddenly freeze the node, which was because MetaMask sent a bunch of chain id requests that triggered something similar.
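To illustrate the general hazard (a toy sketch of the pattern, not anvil's actual handler code): if a handler does blocking work directly on the async executor, other requests queued on the same worker can't make progress until it finishes.

```rust
// Toy illustration of blocking work inside an async request handler starving
// a concurrent request on a single-threaded executor. Not anvil's code.
use std::time::Duration;

#[tokio::main(flavor = "current_thread")]
async fn main() {
    // Simulates a handler that blocks the worker thread, e.g. a synchronous
    // state lookup against the forked remote node.
    let blocking_handler = tokio::spawn(async {
        std::thread::sleep(Duration::from_secs(2)); // blocks the whole worker
        println!("blocking handler done");
    });

    // Simulates an evm_mine request arriving while the handler above runs:
    // it cannot be processed until the blocking call returns.
    let mine_request = tokio::spawn(async {
        println!("evm_mine handled");
    });

    let _ = tokio::join!(blocking_handler, mine_request);
}
```

Moving the blocking part to `tokio::task::spawn_blocking` (or making the lookup async) avoids starving the other tasks.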
> I can't confirm if https://github.com/foundry-rs/foundry/pull/10471 fixes this behaviour but ...
Will rerun the tests when the fix hits nightly and will let you know. Thanks for jumping on this so quickly BTW.
@MartinquaXD the nightly build with the potential fix is published, please retry.
The fix looks good. All our forked e2e tests ran 10 times without any errors. 🥳
Will this fix stay in nightly only, or will there be a v1.1.1 so we can switch back to the stable toolchain?
We plan to cut the v1.2.0 release candidate on Monday; we're discussing whether we should do a v1.1.1 too.
fixed with #10471
@MartinquaXD be aware that the latest nightlies are now on an updated revm version and, while the anvil logic was not changed, there's a possibility of regressions due to the migration. Please let us know should you see any, thank you! Cc @zerosnacks
Hi @MartinquaXD, I noticed you reverted to pinning to v1.0.0 here: https://github.com/cowprotocol/services/blob/f364eb1c187e5b4227e02273a627a12c79dc1548/.github/workflows/pull-request.yaml#L144-L146 - are you still experiencing issues?
Yes, unfortunately we are. I wanted to git bisect the issue again but so far haven't gotten around to it.
It definitely seems odd, since I ran the test suite repeatedly with the nightly build that introduced the fix, but with v1.2 we still see these flaky tests.
@MartinquaXD Hmm, that is definitely odd. Are you experiencing the exact same issue or a different one? Just want to make sure this ticket description is still accurate. When time permits, any additional pointers / logs / links to failed actions would be great.
@MartinquaXD Can you try adding --no-storage-caching? I've been experiencing a similar issue (also e2e tests), and adding --no-storage-caching works for me. Obviously this is not ideal, but at least it would give the foundry team more pointers :)
@MartinquaXD could you please help by checking the issue with the latest nightly? We merged a couple of improvements that we expect to solve it. Thank you!