go-ethereum icon indicating copy to clipboard operation
go-ethereum copied to clipboard

Lag between tx inclusion and accessible state over RPC

Open lightclient opened this issue 7 months ago • 8 comments

I'm sometimes seeing a lag in updating of the state as of geth v1.15.8. I am sending a transaction that is validated and included in a block, I wait for its receipt and retrieve it, at this point I can get the block at latest and it returns to me the block with my transaction in it. But if I make these calls fast enough and add a call to eth_call or eth_getCode with the latest tag, it does not return to me the same state. If I send a successfully delegated 7702 transaction for example, and I wait for the receipt and then check eth_getCode, I will sometimes get 0x as my code. If I put this call on a quick loop it returns the correct code quickly thereafter.

If I change the block identifier in the eth_call or eth_getCode to be the blockhash from the transaction receipt, it then works every time. But latest should be the same state as that block.

I went back a few versions and noticed this change happened between versions v1.15.7 (no issues) and v1.15.8.

from @fselmo

--

I'm trying to repro (branch here), but no success yet.

lightclient avatar May 09 '25 22:05 lightclient

For context, this is the test that picked this up.

This is the PR that fixes this issue for us, using the blockNumber from the receipt instead of "latest".

fselmo avatar May 09 '25 22:05 fselmo

I played around with a script for a while and could not reproduce either until I passed in even just an empty datadir - then I was able to get a reproducible script going. The datadir seems to be an important piece. The script starts up geth each run on a new port and so can be run concurrently to see failures more easily.

I silenced any other errors other than the eth_getCode errors for now but they also seem possibly related to state. Some are insufficient funds for the acct which lead me to believe that the funding tx is going through a similar state lag before we send the 7702 tx.

import asyncio
import os
import re
import select
import subprocess
import tempfile
import time

from web3 import (
    AsyncWeb3,
    WebSocketProvider,
)


GETH_BIN = os.getenv("GETH_BINARY")

GREEN_DOT = "\033[32m.\033[0m"  # success
RED_X = "\033[31mx\033[0m"  # expected failure case
YELLOW_S = "\033[33ms\033[0m"  # unexpected failure, skip


class UnexpectedCode(Exception):
    pass


def get_geth_port(proc, timeout=20):
    start = time.time()
    pattern = re.compile(r"127\.0\.0\.1:(\d+)")
    while time.time() - start < timeout:
        rlist, _, _ = select.select([proc.stderr], [], [], 0.1)
        if not rlist:
            continue
        line = proc.stderr.readline()
        if not line:
            continue
        match = pattern.search(line)
        if match:
            port = int(match.group(1))
            if port not in {0, 80}:
                return port
    raise TimeoutError("Timeout waiting for port")


def cleanup(proc, tmp_dir):
    proc.terminate()
    try:
        proc.wait(timeout=5)
        if proc.stdout:
            proc.stdout.close()
        if proc.stderr:
            proc.stderr.close()
        tmp_dir.cleanup()
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()


async def main():
    tmp_dir = tempfile.TemporaryDirectory()
    datadir = os.path.join(tmp_dir.name, "datadir")
    os.makedirs(datadir, exist_ok=True)

    proc = subprocess.Popen(
        [
            GETH_BIN,
            "--datadir",  # passing in a `datadir` is required to reproduce
            datadir,
            "--dev",
            "--dev.period",
            "3",
            "--ws",
            "--ws.port",
            "0",  # let geth find an open port, then parse and connect
            "--ws.api",
            "admin,eth,web3,debug",
        ],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )

    try:
        port = get_geth_port(proc)
    except TimeoutError:
        # didn't parse port (if timeout too low), just skip this run
        print(YELLOW_S, end="", flush=True)
        cleanup(proc, tmp_dir)
        return

    async with AsyncWeb3(WebSocketProvider(f"ws://127.0.0.1:{port}")) as w3:
        try:
            accts = await w3.eth.accounts
            keyfile_acct = accts[0]

            # fund new acct
            acct = w3.eth.account.create()
            tx_hash = await w3.eth.send_transaction(
                {"from": keyfile_acct, "to": acct.address, "value": 10**22}
            )
            await w3.eth.wait_for_transaction_receipt(tx_hash)

            # send 7702 tx
            chain_id = await w3.eth.chain_id
            nonce = await w3.eth.get_transaction_count(acct.address)

            tx = {
                "chainId": chain_id,
                "nonce": nonce,
                "gas": 200_000,
                "maxFeePerGas": 10**11,
                "maxPriorityFeePerGas": 10**11,
                "to": acct.address,
                "value": 0,
                "accessList": [],
                "authorizationList": [
                    acct.sign_authorization(
                        {
                            "address": "0xdeadbeef00000000000000000000000000000000",
                            "chainId": chain_id,
                            "nonce": nonce + 1,
                        }
                    )
                ],
                "data": "0x",
            }
            signed = acct.sign_transaction(tx)
            tx_hash = await w3.eth.send_raw_transaction(signed.raw_transaction)
            await w3.eth.wait_for_transaction_receipt(tx_hash)
            code = await w3.eth.get_code(acct.address)  # the default block id is "latest"
            if not code.hex().startswith("ef0100deadbeef"):
                raise UnexpectedCode("Code was not set!")

            # clear code
            clear_tx = dict(tx)  # copy tx and replace relevant fields
            clear_tx["nonce"] = nonce + 2
            clear_tx["authorizationList"] = [
                acct.sign_authorization(
                    {
                        "chainId": chain_id,
                        "address": f"0x{'00' * 20}",
                        "nonce": nonce + 3,
                    }
                )
            ]
            signed_reset = acct.sign_transaction(clear_tx)
            clear_tx_hash = await w3.eth.send_raw_transaction(
                signed_reset.raw_transaction
            )
            await w3.eth.wait_for_transaction_receipt(clear_tx_hash)

            reset_code = await w3.eth.get_code(acct.address)
            if reset_code.hex() != "":
                raise UnexpectedCode("Code was not cleared!")
            print(GREEN_DOT, end="", flush=True)
        except UnexpectedCode as e:
            print(RED_X, end="", flush=True)
        except Exception as e:
            # skip other exceptions... these are much rarer + also seem related to
            # state (e.g. not funding the acct in time)
            print(YELLOW_S, end="", flush=True)
    cleanup(proc, tmp_dir)


if __name__ == "__main__":
    asyncio.run(main())

The following command will run the flow 80 times within 10 concurrent processes at any given time, which isn't a ton of load, but can adjust those values if need be.

$ pip install -U web3
$ seq 80 | xargs -n1 -P10 python /path/to/script/above

I get fairly consistent failures with this.

Image

fselmo avatar May 10 '25 08:05 fselmo

just encounter this as well, seems related to sync write option change

mmsqe avatar May 16 '25 16:05 mmsqe

seems related to sync write option change

I can confirm that reverting that commit locally and running the above script fixes the issue


Image

fselmo avatar May 16 '25 16:05 fselmo

Actually while I'm here, I tested current master and it seems to have this fixed.

So as long as it is known where the problem was and maybe there is a test for this to make sure it doesn't come back, this is probably safe to close @lightclient

fselmo avatar May 16 '25 16:05 fselmo

Actually while I'm here, I tested current master and it seems to have this fixed.

Not sure how intentional the change was toward this problem (definitely seems like it is related) but this is the PR that fixed it.

fselmo avatar May 16 '25 17:05 fselmo

@rjl493456442 any thoughts on this?

lightclient avatar May 16 '25 19:05 lightclient

Hey all, sorry for the chaos and thanks for catching it.

But as @fselmo said, the Pebble Syncmode write has already been replaced in https://github.com/ethereum/go-ethereum/pull/31703. Please use the latest master or wait a bit for the next release.

rjl493456442 avatar May 19 '25 02:05 rjl493456442

would like to work on this!

rose2221 avatar Jun 16 '25 23:06 rose2221

I can confirm this is now fixed in v1.16.0. Feel free to close this issue.

fselmo avatar Jun 28 '25 14:06 fselmo