Unable to load a 1GB genesis file in 40 seconds in version 1.28.0.

Open lyfsn opened this issue 1 year ago • 4 comments

Description

Our custom network uses a large 1GB genesis.json file, and it worked fine with versions before 1.28.0, such as 1.27.x.

However, after upgrading to version 1.28.0, my Nethermind node fails to start with this error:

26 Aug 02:44:18 | Snap serving enabled, but PruningBoundary is less than 128. Setting to 128. 
26 Aug 02:45:39 | Step LoadGenesisBlock         failed after 80976ms System.TimeoutException: Genesis block was not processed after 40 seconds
   at Nethermind.Init.Steps.LoadGenesisBlock.Load(IWorldState worldState) in /src/Nethermind/Nethermind.Init/Steps/LoadGenesisBlock.cs:line 88
   at Nethermind.Init.Steps.LoadGenesisBlock.Execute(CancellationToken _) in /src/Nethermind/Nethermind.Init/Steps/LoadGenesisBlock.cs:line 46
   at Nethermind.Init.Steps.EthereumStepsManager.ExecuteStep(IStep step, StepInfo stepInfo, CancellationToken cancellationToken) in /src/Nethermind/Nethermind.Init/Steps/EthereumStepsManager.cs:line 153
   at Nethermind.Init.Steps.EthereumStepsManager.InitializeAll(CancellationToken cancellationToken) in /src/Nethermind/Nethermind.Init/Steps/EthereumStepsManager.cs:line 95
   at Nethermind.Runner.Ethereum.EthereumRunner.Start(CancellationToken cancellationToken) in /src/Nethermind/Nethermind.Runner/Ethereum/EthereumRunner.cs:line 36
   at Nethermind.Runner.Program.<>c__DisplayClass8_0.<<Run>b__1>d.MoveNext() in /src/Nethermind/Nethermind.Runner/Program.cs:line 213
26 Aug 02:45:39 | Error during ethereum runner start System.TimeoutException: Genesis block was not processed after 40 seconds
   at Nethermind.Init.Steps.LoadGenesisBlock.Load(IWorldState worldState) in /src/Nethermind/Nethermind.Init/Steps/LoadGenesisBlock.cs:line 88
   at Nethermind.Init.Steps.LoadGenesisBlock.Execute(CancellationToken _) in /src/Nethermind/Nethermind.Init/Steps/LoadGenesisBlock.cs:line 46
   at Nethermind.Init.Steps.EthereumStepsManager.ExecuteStep(IStep step, StepInfo stepInfo, CancellationToken cancellationToken) in /src/Nethermind/Nethermind.Init/Steps/EthereumStepsManager.cs:line 153
   at Nethermind.Init.Steps.EthereumStepsManager.InitializeAll(CancellationToken cancellationToken) in /src/Nethermind/Nethermind.Init/Steps/EthereumStepsManager.cs:line 95
   at Nethermind.Runner.Ethereum.EthereumRunner.Start(CancellationToken cancellationToken) in /src/Nethermind/Nethermind.Runner/Ethereum/EthereumRunner.cs:line 36
   at Nethermind.Runner.Program.<>c__DisplayClass8_0.<<Run>b__1>d.MoveNext() in /src/Nethermind/Nethermind.Runner/Program.cs:line 213

Steps to Reproduce

  1. Generate a large genesis file of 1GB.
  2. Use this large genesis file to initialize and start the node.
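The script used to produce the file is not included in this thread, so the following is only a minimal sketch of how step 1 could be done. It assumes the bulk of the file is a JSON object mapping addresses to `{"balance": ...}` entries; the function name `write_accounts`, the synthetic addresses, and the balance value are all made up for illustration, and the surrounding chainspec fields are left out.

```python
import io
import json


def write_accounts(out, count, balance="0x1"):
    """Stream a JSON object with `count` synthetic accounts to a text file object.

    Entries are written one at a time, so memory use stays flat no matter
    how large the resulting file gets.
    """
    out.write("{")
    for i in range(count):
        if i:
            out.write(",")
        # Synthetic 20-byte address (hypothetical, not a real account).
        address = f"0x{i + 1:040x}"
        out.write(json.dumps(address) + ":" + json.dumps({"balance": balance}))
    out.write("}")


# Small demonstration; for a real test, write to a file and raise `count`
# until the file reaches the size you want.
demo = io.StringIO()
write_accounts(demo, 3)
accounts = json.loads(demo.getvalue())
```

The output size grows linearly with `count`, so the target file size can be reached by scaling that one parameter.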

Actual behavior

The node can't start and logs a timeout of 40 seconds.

By the way, why is the 40s timeout hardcoded? https://github.com/NethermindEth/nethermind/blob/e856de5a33259ea0e54c40c28db37631bf56c2c0/src/Nethermind/Nethermind.Init/Steps/LoadGenesisBlock.cs#L23

Expected behavior

The node can start normally, just like in version 1.27.x.

Screenshots

[Image: Screenshot 2024-08-26 at 10 57 09]

Desktop

  • Operating System: Linux
  • Version: 1.28.0
  • Installation Method: Docker
  • Consensus Client: none

Additional context

In more precise testing, I found that if the genesis file size exceeds 256MB, the node fails to start and times out while loading the genesis file.

My startup parameters:

version: "3.9"
services:
  execution:
    tty: true
    environment:
    - TERM=xterm-256color
    - COLORTERM=truecolor
    stop_grace_period: 30s
    container_name: gas-execution-client
    image: ${EC_IMAGE_VERSION}
    networks:
    - gas
    volumes:
    - ${EC_DATA_DIR}:/nethermind/data
    - ${EC_JWT_SECRET_PATH}:/tmp/jwt/jwtsecret
    - ${CHAINSPEC_PATH}:/tmp/chainspec/chainspec.json
    ports:
    - "30304:30304/tcp"
    - "30304:30304/udp"
    - "8009:8009"
    - "8545:8545"
    - "8551:8551"
    expose:
    - 8545
    - 8551
    command:
    - --config=none.cfg
    - --Init.ChainSpecPath=/tmp/chainspec/chainspec.json
    - --datadir=/nethermind/data
    - --log=INFO
    - --JsonRpc.Enabled=true
    - --JsonRpc.Host=0.0.0.0
    - --JsonRpc.Port=8545
    - --JsonRpc.JwtSecretFile=/tmp/jwt/jwtsecret
    - --JsonRpc.EngineHost=0.0.0.0
    - --JsonRpc.EnginePort=8551
    - --Network.DiscoveryPort=30304
    - --HealthChecks.Enabled=true
    - --Metrics.Enabled=true
    - --Metrics.ExposePort=8009
    - --Sync.MaxAttemptsToUpdatePivot=0
    logging:
      driver: json-file
      options:
        max-size: 10m
        max-file: "10"
networks:
  gas:
    name: gas-network

Logs

lyfsn, Aug 26 '24 03:08

Can you share the genesis file you are using?

LukaszRozmej, Aug 29 '24 08:08

Can you share the genesis file you are using?

In my test environment, I generate a random genesis file every time using this script, which creates many accounts in one genesis file.

For a quick test, you could also try the genesis file of Endurance's mainnet, which is larger than 800MB: https://github.com/OpenFusionist/network_config (I haven't checked whether this file reproduces the error, though. My error comes from the script method mentioned above.)

lyfsn, Aug 29 '24 16:08

Hi @LukaszRozmej, I investigated the performance regression mentioned above further and have some conclusions and points I'd like to discuss.

Regarding the performance issue: PR https://github.com/NethermindEth/nethermind/pull/7215 was a performance optimization that replaced LruCache with ClockCache to reduce lock granularity. However, due to implementation details, it caused a regression that led to timeouts when initializing large genesis files (>800MB). Based on our tests, the latest commit (60159fb448d5b7fd53565aa7b15942a8c68614ba) appears to have fixed this issue.

Issue identification method:

  • Compared commits between two releases (1.27.1...1.28.0)
  • Locally compiled Nethermind and attempted to start it with a large genesis file
  • Used git bisect to gradually locate the problematic commit
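The bisect step above could be automated with `git bisect run`, which treats exit code 0 as good, 125 as "skip this commit", and other codes from 1 to 127 as bad. The sketch below only covers the classification part and is an assumption about how such a helper might look: the function name `bisect_exit_code` is hypothetical, and building and starting the node at each commit is left out. The marker string is taken from the error log in the issue description.

```python
# Message fragment from the TimeoutException in the log quoted in this issue.
TIMEOUT_MARKER = "Genesis block was not processed"


def bisect_exit_code(log_text: str, build_ok: bool = True) -> int:
    """Map one test run to an exit code understood by `git bisect run`.

    0   -> commit is good (no genesis timeout in the log)
    1   -> commit is bad (genesis timeout found)
    125 -> commit cannot be tested (e.g. it failed to build), so skip it
    """
    if not build_ok:
        return 125
    return 1 if TIMEOUT_MARKER in log_text else 0


# Example with a line shaped like the one in the report.
sample = "System.TimeoutException: Genesis block was not processed after 40 seconds"
code = bisect_exit_code(sample)
```

A wrapper script would pass the captured node output to this function and call `sys.exit` with the result. Note that a missing marker only means the timeout was not logged; the wrapper should also confirm the node actually got past the genesis step before reporting a commit as good.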

Regarding the 40s hard-coded timeout: this has been discussed previously. Related PR: https://github.com/NethermindEth/nethermind/pull/6160. We can discuss this further:

It's up for discussion if we want to increase the timeout from 40 seconds (current default, hard-coded value) to something different.

Let me know if you need any additional information or clarification on this matter.

ohko4711, Sep 18 '24 06:09

@ohko4711 thank you for the analysis. #7215 might have had some unplanned effect. However, based on the code, https://github.com/NethermindEth/nethermind/commit/60159fb448d5b7fd53565aa7b15942a8c68614ba shouldn't affect genesis, so I'm not sure that is what fixed it. @benaadams can you check? Both are your changes.

I will move the timeout to config though.

LukaszRozmej, Sep 18 '24 07:09
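To illustrate what moving the timeout to config means in practice, here is a minimal Python sketch of a wait whose limit is a parameter instead of a constant. It is not Nethermind's C# implementation; the names `wait_for_genesis` and `DEFAULT_GENESIS_TIMEOUT_S` are invented for this example, and the only value taken from the thread is the 40-second default.

```python
import threading

# Default mirrors the 40-second limit reported in the error message.
DEFAULT_GENESIS_TIMEOUT_S = 40.0


def wait_for_genesis(processed: threading.Event,
                     timeout_s: float = DEFAULT_GENESIS_TIMEOUT_S) -> None:
    """Block until `processed` is set, or raise once `timeout_s` has elapsed."""
    if not processed.wait(timeout=timeout_s):
        raise TimeoutError(
            f"Genesis block was not processed after {timeout_s:g} seconds"
        )


# Usage: a slow genesis load passes once the operator raises the limit.
done = threading.Event()
threading.Timer(0.05, done.set).start()  # stands in for genesis processing
wait_for_genesis(done, timeout_s=1.0)
```

With the limit passed in as a parameter, operators of networks with very large genesis files can raise it without rebuilding the client.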

@LukaszRozmej Anything more planned for this issue?

kamilchodola, Jan 10 '25 10:01

We could explore some additional optimizations that avoid deserializing the genesis file into an object, but that is very low priority now, so I think we can close.

LukaszRozmej, Jan 10 '25 12:01
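The optimization mentioned in the last comment (not deserializing the whole genesis file into one object) can be illustrated with a small streaming reader. This is a Python sketch of the general idea only, not Nethermind code; the function name `iter_object_items` is made up, and it assumes the account section is a single JSON object keyed by address. It uses only the standard library's `json.JSONDecoder.raw_decode` on a sliding buffer.

```python
import io
import json


def iter_object_items(stream, chunk_size=1 << 16):
    """Yield (key, value) pairs of a top-level JSON object, one at a time."""
    decoder = json.JSONDecoder()
    buf, pos, eof = "", 0, False

    def more():
        # Drop consumed text and append the next chunk from the stream.
        nonlocal buf, pos, eof
        chunk = stream.read(chunk_size)
        eof = not chunk
        buf, pos = buf[pos:] + chunk, 0

    def skip(extra=""):
        # Advance past whitespace and the given separator characters.
        nonlocal pos
        while True:
            while pos < len(buf) and buf[pos] in " \t\r\n" + extra:
                pos += 1
            if pos < len(buf) or eof:
                return
            more()

    def decode():
        # Decode one value at `pos`; read more input if it looks truncated.
        nonlocal pos
        while True:
            try:
                value, end = decoder.raw_decode(buf, pos)
                # Accept only if something follows the value (or input ended),
                # so a number split across chunks is never cut short.
                if end < len(buf) or eof:
                    pos = end
                    return value
            except json.JSONDecodeError:
                if eof:
                    raise
            more()

    skip()
    if pos >= len(buf) or buf[pos] != "{":
        raise ValueError("expected a JSON object")
    pos += 1
    while True:
        skip(",")
        if pos >= len(buf):
            raise ValueError("unexpected end of input")
        if buf[pos] == "}":
            return
        key = decode()
        skip(":")
        yield key, decode()


# Usage: only one account entry is held in memory at a time.
source = io.StringIO('{"0xaa": {"balance": "0x1"}, "0xbb": {"balance": "0x2"}}')
items = list(iter_object_items(source, chunk_size=5))
```

The sketch is lenient about separators and re-parses a value from its start whenever it spans a chunk boundary, which is acceptable for small per-account entries but would need tuning for very large nested values.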