
Answering: "What happens today when we run `nimbus --mainnet`?"

This meta issue is to write down a few observations when we run nimbus --mainnet (now changed to nimbus --network=mainnet).

Nobody has yet tested nimbus --mainnet with the serious goal of completing sync to head. It was known to take a lot of space and be I/O intensive. People said it would take "about 2 TB". So I thought, these days we can afford that, and decided to try it out properly, to see what happens for real so we're not guessing. Boy was 2 TB an underestimate. At block 6000000 (44.69%) and 4.1 TB of storage used, I stopped. Total estimated space is 9-15 TB. (A similar order of magnitude to Geth or OpenEthereum in archive mode.)

Most issues are more easily found and fixed by syncing to Goerli first. We won't note those here. For those, see instead the related issue #862 'Answering: "What happens today when we run nimbus --goerli?"' which has a detailed list of issues, all of which affect Mainnet too. (See also related issues #688 "Sync to Mainnet" and #687 "Sync to Goerli".)

Issues which are specific to Mainnet should be filed individually and fixed one by one, outside this meta issue. Ideally we will file those issues and fixes, and update this meta issue to point to them.

Time and space required

It has proven useful to know a guideline for how much time and storage to expect, so a Mainnet sync can be replicated without going through the tedium of trial and error, disk full recovery efforts, etc.

                     Value
Base version tested  521f29c0 (2021-08-24 18:30:52 +0700)
                     (later commits are required to complete; see issues in #862)
Time to sync         14 days 3 hours (up to block 6000000, 44.69%)
Storage space used   4.1 TB (up to block 6000000, 44.69%)
                     Projected to be 24.6 TB at head on 2022-01-13 (using Etherscan curve)
Test CPU             AMD Ryzen 9 5950X 3.5 GHz
Test storage         3x RAID-0 NVMe SSD, 512k stripe size
Test network         1 Gbit/s internet, no NAT

To reach block 6000000 in a similar time, you will need to run nimbus --mainnet in a loop to auto-restart it when it crashes, and with enough storage space. The time shown above does not count stops during the test where Nimbus crashed and was later restarted after analysis, time to recover from disk-full conditions, time spent syncing that was later reverted to a clean storage snapshot, or time when sync progress was stalled by one of the bugs affecting progress. (True calendar time for this test was 28 days 18 hours.)
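For illustration, a minimal auto-restart wrapper of the kind described above could look like the sketch below. This is not the script used in the test; the binary name and restart delay are assumptions, and it simply reruns the client whenever it exits.

    import subprocess
    import time

    # Minimal auto-restart loop: rerun the client whenever it exits, since the
    # client crashed periodically during this test. The binary name and flag are
    # assumptions; adjust to your build (e.g. a full path to the nimbus binary).
    while True:
        result = subprocess.run(["nimbus", "--network=mainnet"])
        print(f"nimbus exited with code {result.returncode}, restarting in 10 s")
        time.sleep(10)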

Total storage estimate

UPDATE 2022-01-13: The estimated total storage to reach Mainnet head at block 13993867 is 24.6 TB. This turned out to be considerably larger than the 9-15 TB initial estimate below. It was found by using the Geth and OpenEthereum growth charts at Etherscan to estimate the ratio by which space grows from block 6000000 to the head block.

Estimated total storage to reach Mainnet head at block 13425180 is 9-15 TB. I didn't have enough spare SSD to run Mainnet that far. The estimate comes from extrapolating block 6000000 / 4.1 TB to block 13425180, which gives 9.2 TB, and then adding more because experience with the smaller networks suggests the growth rate increases later in the chain. (Goerli grew from 396 GB to 805 GB between blocks 4792321 and 5631351.)
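As a rough check on the arithmetic behind the 9.2 TB figure, this is the linear extrapolation described above; the adjustment up to 9-15 TB is a judgment call based on the Goerli growth observation, not a formula.

    used_tb = 4.1            # storage used at block 6000000
    at_block = 6_000_000
    head_block = 13_425_180  # Mainnet head block at the time of the original estimate

    linear_estimate_tb = used_tb * head_block / at_block
    print(f"{linear_estimate_tb:.1f} TB")  # ~9.2 TB, assuming linear per-block growth
    # Goerli grew from 396 GB to 805 GB between blocks 4792321 and 5631351, so the
    # per-block growth rate rises later in the chain; hence the wider 9-15 TB figure.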

About the database size

The default prune mode, --prune:full, is in operation, and many state pruning events are in fact performed. It's not possible to recover full state history from this database. This is not an "archive node", despite what the size may suggest.

Note: Especially to readers outside the core team, it's worth mentioning the database and sync method are being replaced by an Exciting New Design™🏝 that is much faster and smaller. This test and #862 (Goerli test) were done to examine the current status, systematically track every issue that shows up so we can address them, and get handy baseline measurements to compare against.

Issues specific to Mainnet

Issues listed in #862 'Answering: "What happens today when we run nimbus --goerli?"' that affect both Goerli and Mainnet aren't duplicated here.

Most differences are general matters of scale and adversity rather than specific bugs:

  • The impracticality of testing syncing to the head of Mainnet at the moment.
  • The impracticality of testing real-time fully synced behaviour on Mainnet at the moment.
  • The transaction pool rate is much higher on Mainnet compared with Goerli.
  • There may be more client diversity, adverse clients, customised client versions, and low-quality, poorly responding peers.
  • There is a financial incentive to hack network nodes and subvert the chain logic, which is not present on testnets.

Below is a consensus bug that was seen only on Mainnet and prevented sync from progressing. Because testing only went a little past block 6000000 (44.69%), there may be other logic issues that we have not detected at higher block numbers.

Points where bulk sync stopped (only seen on Mainnet)

  • Progress stopped at block 6000961. This block number was due to a consensus bug at block 6001128 (see next), combined with the batching logic in blockchain_sync, which does 192 blocks at a time and aborts the whole batch when any block fails (see the sketch after this list).

  • Consensus bug at block 6001128. This occurs on a CREATE or CREATE2 operation, but it is not the same bug as the one at Goerli block 5080941 (see #862).

    • Symptom:
      TRC 2021-09-29 15:13:21.532+01:00 Persisting blocks                  file=persist_blocks.nim:43 fromBlock=6000961 toBlock=6001152
      ...
      DBG 2021-09-29 15:14:35.925+01:00 gasUsed neq cumulativeGasUsed      file=process_block.nim:68 gasUsed=7999726 cumulativeGasUsed=7989726
      TRC 2021-09-29 15:14:35.925+01:00 peer disconnected                  file=blockchain_sync.nim:407 peer=<PEER:IP>
      
    • Seen at many blocks in the range 6001128..6001204. After that, the bug is not seen again up to the highest tested block, 6021120.
    • This bug is linked to writeContract logic which occurs on a CREATE or CREATE2 operation, but it is not the same bug as the one affecting Goerli block 5080941 (see #862) which is also linked to writeContract logic.
    • It is not fixed by the fix for Goerli block 5080941 in commit 6548ff98 "fixes CREATE/CREATE2's returndata bug", which changes handling of received returnData from calling a nested contract.
    • It is fixed accidentally by a different fix for Goerli block 5080941, which works by changing the logic in writeContract instead of handling of received returnData. Because of this overlap, the Mainnet consensus bug at 6001128 and Goerli consensus bug at 5080941 were thought to be the same bug at first.
    • Accidental fixes are not good. We need to understand the bug, and be sure the logic makes sense / conforms to the specification.
    • The issue is connected to SELFDESTRUCT interaction with CREATE or CREATE2.
    • Filed as issue #868 "Gas usage consensus error at Mainnet block 6001128".
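To illustrate the first point above (progress sticking at 6000961 rather than at 6001128 itself), here is a minimal sketch of the batch arithmetic, assuming fixed batches of 192 blocks aligned from block 1, which is consistent with the fromBlock/toBlock values in the log; the real logic lives in blockchain_sync.

    BATCH_SIZE = 192
    FIRST_BLOCK = 1

    def batch_containing(block: int) -> range:
        """Batch boundaries in this simplified model."""
        start = FIRST_BLOCK + ((block - FIRST_BLOCK) // BATCH_SIZE) * BATCH_SIZE
        return range(start, start + BATCH_SIZE)

    bad_block = 6_001_128                # block with the gas usage consensus error
    batch = batch_containing(bad_block)
    print(batch.start, batch.stop - 1)   # -> 6000961 6001152, matching the log above
    # A failure anywhere in a batch aborts the whole batch, so sync cannot advance
    # past 6000961 even though blocks 6000961..6001127 are themselves fine.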

jlokier, Oct 19 '21 08:10