Increased latency in eth calls
System information
Erigon version: 2.60.8
OS & Version: Alpine Linux v3.20 in K8s
Erigon Command (with flags/config):
erigon
--chain=bor-mainnet
--datadir=/home/erigon/persistence/data
--http.addr=0.0.0.0
--rpc.accessList=/home/erigon/acl-config/acl.json
--rpc.batch.limit=10000
--rpc.txfeecap=100
--http.api=eth,erigon,web3,net,debug,txpool,trace
--http.vhosts=*
--http.corsdomain=null
--http
--ws
--db.pagesize=16384
--ethash.dagdir=/home/erigon/persistence/dag
--maxpeers=100
--private.api.addr=0.0.0.0:9090
--private.api.ratelimit=640000
--rpc.returndata.limit=500000
--metrics
--metrics.addr=0.0.0.0
--healthcheck
--port=30303
--db.size.limit=16TB
--torrent.download.rate=512mb
--torrent.download.slots=5
--bor.milestone=true
--staticpeers
--http.timeouts.read=300s
--state.cache=8192
--pprof
--pprof.addr="0.0.0.0"
--pprof.port=6061
--database.verbosity=4
--http.trace
--verbosity=4
--log.dir.verbosity=4
--bor.heimdall=<url>
Chain/Network: Ethereum / Polygon / BSC
Expected behaviour
No latency spikes in eth_ calls
Actual behaviour
For a while now we’ve been witnessing random latency spikes on the following calls:
- eth_getBlockByNumber
- eth_getBlockReceipts
- eth_blockNumber
- eth_getStorageAt
- eth_call
- eth_getCode
- eth_getBalance
- eth_getTransactionReceipt
We are trying to understand why these calls take up to 20 seconds.
eth_blockNumber usually returns in less than a second.
This happens on Ethereum, Polygon, and BSC mainnet, and it happens on archive-mode runs.
The graphs below are from an archive Erigon node running Polygon over the last 28 days:
eth_getBlockByNumber
eth_call
eth_blockNumber
Our initial guess was that this happens because of high traffic, but that doesn’t seem to be the case. We’ve had one peak in this range, but that’s it.
We can confirm network and I/O are not bottlenecks in this case, as we’ve thoroughly checked and other non-Erigon nodes are operational.
Steps to reproduce the behaviour
To reproduce this, one can set up https://github.com/louislam/uptime-kuma, create a monitor, and point it at one of the nodes to POST eth_blockNumber periodically. Random spikes should appear on the graph from time to time; a plain curl poller (sketched below) shows the same thing.
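For reference, a minimal curl-based poller (a sketch only; replace the placeholder endpoint with your node's --http.addr and port):
# Poll eth_blockNumber every 10s and print the total request time measured by curl.
# http://<node-ip>:8545 is a placeholder for the node's HTTP RPC endpoint.
while true; do
  curl -s -o /dev/null -w 'eth_blockNumber took %{time_total}s\n' \
    -X POST http://<node-ip>:8545 \
    -H 'Content-Type: application/json' \
    -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}'
  sleep 10
done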
Backtrace
Example for eth_blockNumber when a spike happens, from the logs. From our monitoring, the request was initiated at 14:15:27, but as you can see from the logs it was answered at 14:15:35.
[WARN] [10-09|14:15:20.647] [rpc] served conn=10.42.12.108:54318 method=eth_getBlockByNumber reqid=safe t=72.98µs err="Unknown block"
[INFO] [10-09|14:15:20.647] Served conn=10.42.12.108:54318 t=136.71µs method=eth_getBlockByNumber reqid=safe params="[\"safe\",false]"
[INFO] [10-09|14:15:20.647] Served conn=10.42.12.108:54318 t=151.45µs method=eth_getBlockByNumber reqid=latest params="[\"latest\",false]"
[INFO] [10-09|14:15:20.647] Served conn=10.42.12.108:54318 t=264.19µs method=eth_getBlockByNumber reqid=finalized params="[\"finalized\",false]"
[INFO] [10-09|14:15:20.841] Served conn=10.42.12.247:56092 t=50.086419ms method=trace_replayBlockTransactions reqid=1728483320660 params="[\"0x3BEAD45\",[\"trace\"]]"
[INFO] [10-09|14:15:21.279] Served conn=10.42.21.225:57404 t=128.86µs method=eth_getCode reqid=4 params="[\"0xd3082872F8B06073A021b4602e022d5A070d7cfC\",\"latest\"]"
[INFO] [10-09|14:15:26.952] Served conn=[::1]:44863 t=48.18µs method=net_version reqid=67 params=[]
[INFO] [10-09|14:15:27.015] Served conn=[::1]:43791 t=37.88µs method=net_version reqid=67 params=[]
[DBUG] [10-09|14:15:31.449] [bor.heimdall] Got new milestone start=62827807 end=62827828
[INFO] [10-09|14:15:35.469] Served conn=10.42.23.156:44640 t=5.41151045s method=eth_blockNumber reqid=1 params=[]
[INFO] [10-09|14:15:35.469] Served conn=10.42.24.232:45902 t=9.714510998s method=eth_blockNumber reqid=1 params=[]
[INFO] [10-09|14:15:35.469] Served conn=10.42.23.156:46988 t=13.782515824s method=eth_blockNumber reqid=1 params=[]
[INFO] [10-09|14:15:35.469] Served conn=10.42.26.189:45034 t=4.040338563s method=eth_getBlockByNumber reqid=1 params="[\"latest\",false]"
[INFO] [10-09|14:15:35.469] Served conn=10.42.27.86:41460 t=2.945821208s method=eth_blockNumber reqid=1 params=[]
[WARN] [10-09|14:15:35.469] [rpc] served conn=10.42.21.225:47324 method=trace_callMany reqid=1 t=6.805521594s err=stopped
[INFO] [10-09|14:15:35.469] Served conn=10.42.27.86:41454 t=4.235557869s method=eth_blockNumber reqid=1 params=[]
[WARN] [10-09|14:15:35.469] [rpc] served conn=10.42.21.225:57392 method=trace_callMany reqid=1 t=479.222164ms err=stopped
Another occurrence
Initiated at :27, answered at :34.
[DBUG] [10-09|11:12:19.449] [bor.heimdall] Got new milestone start=62822670 end=62822690
[WARN] [10-09|11:12:20.204] [rpc] served conn=10.42.21.225:60922 method=eth_getBlockByNumber reqid=safe t=68.21µs err="Unknown block"
[INFO] [10-09|11:12:20.204] Served conn=10.42.21.225:60922 t=117.21µs method=eth_getBlockByNumber reqid=safe params="[\"safe\",false]"
[INFO] [10-09|11:12:20.205] Served conn=10.42.21.225:60922 t=521.749µs method=eth_getBlockByNumber reqid=finalized params="[\"finalized\",false]"
[INFO] [10-09|11:12:20.205] Served conn=10.42.21.225:60922 t=1.229418ms method=eth_getBlockByNumber reqid=latest params="[\"latest\",false]"
[WARN] [10-09|11:12:20.291] [rpc] served conn=10.42.12.108:54318 method=eth_getBlockByNumber reqid=safe t=48.77µs err="Unknown block"
[INFO] [10-09|11:12:20.291] Served conn=10.42.12.108:54318 t=86.75µs method=eth_getBlockByNumber reqid=safe params="[\"safe\",false]"
[INFO] [10-09|11:12:20.291] Served conn=10.42.12.108:54318 t=135.049µs method=eth_getBlockByNumber reqid=latest params="[\"latest\",false]"
[INFO] [10-09|11:12:20.292] Served conn=10.42.12.108:54318 t=537.749µs method=eth_getBlockByNumber reqid=finalized params="[\"finalized\",false]"
[INFO] [10-09|11:12:21.791] Served conn=10.42.21.225:54890 t=1.507088ms method=eth_getBlockByNumber reqid=2646 params="[\"0x3be993c\",true]"
[INFO] [10-09|11:12:26.914] Served conn=[::1]:41687 t=41.68µs method=net_version reqid=67 params=[]
[INFO] [10-09|11:12:26.978] Served conn=[::1]:37025 t=40.17µs method=net_version reqid=67 params=[]
[DBUG] [10-09|11:12:31.450] [bor.heimdall] Got new milestone start=62822670 end=62822690
[WARN] [10-09|11:12:34.795] [rpc] served conn=10.42.21.225:60922 method=eth_getBlockByNumber reqid=safe t=9.58801108s err="Unknown block"
[INFO] [10-09|11:12:34.795] Served conn=10.42.21.225:60922 t=9.58806791s method=eth_getBlockByNumber reqid=safe params="[\"safe\",false]"
[INFO] [10-09|11:12:34.795] Served conn=10.42.24.237:40436 t=9.453629739s method=eth_blockNumber reqid=1 params=[]
[INFO] [10-09|11:12:34.795] Served conn=10.42.24.232:56466 t=3.607566971s method=eth_blockNumber reqid=1 params=[]
[INFO] [10-09|11:12:34.796] Served conn=[::1]:35809 t=7.813439057s method=eth_getBlockByNumber reqid=1 params="[\"latest\", false]"
[INFO] [10-09|11:12:34.795] Served conn=10.42.21.225:54254 t=2.443219605s method=eth_blockNumber reqid=1 params=[]
try --db.read.concurrency=9000
we've increased it before to --db.read.concurrency=16384 but removed the flag as there were no significant results from that
I will set it to 9000 for reference and bring the results back here in 1-2 days
Then add --pprof:
go tool pprof -png "http://127.0.0.1:6060/debug/pprof/profile?seconds=60" > cpu.png
go tool pprof -alloc_objects -png "http://127.0.0.1:6060/debug/pprof/heap?seconds=20" > mem.png
FYI, still observing spikes with --db.read.concurrency=9000
Here is the pprof output
CPU
MEM
I see - it's probably happening when a new file gets created: in this case RPCDaemon now does a "ReOpen" of files, and it's not really necessary. We are also working on it in https://github.com/erigontech/erigon/issues/12299, and heap profiling is a bit optimized here: https://github.com/erigontech/erigon/pull/11710. So we will fix it in Erigon3 first, then decide whether it's easy to backport to Erigon2 or not.
Good to know it's spotted. Would you suggest any workarounds in the meantime until this gets resolved? Can we downgrade the nodes to a version before "ReOpen" was introduced?
I think it’s very old behavior. Do you see it after a recent upgrade?
@JkLondon hi.
- Can you try to port your KnownCfg optimisation to Erigon2?
We did a node upgrade to 2.60.6 on the 28th of August, so it doesn't look like it's related to bumping up the version.
I can't say if it's happening on versions before 2.60, but it's been happening for at least 3 months. It seems it started getting a bit more spiky in the second half of September. Not sure what these large spikes were, tbh, but even if we ignore them it averages 7-10 seconds for an eth_blockNumber call.
However, I've noticed that in one of the Discord threads someone was complaining in May that he saw a lot of high-latency calls when he bumped from 2.58.2 to 2.60.0 and that he had to roll back to resolve it, hence my question about downgrading.
Thank you for finding it. Seems that support request was not solved. We will work on it.
@JkLondon hi.
- Can you try to port your KnownCfg optimisation to Erigon2?
Have done this optimisation: https://github.com/erigontech/erigon/pull/12351
I will work on removing the need for file re-open (on E3 first, then will port to E2): https://github.com/erigontech/erigon/pull/12332
@AskAlexSharov Can we expect a release with the fix anytime soon?
2.60.9 release happened
will test if the issue persists and get back in here
Sorry guys, this is taking longer than anticipated to test since the underlying base image also changed, so we have to incorporate that. Will have some info on this by EOW.
Thank you
Hey guys
We've tried using 2.60.10 to reduce the latency across regions, and here is what we found out and what we are battling against currently:
- A set of our Ethereum nodes do see the spikes on eth_ calls reduced. These nodes have things in common: RPS usually averages under 100, and their chaindata is ~2.3TB on average.
- The other set of Ethereum nodes see no difference in spikes for eth_ calls. What these nodes have in common: RPS averages higher than 100, and their chaindata is 5.3TB+.
- None of the Polygon nodes see any difference in latency; be it higher or lower RPS, it stays the same.
This honestly points to blockchain size being connected to the problem, as Polygon also has big ledgers on our end, ~9.9TB.
We were thinking of resyncing the nodes to bring the latency down, but honestly it's not a small amount of work, and we want to check whether that would be the correct step forward.
Would appreciate ideas or suggestions
...and sorry for the hold up
- You can copy 2.3TB chaindata to the nodes which have 5.3TB chaindata. It's easier than a resync.
- I see you have --rpc.batch.limit=10000. Maybe you can try to increase/decrease --rpc.batch.concurrency?
- Are your metrics produced by Erigon, or do you measure them outside of Erigon?
- You can try running a separate rpcdaemon (on the same datadir): https://github.com/erigontech/erigon/tree/main/cmd/rpcdaemon#getting-started - it will reduce the impact of other components (txpool, chain-tip sync, etc.) on rpc (it will have its own limits, etc.). A minimal launch sketch follows below.
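For reference, a minimal sketch of launching a separate rpcdaemon against the same datadir. The flag names follow the rpcdaemon README linked above; the path and API list reuse the values from this issue and should be adjusted to your setup.
# Standalone rpcdaemon reading the node's datadir; serves HTTP and WS itself.
rpcdaemon \
  --datadir=/home/erigon/persistence/data \
  --http.addr=0.0.0.0 \
  --http.api=eth,erigon,web3,net,debug,txpool,trace \
  --ws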
Thanks for the suggestions and ideas, Alex.
Dedicating time to the first point currently. We've got no Polygon nodes with "normal" ledger sizes; all have grown to around 9TB of data. For Ethereum, I'll definitely copy over.
Metrics are produced by a proxy service. They just measure the time it took to get the answer from Erigon. The proxies are deployed in the same network as the Erigon nodes, so the network latency is minimal.
For --rpc.batch.concurrency:
We've got a metric that lets me filter by batch requests, e.g. check whether a request was part of a batch or not. The response time is the same even if the request is not in a batch.
For rpcdaemon:
Will take a look.
We've noticed improvements when replacing Ethereum ledgers; still not at its best compared to geth, but it's a lot better.
Wondering what causes these bloated ledgers? Is it debug calls? Does this mean that each time we get a ledger bigger than the average size, we have to download a new snapshot or replace it with a non-bloated one to improve latency?
There may be 2 reasons:
- A long read transaction (a debug_call alone is not long enough) doesn't allow free space to be re-used. For example a tx leak: https://github.com/erigontech/erigon/pull/8596
- Commits without fsync: we also faced this in TxPool (can't find the PR now), but not in chaindata.
To catch slow RPC calls: --rpc.slow=10m
To catch slow/leaked Ro (read-only) transactions: SLOW_TX=10m (a combined launch sketch follows below)
Regarding "download a new snapshot": the only way to compact the db is to copy data to a new db, for example by: integration mdbx_to_mdbx
Also can try:
- --sync.loop.prune.limit=10
Also can try Erigon3 - chaindata size there is 20GB. https://github.com/erigontech/erigon?tab=readme-ov-file#erigon3-changes-from-erigon2
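For reference, a sketch of combining the two diagnostics above with the existing launch command. Only the SLOW_TX environment variable and the --rpc.slow flag are added, both with the 10m threshold suggested above; everything else stays as in the original command.
# Log RPC calls and read-only DB transactions that run longer than 10 minutes.
SLOW_TX=10m erigon \
  --rpc.slow=10m \
  --chain=bor-mainnet \
  --datadir=/home/erigon/persistence/data
  # ...remaining flags unchanged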
Thanks for the insights, will bring this back to the team and discuss.
Small update: fresh Polygon DBs are not helping reduce the latency.
Hey @AskAlexSharov, a quick question: is Erigon3 production-ready?
- Erigon3 is more stable for Polygon chains than Erigon2, because it's way easier for the dev team to have/debug/fix problems of BorMainnet on Erigon3 (much easier to get to chain-tip or re-sync).
- There is a higher chance of getting performance-issue support on Erigon3.
- It supports all RPC methods. And P2P too.
- It may greatly increase/reduce latency of Receipt/Log-related RPC methods - depends on your use-case.
- Several NodeProviders confirmed that Erigon3 works for them.
- We also have several validators on Erigon3.
- We are on the way to a beta release; for us it means "releases will become more backward compatible; now some releases may advise/require a re-sync". But a re-sync is fast if you have enough network bandwidth.
- We enabled --internalcl by default. We have worked on externalcl support too, but it may have some issues: https://github.com/erigontech/erigon/issues/13321
Differences: https://github.com/erigontech/erigon?tab=readme-ov-file#erigon3-changes-from-erigon2 Book: https://docs.erigon.tech/
@AskAlexSharov, https://github.com/erigontech/erigon/issues/13320 should also be addressed and would be an easy fix; is there any progress on fixing the issue?
Let's make sure that the latency is good at least in E3.
Yes. Seems E3 doesn't have such a problem. 12h perf test (eth-mainnet):
thank you. closing for now.