Increased latency in eth calls
System information
Erigon version: 2.60.8
OS & Version: Alpine Linux v3.20 in K8s
Erigon Command (with flags/config):
erigon
--chain=bor-mainnet
--datadir=/home/erigon/persistence/data
--http.addr=0.0.0.0
--rpc.accessList=/home/erigon/acl-config/acl.json
--rpc.batch.limit=10000
--rpc.txfeecap=100
--http.api=eth,erigon,web3,net,debug,txpool,trace
--http.vhosts=*
--http.corsdomain=null
--http
--ws
--db.pagesize=16384
--ethash.dagdir=/home/erigon/persistence/dag
--maxpeers=100
--private.api.addr=0.0.0.0:9090
--private.api.ratelimit=640000
--rpc.returndata.limit=500000
--metrics
--metrics.addr=0.0.0.0
--healthcheck
--port=30303
--db.size.limit=16TB
--torrent.download.rate=512mb
--torrent.download.slots=5
--bor.milestone=true
--staticpeers
--http.timeouts.read=300s
--state.cache=8192
--pprof
--pprof.addr="0.0.0.0"
--pprof.port=6061
--database.verbosity=4
--http.trace
--verbosity=4
--log.dir.verbosity=4
--bor.heimdall=<url>
Chain/Network: Ethereum / Polygon / BSC
Expected behaviour
No latency spikes in eth_ calls
Actual behaviour
For a while now we’ve been witnessing random latency spikes on the following calls:
- eth_getBlockByNumber
- eth_getBlockReceipts
- eth_blockNumber
- eth_getStorageAt
- eth_call
- eth_getCode
- eth_getBalance
- eth_getTransactionReceipt
We are trying to understand why these calls take up to 20 seconds.
eth_blockNumber usually returns in less than a second.
This happens on Ethereum, Polygon, and BSC mainnet, and it happens on archive-mode runs.
The graphs below are from an archive Erigon node running Polygon over the last 28 days:
eth_getBlockByNumber
eth_call
eth_blockNumber
Our initial guess was that this happens because of high traffic, but that doesn’t seem to be the case. We’ve had one peak in this range, but that’s it.
We can confirm network and I/O are not bottlenecks in this case, as we’ve thoroughly checked and other non-Erigon nodes are operational.
Steps to reproduce the behaviour
To reproduce this, one can set up https://github.com/louislam/uptime-kuma, create a monitor, and point it at one of the nodes to POST eth_blockNumber periodically. Random spikes should appear on the graph from time to time; a plain curl poller (sketched below) shows the same thing.
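For reference, a minimal curl-based poller (a sketch only; replace the placeholder endpoint with your node's --http.addr and port):
# Poll eth_blockNumber every 10s and print the total request time measured by curl.
# http://<node-ip>:8545 is a placeholder for the node's HTTP RPC endpoint.
while true; do
  curl -s -o /dev/null -w 'eth_blockNumber took %{time_total}s\n' \
    -X POST http://<node-ip>:8545 \
    -H 'Content-Type: application/json' \
    -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}'
  sleep 10
done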
Backtrace
Example for eth_blockNumber when a spike happens, from the logs. From our monitoring, the request was initiated at 14:15:27, but as you can see from the logs it was answered at 14:15:35.
[WARN] [10-09|14:15:20.647] [rpc] served conn=10.42.12.108:54318 method=eth_getBlockByNumber reqid=safe t=72.98µs err="Unknown block"
[INFO] [10-09|14:15:20.647] Served conn=10.42.12.108:54318 t=136.71µs method=eth_getBlockByNumber reqid=safe params="[\"safe\",false]"
[INFO] [10-09|14:15:20.647] Served conn=10.42.12.108:54318 t=151.45µs method=eth_getBlockByNumber reqid=latest params="[\"latest\",false]"
[INFO] [10-09|14:15:20.647] Served conn=10.42.12.108:54318 t=264.19µs method=eth_getBlockByNumber reqid=finalized params="[\"finalized\",false]"
[INFO] [10-09|14:15:20.841] Served conn=10.42.12.247:56092 t=50.086419ms method=trace_replayBlockTransactions reqid=1728483320660 params="[\"0x3BEAD45\",[\"trace\"]]"
[INFO] [10-09|14:15:21.279] Served conn=10.42.21.225:57404 t=128.86µs method=eth_getCode reqid=4 params="[\"0xd3082872F8B06073A021b4602e022d5A070d7cfC\",\"latest\"]"
[INFO] [10-09|14:15:26.952] Served conn=[::1]:44863 t=48.18µs method=net_version reqid=67 params=[]
[INFO] [10-09|14:15:27.015] Served conn=[::1]:43791 t=37.88µs method=net_version reqid=67 params=[]
[DBUG] [10-09|14:15:31.449] [bor.heimdall] Got new milestone start=62827807 end=62827828
[INFO] [10-09|14:15:35.469] Served conn=10.42.23.156:44640 t=5.41151045s method=eth_blockNumber reqid=1 params=[]
[INFO] [10-09|14:15:35.469] Served conn=10.42.24.232:45902 t=9.714510998s method=eth_blockNumber reqid=1 params=[]
[INFO] [10-09|14:15:35.469] Served conn=10.42.23.156:46988 t=13.782515824s method=eth_blockNumber reqid=1 params=[]
[INFO] [10-09|14:15:35.469] Served conn=10.42.26.189:45034 t=4.040338563s method=eth_getBlockByNumber reqid=1 params="[\"latest\",false]"
[INFO] [10-09|14:15:35.469] Served conn=10.42.27.86:41460 t=2.945821208s method=eth_blockNumber reqid=1 params=[]
[WARN] [10-09|14:15:35.469] [rpc] served conn=10.42.21.225:47324 method=trace_callMany reqid=1 t=6.805521594s err=stopped
[INFO] [10-09|14:15:35.469] Served conn=10.42.27.86:41454 t=4.235557869s method=eth_blockNumber reqid=1 params=[]
[WARN] [10-09|14:15:35.469] [rpc] served conn=10.42.21.225:57392 method=trace_callMany reqid=1 t=479.222164ms err=stopped
Another occurrence
Initiated at :27, answered at :34.
[DBUG] [10-09|11:12:19.449] [bor.heimdall] Got new milestone start=62822670 end=62822690
[WARN] [10-09|11:12:20.204] [rpc] served conn=10.42.21.225:60922 method=eth_getBlockByNumber reqid=safe t=68.21µs err="Unknown block"
[INFO] [10-09|11:12:20.204] Served conn=10.42.21.225:60922 t=117.21µs method=eth_getBlockByNumber reqid=safe params="[\"safe\",false]"
[INFO] [10-09|11:12:20.205] Served conn=10.42.21.225:60922 t=521.749µs method=eth_getBlockByNumber reqid=finalized params="[\"finalized\",false]"
[INFO] [10-09|11:12:20.205] Served conn=10.42.21.225:60922 t=1.229418ms method=eth_getBlockByNumber reqid=latest params="[\"latest\",false]"
[WARN] [10-09|11:12:20.291] [rpc] served conn=10.42.12.108:54318 method=eth_getBlockByNumber reqid=safe t=48.77µs err="Unknown block"
[INFO] [10-09|11:12:20.291] Served conn=10.42.12.108:54318 t=86.75µs method=eth_getBlockByNumber reqid=safe params="[\"safe\",false]"
[INFO] [10-09|11:12:20.291] Served conn=10.42.12.108:54318 t=135.049µs method=eth_getBlockByNumber reqid=latest params="[\"latest\",false]"
[INFO] [10-09|11:12:20.292] Served conn=10.42.12.108:54318 t=537.749µs method=eth_getBlockByNumber reqid=finalized params="[\"finalized\",false]"
[INFO] [10-09|11:12:21.791] Served conn=10.42.21.225:54890 t=1.507088ms method=eth_getBlockByNumber reqid=2646 params="[\"0x3be993c\",true]"
[INFO] [10-09|11:12:26.914] Served conn=[::1]:41687 t=41.68µs method=net_version reqid=67 params=[]
[INFO] [10-09|11:12:26.978] Served conn=[::1]:37025 t=40.17µs method=net_version reqid=67 params=[]
[DBUG] [10-09|11:12:31.450] [bor.heimdall] Got new milestone start=62822670 end=62822690
[WARN] [10-09|11:12:34.795] [rpc] served conn=10.42.21.225:60922 method=eth_getBlockByNumber reqid=safe t=9.58801108s err="Unknown block"
[INFO] [10-09|11:12:34.795] Served conn=10.42.21.225:60922 t=9.58806791s method=eth_getBlockByNumber reqid=safe params="[\"safe\",false]"
[INFO] [10-09|11:12:34.795] Served conn=10.42.24.237:40436 t=9.453629739s method=eth_blockNumber reqid=1 params=[]
[INFO] [10-09|11:12:34.795] Served conn=10.42.24.232:56466 t=3.607566971s method=eth_blockNumber reqid=1 params=[]
[INFO] [10-09|11:12:34.796] Served conn=[::1]:35809 t=7.813439057s method=eth_getBlockByNumber reqid=1 params="[\"latest\", false]"
[INFO] [10-09|11:12:34.795] Served conn=10.42.21.225:54254 t=2.443219605s method=eth_blockNumber reqid=1 params=[]
try --db.read.concurrency=9000
we've increased it before to --db.read.concurrency=16384 but removed the flag as there were no significant results from that
I will set it to 9000 for reference and bring the results back here in 1-2 days
Then add --pprof:
go tool pprof -png "http://127.0.0.1:6060/debug/pprof/profile?seconds=60" > cpu.png
go tool pprof -alloc_objects -png "http://127.0.0.1:6060/debug/pprof/heap?seconds=20" > mem.png
FYI, still observing spikes with --db.read.concurrency=9000
Here is the pprof output
CPU
MEM
I see - it's probably happening when a new file gets created: in this case RPCDaemon now does a "ReOpen" of files, and it's not really necessary. We are also working on it in https://github.com/erigontech/erigon/issues/12299, and heap profiling is a bit optimized here: https://github.com/erigontech/erigon/pull/11710. So we will fix it in Erigon3 first, then decide whether it's easy to backport to Erigon2 or not.
Good to know it's spotted. Would you suggest any workarounds in the meantime until this gets resolved? Can we downgrade the nodes to a version before "ReOpen" was introduced?
I think it’s very old behavior. Do you see it after a recent upgrade?
@JkLondon hi.
- Can you try to port your KnownCfg optimisation to Erigon2?
We did a node upgrade to 2.60.6 on the 28th of August, so it doesn't look like it's related to bumping up the version.
I can't say if it's happening on versions before 2.60, but it's been happening for at least 3 months. It seems it started getting a bit more spiky in the second half of September. Not sure what these large spikes were, tbh, but even if we ignore them it averages 7-10 seconds for an eth_blockNumber call.
However, I've noticed that in one of the Discord threads someone was complaining in May that he saw a lot of high-latency calls when he bumped from 2.58.2 to 2.60.0 and that he had to roll back to resolve it, hence my question about downgrading.
Thank you for finding it. Seems that support request was not solved. We will work on it.
@JkLondon hi.
- Can you try to port your KnownCfg optimisation to Erigon2?
Have done this optimisation: https://github.com/erigontech/erigon/pull/12351
I will work on removing the need for file re-open (on E3 first, then will port to E2): https://github.com/erigontech/erigon/pull/12332
@AskAlexSharov Can we expect a release with the fix anytime soon?
2.60.9 release happened
will test if the issue persists and get back in here
Sorry guys, this is taking longer than anticipated to test since the underlying base image also changed, so we have to incorporate that. Will have some info on this by EOW.
Thank you
Hey guys
We've tried using 2.60.10 to reduce the latency across regions, and here is what we found out and what we are battling against currently:
- A set of our Ethereum nodes do see the spikes on eth_ calls reduced. These nodes have things in common: RPS usually averages under 100, and their chaindata is ~2.3TB on average.
- The other set of Ethereum nodes see no difference in spikes for eth_ calls. What these nodes have in common: RPS averages higher than 100, and their chaindata is 5.3TB+.
- None of the Polygon nodes see any difference in latency; be it higher or lower RPS, it stays the same.
This honestly points to blockchain size being connected to the problem, as Polygon also has big ledgers on our end, ~9.9TB.
We were thinking of resyncing the nodes to bring the latency down, but honestly it's not a small amount of work, and we want to check whether that would be the correct step forward.
Would appreciate ideas or suggestions
...and sorry for the hold up
- You can copy 2.3TB chaindata to the nodes which have 5.3TB chaindata. It's easier than a resync.
- I see you have --rpc.batch.limit=10000. Maybe you can try to increase/decrease --rpc.batch.concurrency?
- Are your metrics produced by Erigon, or do you measure them outside of Erigon?
- You can try running a separate rpcdaemon (on the same datadir): https://github.com/erigontech/erigon/tree/main/cmd/rpcdaemon#getting-started - it will reduce the impact of other components (txpool, chain-tip sync, etc.) on rpc (it will have its own limits, etc.). A minimal launch sketch follows below.
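For reference, a minimal sketch of launching a separate rpcdaemon against the same datadir. The flag names follow the rpcdaemon README linked above; the path and API list reuse the values from this issue and should be adjusted to your setup.
# Standalone rpcdaemon reading the node's datadir; serves HTTP and WS itself.
rpcdaemon \
  --datadir=/home/erigon/persistence/data \
  --http.addr=0.0.0.0 \
  --http.api=eth,erigon,web3,net,debug,txpool,trace \
  --ws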
Thanks for the suggestions and ideas, Alex.
Dedicating time to the first point currently. We've got no Polygon nodes with "normal" ledger sizes; all have grown to around 9TB of data. For Ethereum, I'll definitely copy over.
Metrics are produced by a proxy service. They just measure the time it took to get the answer from Erigon. The proxies are deployed in the same network as the Erigon nodes, so the network latency is minimal.
For --rpc.batch.concurrency:
We've got a metric that lets me filter by batch requests, e.g. check whether a request was part of a batch or not. The response time is the same even if the request is not in a batch.
For rpcdaemon:
Will take a look.
We've noticed improvements when replacing Ethereum ledgers; still not at its best compared to geth, but it's a lot better.
Wondering what causes these bloated ledgers? Is it debug calls? Does this mean that each time we get a ledger bigger than the average size, we have to download a new snapshot or replace it with a non-bloated one to improve latency?
There may be 2 reasons:
- A long read transaction (a debug_call alone is not long enough) doesn't allow free space to be re-used. For example a tx leak: https://github.com/erigontech/erigon/pull/8596
- Commits without fsync: we also faced this in TxPool (can't find the PR now), but not in chaindata.
To catch slow RPC calls: --rpc.slow=10m
To catch slow/leaked Ro (read-only) transactions: SLOW_TX=10m (a combined launch sketch follows below)
Regarding "download a new snapshot": the only way to compact the db is to copy data to a new db, for example by: integration mdbx_to_mdbx
Also can try:
- --sync.loop.prune.limit=10
Also can try Erigon3 - chaindata size there is 20GB. https://github.com/erigontech/erigon?tab=readme-ov-file#erigon3-changes-from-erigon2
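For reference, a sketch of combining the two diagnostics above with the existing launch command. Only the SLOW_TX environment variable and the --rpc.slow flag are added, both with the 10m threshold suggested above; everything else stays as in the original command.
# Log RPC calls and read-only DB transactions that run longer than 10 minutes.
SLOW_TX=10m erigon \
  --rpc.slow=10m \
  --chain=bor-mainnet \
  --datadir=/home/erigon/persistence/data
  # ...remaining flags unchanged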
Thanks for the insights, will bring this back to the team and discuss.
Small update: fresh Polygon DBs are not helping reduce the latency.
Hey @AskAlexSharov, a quick question: is Erigon3 production-ready?
- Erigon3 is more stable for Polygon chains than Erigon2, because it's way easier for the dev team to have/debug/fix problems of BorMainnet on Erigon3 (much easier to get to chain-tip or re-sync).
- There is a higher chance of getting performance-issue support on Erigon3.
- It supports all RPC methods. And P2P too.
- It may greatly increase/reduce latency of Receipt/Log-related RPC methods - depends on your use-case.
- Several NodeProviders confirmed that Erigon3 works for them.
- We also have several validators on Erigon3.
- We are on the way to a beta release; for us it means "releases will become more backward compatible; now some releases may advise/require a re-sync". But a re-sync is fast if you have enough network bandwidth.
- We enabled --internalcl by default. We have worked on externalcl support too, but it may have some issues: https://github.com/erigontech/erigon/issues/13321
Differences: https://github.com/erigontech/erigon?tab=readme-ov-file#erigon3-changes-from-erigon2 Book: https://docs.erigon.tech/
@AskAlexSharov, https://github.com/erigontech/erigon/issues/13320 should also be addressed and would be an easy fix; is there any progress on fixing the issue?
Let's make sure that the latency is good at least in E3.
Yes. Seems E3 doesn't have such a problem. 12h perf test (eth-mainnet):
thank you. closing for now.