celestia-node
Reports of bursts of high disk R/Ws on `v0.12.4`
Seeing reports from bridge node (BN) runners of high disk R/Ws on v0.12.4, in bursts of ~12 hours.
The logs below were taken at a moment of 85% CPU load and 97% I/O utilization.
Feb 27 19:37:46 qupra-7900x celestia[290693]: 2024-02-27T19:37:46.979Z INFO header/store store/store.go:375 new head {"height": 879415, "hash": "E31E753744D9EC1F8EDE8421E39B9D4CF309509E465E3ADEDD3D9054335A6F1C"}
Feb 27 19:37:58 qupra-7900x celestia[290693]: 2024-02-27T19:37:58.753Z INFO header/store store/store.go:375 new head {"height": 879416, "hash": "25B3D05E50BA5C1F96DDFF5DD7EE4BF7E21442088BF0BD729C0EE60BEC9CB136"}
Feb 27 19:38:09 qupra-7900x celestia[290693]: 2024-02-27T19:38:09.826Z INFO header/store store/store.go:375 new head {"height": 879417, "hash": "0F87FFC4A2E61BF0C6E81315412079B6BDC660FB094A643C19F214A9D623E6D3"}
Feb 27 19:38:21 qupra-7900x celestia[290693]: 2024-02-27T19:38:21.835Z INFO header/store store/store.go:375 new head {"height": 879418, "hash": "9B5687A2A1C94D1CF69EA46F29B44D72997B23806ECD05F484046683C328C7CC"}
Feb 27 19:38:33 qupra-7900x celestia[290693]: 2024-02-27T19:38:33.046Z INFO bitswap-server server/server.go:547 Bitswap Client ReceiveError: Application error 0x0 (remote)
Feb 27 19:38:33 qupra-7900x celestia[290693]: 2024-02-27T19:38:33.739Z INFO header/store store/store.go:375 new head {"height": 879419, "hash": "528D22D54B2B8D16BEA06BDFB5BD638AEB024CDC1DA796EE66374853C35E991A"}
Feb 27 19:38:45 qupra-7900x celestia[290693]: 2024-02-27T19:38:45.602Z WARN badger v4@v4.0.0-20231125230536-2b9e13346f75/db.go:460 Block cache might be too small. Metrics: hit: 121107130 miss: 4241640680 keys-added: 285626328 keys-updated: 483835 keys-evicted: 285626311 cost-added: 2291943524773320 cost-evicted: 2291943262078825 sets-dropped: 31297 sets-rejected: 3810601313 gets-dropped: 111778176 gets-kept: 4250854976 gets-total: 4362747810 hit-ratio: 0.03
Feb 27 19:38:45 qupra-7900x celestia[290693]: 2024-02-27T19:38:45.602Z WARN badger v4@v4.0.0-20231125230536-2b9e13346f75/db.go:461 Cache life expectancy (in seconds):
Feb 27 19:38:45 qupra-7900x celestia[290693]: -- Histogram:
Feb 27 19:38:45 qupra-7900x celestia[290693]: Min value: 0
Feb 27 19:38:45 qupra-7900x celestia[290693]: Max value: 150
Feb 27 19:38:45 qupra-7900x celestia[290693]: Count: 285624116
Feb 27 19:38:45 qupra-7900x celestia[290693]: 50p: 2.00
Feb 27 19:38:45 qupra-7900x celestia[290693]: 75p: 2.00
Feb 27 19:38:45 qupra-7900x celestia[290693]: 90p: 2.00
Feb 27 19:38:45 qupra-7900x celestia[290693]: [0, 2) 285196410 99.85% 99.85%
Feb 27 19:38:45 qupra-7900x celestia[290693]: [2, 4) 279923 0.10% 99.95%
Feb 27 19:38:45 qupra-7900x celestia[290693]: [4, 8) 105346 0.04% 99.99%
Feb 27 19:38:45 qupra-7900x celestia[290693]: [8, 16) 32620 0.01% 100.00%
Feb 27 19:38:45 qupra-7900x celestia[290693]: [16, 32) 8313 0.00% 100.00%
Feb 27 19:38:45 qupra-7900x celestia[290693]: [32, 64) 1380 0.00% 100.00%
Feb 27 19:38:45 qupra-7900x celestia[290693]: [64, 128) 118 0.00% 100.00%
Feb 27 19:38:45 qupra-7900x celestia[290693]: [128, 256) 6 0.00% 100.00%
Feb 27 19:38:45 qupra-7900x celestia[290693]: --
Feb 27 19:38:46 qupra-7900x celestia[290693]: 2024-02-27T19:38:46.064Z INFO header/store store/store.go:375 new head {"height": 879420, "hash": "407EB57A22E6260C4A3F0D2AA1E2AE0642C744B8500DDFCFEE9CD5403612C89D"}
Feb 27 19:38:52 qupra-7900x celestia[290693]: 2024-02-27T19:38:52.055Z INFO bitswap-server server/server.go:547 Bitswap Client ReceiveError: Application error 0x0 (remote)
Feb 27 19:38:57 qupra-7900x celestia[290693]: 2024-02-27T19:38:57.919Z INFO header/store store/store.go:375 new head {"height": 879421, "hash": "6400B3E90304777540288AB6362912756AB0F9ED06D160EA25A117CBCEFCBC64"}
Feb 27 19:39:09 qupra-7900x celestia[290693]: 2024-02-27T19:39:09.922Z INFO header/store store/store.go:375 new head {"height": 879422, "hash": "16CB14378210F6249F3F09E3B5C50E8AC5E3DBD3866C53D39EB70B48CB9349BA"}
Feb 27 19:39:21 qupra-7900x celestia[290693]: 2024-02-27T19:39:21.733Z INFO header/store store/store.go:375 new head {"height": 879423, "hash": "B666D118EDF9214952D86C31F1C2ECF82554B7629801D580C09E430DAB8D91E0"}
Feb 27 19:39:31 qupra-7900x celestia[290693]: 2024-02-27T19:39:31.197Z INFO bitswap-server server/server.go:547 Bitswap Client ReceiveError: received a stateless reset with token b9caffa2f9721c3b7362f27cc154618a
Feb 27 19:39:31 qupra-7900x celestia[290693]: 2024-02-27T19:39:31.693Z INFO bitswap-server server/server.go:547 Bitswap Client ReceiveError: stream reset
Feb 27 19:39:33 qupra-7900x celestia[290693]: 2024-02-27T19:39:33.743Z INFO header/store store/store.go:375 new head {"height": 879424, "hash": "0B17EB891D6B08E6D8CEC1282C254A766011CA157D2BDC78475EF1CB63675987"}
Feb 27 19:39:39 qupra-7900x celestia[290693]: 2024-02-27T19:39:39.642Z INFO bitswap-server server/server.go:547 Bitswap Client ReceiveError: Application error 0x0 (remote)
Feb 27 19:39:45 qupra-7900x celestia[290693]: 2024-02-27T19:39:45.602Z WARN badger v4@v4.0.0-20231125230536-2b9e13346f75/db.go:460 Block cache might be too small. Metrics: hit: 121163085 miss: 4244080664 keys-added: 285820179 keys-updated: 483976 keys-evicted: 285820162 cost-added: 2293655197376295 cost-evicted: 2293654954690637 sets-dropped: 31297 sets-rejected: 3812847305 gets-dropped: 111778560 gets-kept: 4253350464 gets-total: 4365243749 hit-ratio: 0.03
Feb 27 19:39:45 qupra-7900x celestia[290693]: 2024-02-27T19:39:45.602Z WARN badger v4@v4.0.0-20231125230536-2b9e13346f75/db.go:461 Cache life expectancy (in seconds):
Feb 27 19:39:45 qupra-7900x celestia[290693]: -- Histogram:
Feb 27 19:39:45 qupra-7900x celestia[290693]: Min value: 0
Feb 27 19:39:45 qupra-7900x celestia[290693]: Max value: 150
Feb 27 19:39:45 qupra-7900x celestia[290693]: Count: 285817967
Feb 27 19:39:45 qupra-7900x celestia[290693]: 50p: 2.00
Feb 27 19:39:45 qupra-7900x celestia[290693]: 75p: 2.00
Feb 27 19:39:45 qupra-7900x celestia[290693]: 90p: 2.00
Feb 27 19:39:45 qupra-7900x celestia[290693]: [0, 2) 285390261 99.85% 99.85%
Feb 27 19:39:45 qupra-7900x celestia[290693]: [2, 4) 279923 0.10% 99.95%
Feb 27 19:39:45 qupra-7900x celestia[290693]: [4, 8) 105346 0.04% 99.99%
Feb 27 19:39:45 qupra-7900x celestia[290693]: [8, 16) 32620 0.01% 100.00%
Feb 27 19:39:45 qupra-7900x celestia[290693]: [16, 32) 8313 0.00% 100.00%
Feb 27 19:39:45 qupra-7900x celestia[290693]: [32, 64) 1380 0.00% 100.00%
Feb 27 19:39:45 qupra-7900x celestia[290693]: [64, 128) 118 0.00% 100.00%
Feb 27 19:39:45 qupra-7900x celestia[290693]: [128, 256) 6 0.00% 100.00%
Feb 27 19:39:45 qupra-7900x celestia[290693]: --
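A side note on the badger warning above: the hit-ratio it prints is simply hits divided by total lookups (hit + miss). A quick check with the figures hard-coded from the 19:38:45 line (a throwaway shell calculation, nothing celestia-specific):

# hit-ratio = hit / (hit + miss); numbers copied from the 19:38:45 warning above
awk 'BEGIN { hit = 121107130; miss = 4241640680; printf "hit ratio: %.4f\n", hit/(hit+miss) }'
# -> hit ratio: 0.0278  (and hit + miss = 4362747810, which matches gets-total)

So roughly 97% of block-cache lookups are misses. Badger itself sizes this cache via its Options (WithBlockCacheSize); whether celestia-node exposes that setting is not something these logs tell us.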
I think the above logs don't tell much; more verbose logs or metrics would be required. We are pushing our metrics via otel-collector; our node ID is below. If there are any requests for logs, screenshots, or data, please tell us what you need.
I'll post some screenshots from our node-exporter below. The server is dedicated solely to the Celestia bridge; no other services are running on it.
Specs: 7900X (12-core) and a fast NVMe drive
node-id: 12D3KooWRH2pkayUDqmCAXUpQSjX1pmmftPPNEaedEdUSEBRfLJH
OS: 22.04, kernel: 6.5.0-15-generic
Node-exporter graphs (Range: Last 2 Days)
We first detected this issue on 05 Jan, when we were using celestia-node v0.12.1.
Our findings back then (note: we're using NVMe drives):
In our case, open files are not the issue. Yes, the node has ~7k open files, but the CPU and load increase are insane. It all started at 9:30 UTC (11:30 in our Grafana, which uses GMT+2): CPU load increased from an average of 1 to 12, a 12x increase. Disk went from 542 read IOPS and 44 write IOPS to 17k (yes, thousands) read IOPS and 4.6k write IOPS. Since 9:30 UTC we also noticed an increase in active TCP ESTABLISHED connections, which I correlated with active P2P connections. I then blocked all incoming traffic towards the P2P port in the firewall and restarted the node. CPU/disk/network load is back to "low" values (also visible in the graphs: after this step the baseline is much lower than before).
It seems to us that some peer or peers are the cause.
We checked the logs from that time and found the following relevant errors:
2024-01-05T11:28:59+02:00 2024-01-05T11:28:59.488+0200 ERROR header/p2p p2p/server.go:157 server: writing header to stream {"err": "i/o deadline reached"}
2024-01-05T11:29:00+02:00 2024-01-05T11:29:00.302+0200 INFO bitswap-server server/server.go:547 Bitswap Client ReceiveError: stream reset
2024-01-05T11:29:00+02:00 2024-01-05T11:29:00.613+0200 ERROR header/p2p p2p/server.go:157 server: writing header to stream {"err": "stream reset"}
A 4-month graph shows a clear improvement in resource usage on our bridge since we blocked incoming p2p traffic (red line = ufw deny; example commands below).
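For reference, here is roughly what the "block the P2P port" step looked like. This is a sketch that assumes the default p2p port 2121 (mentioned further down in this thread) and ufw as the firewall; adjust to your own port and tooling:

# count ESTABLISHED TCP connections involving the p2p port (what we correlated with active peers)
ss -tn state established '( sport = :2121 or dport = :2121 )' | tail -n +2 | wc -l
# deny all incoming traffic to the p2p port (TCP, plus UDP if your node also listens on QUIC), then restart the node
sudo ufw deny 2121/tcp
sudo ufw deny 2121/udp

The deny rule only blocks inbound dials; the node can still dial out to peers, which matches the graphs above where the bridge keeps running with a much lower baseline.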
We had a spike in our IO too, especially on the read side. Below are the bridge node logs during the high IO reads. We are using 2 NVMe drives in RAID 0.
IO graphs
Seeing repeated entries like the ones below in the logs:
Feb 28 11:09:41 BsSui-mainRPC celestia[2497452]: 2024-02-28T11:09:41.019+0100 INFO header/store store/store.go:375 new head {"height": 883855, "hash": "762641C9AB7F543B835063483575DFE627FFC874B9D8E92555765776702E2676"}
Feb 28 11:09:53 BsSui-mainRPC celestia[2497452]: 2024-02-28T11:09:53.224+0100 INFO header/store store/store.go:375 new head {"height": 883856, "hash": "BF8BBEF57BC796672B48BF4B09790ABC3D346C3D857FAB00584DAC66712655FE"}
Feb 28 11:10:02 BsSui-mainRPC celestia[2497452]: 2024-02-28T11:10:02.274+0100 INFO bitswap-server server/server.go:547 Bitswap Client ReceiveError: Application error 0x0 (remote)
Feb 28 11:10:07 BsSui-mainRPC celestia[2497452]: 2024-02-28T11:10:07.044+0100 INFO header/store store/store.go:375 new head {"height": 883857, "hash": "2154691F4B77A2CBE1C44BBD90A3B83A554B4C38646E0EA08A7521EC27A2EE07"}
Feb 28 11:10:07 BsSui-mainRPC celestia[2497452]: 2024-02-28T11:10:07.452+0100 INFO bitswap-server server/server.go:547 Bitswap Client ReceiveError: Application error 0x0 (remote)
Feb 28 11:10:18 BsSui-mainRPC celestia[2497452]: 2024-02-28T11:10:18.538+0100 INFO bitswap-server server/server.go:547 Bitswap Client ReceiveError: stream reset
Feb 28 11:10:18 BsSui-mainRPC celestia[2497452]: 2024-02-28T11:10:18.798+0100 INFO header/store store/store.go:375 new head {"height": 883858, "hash": "7D70166FBD419C3A99192C44E6914657896802E14577C05CB34D8B756EEEE812"}
Feb 28 11:10:22 BsSui-mainRPC celestia[2497452]: 2024-02-28T11:10:22.874+0100 INFO bitswap-server server/server.go:547 Bitswap Client ReceiveError: stream reset
Feb 28 11:10:30 BsSui-mainRPC celestia[2497452]: 2024-02-28T11:10:30.805+0100 INFO header/store store/store.go:375 new head {"height": 883859, "hash": "3AC06314404321D2784A65E711714F3C8126EDD6D17C1C51DAD756472FF0451D"}
Feb 28 11:10:33 BsSui-mainRPC celestia[2497452]: 2024-02-28T11:10:33.626+0100 WARN badger v4@v4.0.0-20231125230536-2b9e13346f75/db.go:460 Block cache might be too small. Metrics: hit: 187054252 miss: 5161187805 keys-added: 544306772 keys-updated: 1442029 keys-evicted: 544306743 cost-added: 5447645463552796 cost-evicted: 5447645199982723 sets-dropped: 1408 sets-rejected: 4483552722 gets-dropped: 25108032 gets-kept: 5321904256 gets-total: 5348242073 hit-ratio: 0.03
TCP connections in the server over that period
On v0.13.0 we're also seeing logs and resource usage similar to the above, mainly very high read usage.
Blocking p2p port 2121 didn't seem to help our bridge node. The logs resemble what has already been posted.
This is unlikely to be a peering issue; it's a compaction run triggered by indexing in badger. There is not much we can do about it at this point, as it's a fundamental flaw in the current storage mechanism. A solution exists and is actively being implemented, but it will need a few months to land.
Meanwhile, if this is really an issue that kills disks, we can explore pruning for the inverted index. cc @renaynay and @walldiss
Progress on storage upgrades here: https://github.com/celestiaorg/celestia-node/issues/2971
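If it helps others confirm the compaction theory on their own nodes: a rough way to check whether the read bursts come from badger itself rather than from peers is to watch the node's per-process disk I/O and the churn of badger's table files during a burst. A sketch only: the ~/.celestia-bridge path is an assumed default store location (point it at whatever --node.store you use), and pidstat requires the sysstat package.

# per-process disk read/write rates of the bridge node, sampled every 5s
pidstat -d -p "$(pidof celestia)" 5
# number of badger table files (*.sst) in the store; rapid churn while reads spike
# points at compaction rather than at peer traffic
watch -n 10 'find ~/.celestia-bridge -name "*.sst" | wc -l'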
In our case the problem affects only the CPU load.
A restart always solves the problem.
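For completeness, the restart is nothing special. This assumes the bridge runs under systemd, as the journald-style logs above suggest; the unit name celestia-bridge is a placeholder, use whatever your unit is actually called:

sudo systemctl restart celestia-bridge
# or, if you run it in the foreground, stop and re-run `celestia bridge start ...`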
Tested on very high-performance bare metal (Intel + NVMe Gen4) and on VMware clusters with 8x32Gbit FC NVMe SAN storage in private data centers.
Note: memory has been reduced from 128GB to 32GB (since it is not used).
Current:
One last consideration: the current graphs refer to the testnet, as the mainnet bridge node was recently shut down because we were not selected again for the delegation (despite the details of our infrastructure, with 6/8 dedicated enterprise nodes).
Here are the chain detail and connections graphs (without changes):
We are available to provide further details...
We don't have such load on the Mocha bridge; yes, there are some spikes, but that's okay. With the mainnet bridge, however, the situation is terrible.
The logs do not say a lot; we just noticed this Bitswap error much more frequently than on testnet:
INFO bitswap-server server/server.go:547 Bitswap Client ReceiveError: Application error 0x0 (remote)
I can confirm the problem on both chains, but as I indicated I was forced to shut down the mainnet bridge node and I no longer have the history.
Regarding the problem on testnet, I can say that a restart of the process fixes it, every time.
These could be unrelated or connected issues; possibly we could study it on the testnet setup first and then verify on mainnet, where the problem is most intense.