
Reports of bursts of high disk R/Ws on `v0.12.4`

Open renaynay opened this issue 11 months ago • 11 comments

Seeing reports from bridge node (BN) runners of high disk R/Ws on v0.12.4, in bursts of ~12 hours.

renaynay avatar Feb 27 '24 15:02 renaynay

The logs below were taken at a moment of 85% CPU load and 97% I/O utilization.

Feb 27 19:37:46 qupra-7900x celestia[290693]: 2024-02-27T19:37:46.979Z INFO header/store store/store.go:375 new head {"height": 879415, "hash": "E31E753744D9EC1F8EDE8421E39B9D4CF309509E465E3ADEDD3D9054335A6F1C"}
Feb 27 19:37:58 qupra-7900x celestia[290693]: 2024-02-27T19:37:58.753Z INFO header/store store/store.go:375 new head {"height": 879416, "hash": "25B3D05E50BA5C1F96DDFF5DD7EE4BF7E21442088BF0BD729C0EE60BEC9CB136"}
Feb 27 19:38:09 qupra-7900x celestia[290693]: 2024-02-27T19:38:09.826Z INFO header/store store/store.go:375 new head {"height": 879417, "hash": "0F87FFC4A2E61BF0C6E81315412079B6BDC660FB094A643C19F214A9D623E6D3"}
Feb 27 19:38:21 qupra-7900x celestia[290693]: 2024-02-27T19:38:21.835Z INFO header/store store/store.go:375 new head {"height": 879418, "hash": "9B5687A2A1C94D1CF69EA46F29B44D72997B23806ECD05F484046683C328C7CC"}
Feb 27 19:38:33 qupra-7900x celestia[290693]: 2024-02-27T19:38:33.046Z INFO bitswap-server server/server.go:547 Bitswap Client ReceiveError: Application error 0x0 (remote)
Feb 27 19:38:33 qupra-7900x celestia[290693]: 2024-02-27T19:38:33.739Z INFO header/store store/store.go:375 new head {"height": 879419, "hash": "528D22D54B2B8D16BEA06BDFB5BD638AEB024CDC1DA796EE66374853C35E991A"}
Feb 27 19:38:45 qupra-7900x celestia[290693]: 2024-02-27T19:38:45.602Z WARN badger [email protected]/db.go:460 Block cache might be too small. Metrics: hit: 121107130 miss: 4241640680 keys-added: 285626328 keys-updated: 483835 keys-evicted: 285626311 cost-added: 2291943524773320 cost-evicted: 2291943262078825 sets-dropped: 31297 sets-rejected: 3810601313 gets-dropped: 111778176 gets-kept: 4250854976 gets-total: 4362747810 hit-ratio: 0.03
Feb 27 19:38:45 qupra-7900x celestia[290693]: 2024-02-27T19:38:45.602Z WARN badger [email protected]/db.go:461 Cache life expectancy (in seconds):
Feb 27 19:38:45 qupra-7900x celestia[290693]: -- Histogram:
Feb 27 19:38:45 qupra-7900x celestia[290693]: Min value: 0
Feb 27 19:38:45 qupra-7900x celestia[290693]: Max value: 150
Feb 27 19:38:45 qupra-7900x celestia[290693]: Count: 285624116
Feb 27 19:38:45 qupra-7900x celestia[290693]: 50p: 2.00
Feb 27 19:38:45 qupra-7900x celestia[290693]: 75p: 2.00
Feb 27 19:38:45 qupra-7900x celestia[290693]: 90p: 2.00
Feb 27 19:38:45 qupra-7900x celestia[290693]: [0, 2) 285196410 99.85% 99.85%
Feb 27 19:38:45 qupra-7900x celestia[290693]: [2, 4) 279923 0.10% 99.95%
Feb 27 19:38:45 qupra-7900x celestia[290693]: [4, 8) 105346 0.04% 99.99%
Feb 27 19:38:45 qupra-7900x celestia[290693]: [8, 16) 32620 0.01% 100.00%
Feb 27 19:38:45 qupra-7900x celestia[290693]: [16, 32) 8313 0.00% 100.00%
Feb 27 19:38:45 qupra-7900x celestia[290693]: [32, 64) 1380 0.00% 100.00%
Feb 27 19:38:45 qupra-7900x celestia[290693]: [64, 128) 118 0.00% 100.00%
Feb 27 19:38:45 qupra-7900x celestia[290693]: [128, 256) 6 0.00% 100.00%
Feb 27 19:38:45 qupra-7900x celestia[290693]: --
Feb 27 19:38:46 qupra-7900x celestia[290693]: 2024-02-27T19:38:46.064Z INFO header/store store/store.go:375 new head {"height": 879420, "hash": "407EB57A22E6260C4A3F0D2AA1E2AE0642C744B8500DDFCFEE9CD5403612C89D"}
Feb 27 19:38:52 qupra-7900x celestia[290693]: 2024-02-27T19:38:52.055Z INFO bitswap-server server/server.go:547 Bitswap Client ReceiveError: Application error 0x0 (remote)
Feb 27 19:38:57 qupra-7900x celestia[290693]: 2024-02-27T19:38:57.919Z INFO header/store store/store.go:375 new head {"height": 879421, "hash": "6400B3E90304777540288AB6362912756AB0F9ED06D160EA25A117CBCEFCBC64"}
Feb 27 19:39:09 qupra-7900x celestia[290693]: 2024-02-27T19:39:09.922Z INFO header/store store/store.go:375 new head {"height": 879422, "hash": "16CB14378210F6249F3F09E3B5C50E8AC5E3DBD3866C53D39EB70B48CB9349BA"}
Feb 27 19:39:21 qupra-7900x celestia[290693]: 2024-02-27T19:39:21.733Z INFO header/store store/store.go:375 new head {"height": 879423, "hash": "B666D118EDF9214952D86C31F1C2ECF82554B7629801D580C09E430DAB8D91E0"}
Feb 27 19:39:31 qupra-7900x celestia[290693]: 2024-02-27T19:39:31.197Z INFO bitswap-server server/server.go:547 Bitswap Client ReceiveError: received a stateless reset with token b9caffa2f9721c3b7362f27cc154618a
Feb 27 19:39:31 qupra-7900x celestia[290693]: 2024-02-27T19:39:31.693Z INFO bitswap-server server/server.go:547 Bitswap Client ReceiveError: stream reset
Feb 27 19:39:33 qupra-7900x celestia[290693]: 2024-02-27T19:39:33.743Z INFO header/store store/store.go:375 new head {"height": 879424, "hash": "0B17EB891D6B08E6D8CEC1282C254A766011CA157D2BDC78475EF1CB63675987"}
Feb 27 19:39:39 qupra-7900x celestia[290693]: 2024-02-27T19:39:39.642Z INFO bitswap-server server/server.go:547 Bitswap Client ReceiveError: Application error 0x0 (remote)
Feb 27 19:39:45 qupra-7900x celestia[290693]: 2024-02-27T19:39:45.602Z WARN badger [email protected]/db.go:460 Block cache might be too small. Metrics: hit: 121163085 miss: 4244080664 keys-added: 285820179 keys-updated: 483976 keys-evicted: 285820162 cost-added: 2293655197376295 cost-evicted: 2293654954690637 sets-dropped: 31297 sets-rejected: 3812847305 gets-dropped: 111778560 gets-kept: 4253350464 gets-total: 4365243749 hit-ratio: 0.03
Feb 27 19:39:45 qupra-7900x celestia[290693]: 2024-02-27T19:39:45.602Z WARN badger [email protected]/db.go:461 Cache life expectancy (in seconds):
Feb 27 19:39:45 qupra-7900x celestia[290693]: -- Histogram:
Feb 27 19:39:45 qupra-7900x celestia[290693]: Min value: 0
Feb 27 19:39:45 qupra-7900x celestia[290693]: Max value: 150
Feb 27 19:39:45 qupra-7900x celestia[290693]: Count: 285817967
Feb 27 19:39:45 qupra-7900x celestia[290693]: 50p: 2.00
Feb 27 19:39:45 qupra-7900x celestia[290693]: 75p: 2.00
Feb 27 19:39:45 qupra-7900x celestia[290693]: 90p: 2.00
Feb 27 19:39:45 qupra-7900x celestia[290693]: [0, 2) 285390261 99.85% 99.85%
Feb 27 19:39:45 qupra-7900x celestia[290693]: [2, 4) 279923 0.10% 99.95%
Feb 27 19:39:45 qupra-7900x celestia[290693]: [4, 8) 105346 0.04% 99.99%
Feb 27 19:39:45 qupra-7900x celestia[290693]: [8, 16) 32620 0.01% 100.00%
Feb 27 19:39:45 qupra-7900x celestia[290693]: [16, 32) 8313 0.00% 100.00%
Feb 27 19:39:45 qupra-7900x celestia[290693]: [32, 64) 1380 0.00% 100.00%
Feb 27 19:39:45 qupra-7900x celestia[290693]: [64, 128) 118 0.00% 100.00%
Feb 27 19:39:45 qupra-7900x celestia[290693]: [128, 256) 6 0.00% 100.00%
Feb 27 19:39:45 qupra-7900x celestia[290693]: --
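For anyone who wants to capture a comparable log window around one of these bursts, a minimal journalctl sketch; the systemd unit name celestia-bridge is only an assumption, substitute whatever your bridge node runs under:

# Pull bridge node logs for a specific burst window (unit name is an assumption).
journalctl -u celestia-bridge --since "2024-02-27 19:30:00" --until "2024-02-27 19:45:00" --no-pager > burst.log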

Suleymann85 avatar Feb 27 '24 19:02 Suleymann85

I think the above logs don't tell much; more verbose logs or metrics would be required. We are reporting our metrics with otel-collector; our node-id is below. If there are any requests for logs/screenshots/data, please tell us what you need.

I'll post some screenshots from our node-exporter below. The server is dedicated solely to the Celestia bridge node; no other services are running on it.

Specs: 7900X (12-core) and a fast NVMe drive
node-id: 12D3KooWRH2pkayUDqmCAXUpQSjX1pmmftPPNEaedEdUSEBRfLJH
OS: 22.04, kernel: 6.5.0-15-generic

Range: Last 2 Days [image]

Range: Last 2 Days [image]

ilhanu avatar Feb 28 '24 08:02 ilhanu

The first time we detected this issue was on 05 Jan; we were running celestia-node v0.12.1.

Our findings back then (note: we're using NVMe drives):

In our case, open files are not the issue. Yes, the node has ~7k open files, but the increase in CPU and disk load is what is extreme.
It all started at 9:30 UTC (11:30 in our Grafana, which uses GMT+2): CPU load went from an average of about 1 to about 12, a 12x increase.
Disk went from 542 read IOPS and 44 write IOPS to 17k (yes, thousands) read IOPS and 4.6k write IOPS.
Since 9:30 UTC we also noticed an increase in active TCP ESTABLISHED connections, which I correlated with active p2p connections.
I then blocked all incoming traffic towards the p2p port in the firewall and restarted the node. CPU/disk/network load went back to "low" values (also visible on the graphs: the baseline after this step is much lower than before).

[5 images: Grafana graphs]

It seems to us that some peer/peers are the reason.

We checked the logs from that time and found the following relevant errors:

2024-01-05T11:28:59+02:00	2024-01-05T11:28:59.488+0200	ERROR	header/p2p	p2p/server.go:157	server: writing header to stream	{"err": "i/o deadline reached"}
2024-01-05T11:29:00+02:00	2024-01-05T11:29:00.302+0200	INFO	bitswap-server	server/server.go:547	Bitswap Client ReceiveError: stream reset
2024-01-05T11:29:00+02:00	2024-01-05T11:29:00.613+0200	ERROR	header/p2p	p2p/server.go:157	server: writing header to stream	{"err": "stream reset"} 

A 4-month graph shows a clear improvement in resource usage on our bridge since we blocked incoming p2p traffic (red line = ufw deny).

[image]
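For reference, a minimal sketch of the kind of firewall rule described above, assuming ufw and the default p2p port 2121 that is mentioned later in this thread; check your own node config before applying anything like this:

# ufw's simple syntax applies to incoming traffic by default; 2121 tcp+udp is an assumption.
sudo ufw deny 2121/tcp
sudo ufw deny 2121/udp
sudo ufw status numbered
# Restart the bridge node afterwards so already-established connections are dropped.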

rares-e2s avatar Feb 28 '24 10:02 rares-e2s

We had a spike in our IO too, especially on the read side. Below are the bridge node logs during the high IO reads. We are using 2 NVMe drives in RAID 0.

IO graphs: [2 images]
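To confirm the reads are coming from the celestia process itself rather than something else on the host, a per-process I/O sketch using the sysstat tools; the process name celestia matches the journald lines in this thread:

# Per-process disk I/O for the bridge node, sampled every 5 seconds (requires sysstat).
pidstat -d -p $(pidof celestia) 5
# Whole-device view for comparison: extended stats at 5-second intervals.
iostat -x 5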

Seeing repeated entries like the ones below in the logs:

Feb 28 11:09:41 BsSui-mainRPC celestia[2497452]: 2024-02-28T11:09:41.019+0100 INFO header/store store/store.go:375 new head {"height": 883855, "hash": "762641C9AB7F543B835063483575DFE627FFC874B9D8E92555765776702E2676"}
Feb 28 11:09:53 BsSui-mainRPC celestia[2497452]: 2024-02-28T11:09:53.224+0100 INFO header/store store/store.go:375 new head {"height": 883856, "hash": "BF8BBEF57BC796672B48BF4B09790ABC3D346C3D857FAB00584DAC66712655FE"}
Feb 28 11:10:02 BsSui-mainRPC celestia[2497452]: 2024-02-28T11:10:02.274+0100 INFO bitswap-server server/server.go:547 Bitswap Client ReceiveError: Application error 0x0 (remote)
Feb 28 11:10:07 BsSui-mainRPC celestia[2497452]: 2024-02-28T11:10:07.044+0100 INFO header/store store/store.go:375 new head {"height": 883857, "hash": "2154691F4B77A2CBE1C44BBD90A3B83A554B4C38646E0EA08A7521EC27A2EE07"}
Feb 28 11:10:07 BsSui-mainRPC celestia[2497452]: 2024-02-28T11:10:07.452+0100 INFO bitswap-server server/server.go:547 Bitswap Client ReceiveError: Application error 0x0 (remote)
Feb 28 11:10:18 BsSui-mainRPC celestia[2497452]: 2024-02-28T11:10:18.538+0100 INFO bitswap-server server/server.go:547 Bitswap Client ReceiveError: stream reset
Feb 28 11:10:18 BsSui-mainRPC celestia[2497452]: 2024-02-28T11:10:18.798+0100 INFO header/store store/store.go:375 new head {"height": 883858, "hash": "7D70166FBD419C3A99192C44E6914657896802E14577C05CB34D8B756EEEE812"}
Feb 28 11:10:22 BsSui-mainRPC celestia[2497452]: 2024-02-28T11:10:22.874+0100 INFO bitswap-server server/server.go:547 Bitswap Client ReceiveError: stream reset
Feb 28 11:10:30 BsSui-mainRPC celestia[2497452]: 2024-02-28T11:10:30.805+0100 INFO header/store store/store.go:375 new head {"height": 883859, "hash": "3AC06314404321D2784A65E711714F3C8126EDD6D17C1C51DAD756472FF0451D"}
Feb 28 11:10:33 BsSui-mainRPC celestia[2497452]: 2024-02-28T11:10:33.626+0100 WARN badger v4@v4.0.0-20231125230536-2b9e13346f75/db.go:460 Block cache might be too small. Metrics: hit: 187054252 miss: 5161187805 keys-added: 544306772 keys-updated: 1442029 keys-evicted: 544306743 cost-added: 5447645463552796 cost-evicted: 5447645199982723 sets-dropped: 1408 sets-rejected: 4483552722 gets-dropped: 25108032 gets-kept: 5321904256 gets-total: 5348242073 hit-ratio: 0.03

TCP connections in the server over that period

[image]
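To put numbers on the TCP connection graph above, a quick way to count established connections touching the p2p port; port 2121 is assumed here, as mentioned in a later comment:

# Count established TCP connections where either side is the p2p port (2121 assumed).
ss -Htan state established '( sport = :2121 or dport = :2121 )' | wc -l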

Staking7pc avatar Feb 28 '24 22:02 Staking7pc

On v0.13.0 we're also seeing logs and resource usage similar to the above, mainly very high read usage.

[image]

Blocking p2p port 2121 didn't seem to help our bridge node. Logs resemble what has already been posted.

Aderks avatar Mar 26 '24 19:03 Aderks

This is unlikely to be a peering issue; it is more likely a compaction run triggered by indexing in badger. There is not much we can do about it at this point, as it's a fundamental flaw in the current storage mechanism. The solution is in an active implementation phase and will need a few months to land.

Meanwhile, if that's really an issue that is killing disks, we can explore pruning for the inverted index. cc @renaynay and @walldiss

Wondertan avatar Mar 26 '24 22:03 Wondertan

Progress on storage upgrades here: https://github.com/celestiaorg/celestia-node/issues/2971

MSevey avatar May 02 '24 13:05 MSevey

In our case, the problem manifests as high CPU load only.

[image]

A restart always solves the problem.

Tested on very high-performance bare metal (Intel + NVMe Gen4) and on VMware clusters with 8x32Gbit FC NVMe SAN storage in private data centers.

[image]

Note: The memory has been reduced from 128GB to 32GB (since it is not used)

[image]

Current: [image]

One last consideration: the current graphs refer to testnet, as the mainnet bridge node was recently shut down because we were not selected again for delegation (despite the details of our infrastructure, with 6/8 dedicated enterprise nodes).

Here are the chain details and the connections graph (no changes):

[image]

We are available to provide further details...

activenodes avatar May 06 '24 10:05 activenodes

We don't have such load on the Mocha bridge; yes, there are some spikes, but that's okay. With the mainnet bridge, however, the situation is terrible. [4 screenshots, 2024-05-07]

The logs do not say a lot; we just noticed that this Bitswap error appears much more frequently than on testnet.

INFO	bitswap-server	server/server.go:547	Bitswap Client ReceiveError: Application error 0x0 (remote)

mogoll92 avatar May 07 '24 09:05 mogoll92

I can confirm the problem on both chains, but as I indicated, I was forced to shut down the mainnet bridge node and no longer have its history.

Regarding the problem on testnet, I can say that a restart of the process fixes it every time.

These could be unrelated or connected issues. A possible approach would be to study the behavior on testnet first and then verify it on mainnet, where the problem is most intense.

activenodes avatar May 07 '24 09:05 activenodes