nearcore
Generalized Workload Change
Describe the bug
Since the release of 1.28.0, more validators in the set seem to experience < 100% uptime on a daily basis.
On our node in particular, we have seen heavy disk read activity, which seems to peak during the periods when we drop blocks.
To Reproduce
Run a validator on mainnet.
Expected behavior
Broader 100% uptime for validators, as with 1.27.0.
Screenshots
Version (please complete the following information):
- 1.28.0
- mainnet
Additional context
It's anecdotal, but check here: https://near-overview.genesislab.net/
6 of the top 10 validators are < 100%. This wasn't regularly the case before 1.28.0.
Do you also see just blocks missed with all chunks being produced at 100%? We have not seen an increase in reads compared to 30 days before that.
No, the misses appear to have been both blocks and chunks in parallel.
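One way to tell the two cases apart per validator is to compare produced and expected counts for blocks and chunks separately. The following is only a rough sketch: the field names (`num_produced_blocks`, `num_expected_blocks`, `num_produced_chunks`, `num_expected_chunks`) are assumed to match what the validators RPC reports, and the sample records are made up for illustration, not taken from mainnet.

```python
def uptime_pct(produced: int, expected: int) -> float:
    """Percentage of expected items that were produced (100.0 if none were expected)."""
    if expected == 0:
        return 100.0
    return 100.0 * produced / expected


def classify_misses(validator: dict) -> str:
    """Say whether a validator missed blocks, chunks, both, or neither.

    The key names below are assumptions about the stats record layout.
    """
    blocks = uptime_pct(validator["num_produced_blocks"], validator["num_expected_blocks"])
    chunks = uptime_pct(validator["num_produced_chunks"], validator["num_expected_chunks"])
    if blocks < 100.0 and chunks < 100.0:
        return "blocks and chunks"
    if blocks < 100.0:
        return "blocks only"
    if chunks < 100.0:
        return "chunks only"
    return "none"


# Hypothetical records, for illustration only.
sample = [
    {"account_id": "a.example", "num_produced_blocks": 98, "num_expected_blocks": 100,
     "num_produced_chunks": 395, "num_expected_chunks": 400},
    {"account_id": "b.example", "num_produced_blocks": 99, "num_expected_blocks": 100,
     "num_produced_chunks": 400, "num_expected_chunks": 400},
    {"account_id": "c.example", "num_produced_blocks": 100, "num_expected_blocks": 100,
     "num_produced_chunks": 400, "num_expected_chunks": 400},
]

for v in sample:
    print(v["account_id"], classify_misses(v))
```

A validator reported as "blocks only" would match the pattern asked about above, while "blocks and chunks" matches what we observed.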
The read activity uptick seems to be continuing. Validator uptime across the set generally seems more stable than it was, but still less stable than before the current release. We still regularly see top 10 validators below 100%, which was rare before.
@matklad @mm-near @mzhangmzz could this be related to the client actor refactoring (applying chunks in different threads) that is introduced in 1.28.0?
No, that code isn't part of 1.28.0.
The code in question is here:
https://github.com/near/nearcore/blob/f7289f939584652d8c9c246898a1fa89bb61b92d/chain/chain/src/chain.rs#L1966-L1976
It is not in 1.28:
https://github.com/near/nearcore/blob/1.28.0/chain/chain/src/chain.rs