neofs-node
Revise blockchain height check on startup
Inner Ring and Storage nodes check that the height of the underlying blockchain is greater than or equal to the latest encountered one, which is optionally persisted in the local storage (config and config respectively).
The app requests the current height via RPC, compares the result with the persisted one and fails if the local value is greater.
Which chain is stuck?
According to @aprasolova's experience, we can encounter the following error in the log:
RPC block counter 738108 didn't reach expected height 2272533
It is not visible from the message which chain - main or side - is stuck. It's proposed to reflect blockchain kind in this log message.
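A minimal sketch of what such a message could look like, assuming a hypothetical `chainKind` label and a hypothetical `checkHeight` helper (neither name is taken from the neofs-node code base; only the message text follows the log line above):

```go
package main

import "fmt"

// chainKind is a hypothetical label for the blockchain being checked.
type chainKind string

const (
	mainChain chainKind = "main"
	sideChain chainKind = "side"
)

// checkHeight returns an error that names the chain when the height
// reported via RPC is below the locally persisted one.
func checkHeight(kind chainKind, rpcHeight, persisted uint32) error {
	if rpcHeight < persisted {
		return fmt.Errorf("%s chain: RPC block counter %d didn't reach expected height %d",
			kind, rpcHeight, persisted)
	}
	return nil
}
```

With the numbers from the log above, `checkHeight(sideChain, 738108, 2272533)` would produce an error starting with `side chain:`, so it is clear which chain to look at.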
Await or not await
It's possible that the chain node is currently synchronizing its state and hasn't reached the up-to-date state yet. In this case the NeoFS node will immediately fail. Instead, it could wait within some context (global or with some sane deadline) and free the admin from periodically restarting the app.
btw, in the code the check function is called awaitHeight, which by its name implies a background wait, but in fact it does not wait.
Maybe there are other signs that will allow NeoFS to understand what exactly is happening at the moment and distinguish between freeze and synchronization, for example. If so, then we could improve behavior and admin UX. @AnnaShaleva @roman-khimov
Blockchain reset
If the chain was reset and the admin restarts the node, it will fail until the fresh chain reaches a height not less than the persisted one. In this case it's not obvious to the admin that the local state should be reset too. As a possible solution, we could also take into account the blockchain network magic, but it may also be left untouched after a reset.
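A rough sketch of the network magic idea, assuming the magic is persisted next to the height. `persistedState` and `checkState` are hypothetical names, and as noted above this only helps when the magic actually changes on reset:

```go
package main

import "fmt"

// persistedState is a hypothetical record stored locally:
// the network magic plus the latest encountered height.
type persistedState struct {
	Magic  uint32
	Height uint32
}

// checkState compares the persisted values with the ones reported via RPC.
// A magic mismatch is reported separately so the admin sees that
// the local state is stale rather than the chain being stuck.
func checkState(st persistedState, rpcMagic, rpcHeight uint32) error {
	if st.Magic != rpcMagic {
		return fmt.Errorf("network magic changed (%d -> %d): chain was probably reset, local state should be reset too",
			st.Magic, rpcMagic)
	}
	if rpcHeight < st.Height {
		return fmt.Errorf("RPC block counter %d didn't reach expected height %d",
			rpcHeight, st.Height)
	}
	return nil
}
```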
> btw in code check function is called awaitHeight which syntactically implies a background wait, but in fact does not wait.
There is some detail about it. It did wait in #798, but also stopped waiting in the same PR. So mb @532910 has some info about it (and the issue in general).
i also started to think about connection switch in multi-RPC setting. @carpawell ur an expert of this currently, pls explain how this reconn could affect our state sync
This block counter can't be perfect since local state can be dropped at any time. But it helps in some ways, so:
- specifying the network in the log is good
- waiting is OK as well, in some ways it's like a connection failure, no reason to fail completely
> distinguish between freeze and synchronization

No 100% reliable way to do that. But the StartWhenSynchronized RPC option helps somewhat, at least the node is supposed to be up to date when it starts serving RPC (so this problem shouldn't happen at all).
> if chain was reset
Just forget this for now.