
[feature]: Design Blockbeat retry mechanism to guarantee block processing

Open ziggie1984 opened this issue 1 month ago • 1 comments

Currently, when one subsystem fails to process the blockbeat in time, we exit and don't thread the block through the other subsystems if their block processing depends on each other. We log an error but do not act on it, hoping that it resolves itself. However, I think we need to be more rigorous here and create a retry system that shuts the daemon down if it cannot process the block after x attempts, because this prevents hidden bugs where we are not able to process the beat in time.
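To make the idea concrete, here is a minimal sketch of such a bounded retry loop. It is not lnd's actual code: `processFunc`, `processWithRetry`, `ErrBlockNotProcessed`, and the fixed backoff are all hypothetical names and choices made for illustration, and the caller is assumed to decide what to do (for example, shut down) once the attempts are exhausted.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrBlockNotProcessed is a hypothetical sentinel error returned once all
// attempts have been used up.
var ErrBlockNotProcessed = errors.New("block not processed after max attempts")

// processFunc is a hypothetical stand-in for a subsystem's block handler.
type processFunc func(height int32) error

// processWithRetry calls process up to maxAttempts times, sleeping for
// backoff between failed attempts. It returns the number of attempts made
// and, if every attempt failed, an error wrapping ErrBlockNotProcessed.
func processWithRetry(height int32, maxAttempts int, backoff time.Duration,
	process processFunc) (int, error) {

	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		lastErr = process(height)
		if lastErr == nil {
			return attempt, nil
		}

		// Only wait if another attempt will follow.
		if attempt < maxAttempts {
			time.Sleep(backoff)
		}
	}

	return maxAttempts, fmt.Errorf("%w: height=%d: %v",
		ErrBlockNotProcessed, height, lastErr)
}

func main() {
	// A handler that fails twice before succeeding.
	calls := 0
	flaky := func(height int32) error {
		calls++
		if calls < 3 {
			return errors.New("timeout")
		}
		return nil
	}

	attempts, err := processWithRetry(100, 5, 10*time.Millisecond, flaky)
	fmt.Println("attempts:", attempts, "err:", err)
}
```

In this sketch the decision to stop the daemon is left to the caller, which can check for the sentinel with `errors.Is(err, ErrBlockNotProcessed)`.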

Detailed analysis and design proposal will come when I have a bit more time.

ziggie1984, Nov 21 '25 23:11

The blockbeat is designed to be stateless. However, it could be beneficial to keep state to help with reorg management, though that would be another biggish refactor.

> We log an error but we do not act on it but hope that it resolves itself. However I think we need to be more rigorous here and create a retrying system which after x attempt if not able to process the block shuts the daemon

That would mean a single channel can affect other channels, which can be dangerous if the daemon cannot start while there are in-flight HTLCs or unresolved force closes. We log an error because it is standard practice to monitor the error logs and intervene manually when running software. In addition, enabling a blockbeat consumer to consume the same block twice can be very difficult.

I think we should only shut down the daemon when the systems it depends on are broken, like the db or the chain backend, since that prevents lnd from progressing its state at all.
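As a rough illustration of that distinction, the sketch below classifies errors into fatal ones (a broken dependency) and non-fatal ones (a single channel failing). All names here (`ErrDatabase`, `ErrChainBackend`, `ErrChannelFailure`, `isFatal`, `handleBlockError`) are hypothetical and not part of lnd; the only assumption is that dependency failures are wrapped around known sentinel errors.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors. The first two stand for broken dependencies,
// the third for a failure scoped to a single channel.
var (
	ErrDatabase       = errors.New("database unavailable")
	ErrChainBackend   = errors.New("chain backend unavailable")
	ErrChannelFailure = errors.New("channel failed to process block")
)

// isFatal reports whether err wraps a dependency failure that stops the
// daemon from making any progress.
func isFatal(err error) bool {
	return errors.Is(err, ErrDatabase) || errors.Is(err, ErrChainBackend)
}

// handleBlockError returns the action to take for a block-processing error:
// "ok" for no error, "shutdown" for a fatal one, and "log" for the rest.
func handleBlockError(err error) string {
	switch {
	case err == nil:
		return "ok"
	case isFatal(err):
		return "shutdown"
	default:
		return "log"
	}
}

func main() {
	chanErr := fmt.Errorf("sweep htlc: %w", ErrChannelFailure)
	dbErr := fmt.Errorf("store block: %w", ErrDatabase)

	fmt.Println(handleBlockError(nil))
	fmt.Println(handleBlockError(chanErr))
	fmt.Println(handleBlockError(dbErr))
}
```

With this split, a failure in one channel is only logged, while a wrapped database or chain backend error is the only thing that triggers a shutdown.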

yyforyongyu, Nov 24 '25 01:11