rust-lightning Make event handling fallible

Closes #2490.

Previously, we would require our users to handle all events successfully inline or panic will trying to do so. If they would exit the EventHandler any other way we'd forget about the event and wouldn't replay them after restart.

Here, we implement fallible event handling, allowing the user to return Err(()) which signals to our event providers they should abort event processing and replay any unhandled events later (i.e., in the next invocation).

TODO:

[ ] Add test coverage for replay behavior on Err(()).

Apr 15 '24 09:04 tnull

Previously, we would require our users to handle all events successfully inline or panic will trying to do so

I believe our recommendation was always to simply loop trying to handle the event until they succeed, which is basically what we're doing here for them :)

As to the code here, I think we should make more clear in the interface the event will be replayed, eg by making the error variant a unit struct called ReplayEvent or so. Further, I think we should set the wakeup flag immediately on any failed event-handle to force the BP to go around its loop again without any sleeping. Otherwise concept lgtm.

Apr 15 '24 15:04 TheBlueMatt

Codecov Report

Attention: Patch coverage is 63.91753% with 70 lines in your changes missing coverage. Please review.

Project coverage is 89.75%. Comparing base (0cfe55c) to head (e617a39).

Files	Patch %	Lines
lightning/src/onion_message/messenger.rs	58.06%	21 Missing and 5 partials :warning:
lightning/src/util/async_poll.rs	34.61%	14 Missing and 3 partials :warning:
lightning-background-processor/src/lib.rs	80.48%	5 Missing and 11 partials :warning:
lightning/src/chain/chainmonitor.rs	46.15%	6 Missing and 1 partial :warning:
lightning/src/chain/channelmonitor.rs	40.00%	3 Missing :warning:
lightning/src/events/mod.rs	50.00%	1 Missing :warning:

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2995      +/-   ##
==========================================
- Coverage   89.79%   89.75%   -0.05%     
==========================================
  Files         121      121              
  Lines      100826   100916      +90     
  Branches   100826   100916      +90     
==========================================
+ Hits        90537    90576      +39     
- Misses       7614     7658      +44     
- Partials     2675     2682       +7

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

May 27 '24 10:05 codecov-commenter

Rebased and included a fixup to introduce ReplayEvent unit struct error variant. But still ironing out some details regarding the approach.

Further, I think we should set the wakeup flag immediately on any failed event-handle to force the BP to go around its loop again without any sleeping.

Could you clarify which flag you are referring to exactly? Do you mean just continueing the loop in case any of the event processors fails? If we do that, we should probably make sure it only happens once (and then maaaybe with an exponential back-off?) as otherwise the entire run loop might busy-wait if event handling keeps failing, e.g., due to a persistence failure?

Jun 03 '24 12:06 tnull

Could you clarify which flag you are referring to exactly?

The event_persist_notifier flag.

as otherwise the entire run loop might busy-wait

I think that's okay if the BP loop busy-waits? If we're blocked waiting for something to complete a busy-wait is a perfectly reasonable way to signal to the user that something is horribly wrong (maybe they'll file a bug report asking why their phone is getting hot) :)

Jun 03 '24 14:06 TheBlueMatt

The event_persist_notifier flag.

I'm confused: wouldn't this only work for ChannelManager's event handler, not the others, i.e., ChainMonitor, OnionMessenger?

I think that's okay if the BP loop busy-waits? If we're blocked waiting for something to complete a busy-wait is a perfectly reasonable way to signal to the user that something is horribly wrong (maybe they'll file a bug report asking why their phone is getting hot) :)

Hmm, I tend to disagree? That might be okay in blocking land where event handling would run on its own thread, but in async land this might steal a full working thread from the runtime, possibly leading to locking up the node entirely? So I'd prefer not to busy-wait without ever yielding in an async context?

Jun 03 '24 15:06 tnull

I'm confused: wouldn't this only work for ChannelManager's event handler, not the others, i.e., ChainMonitor, OnionMessenger?

Sure, they should all do a similar thing.

Hmm, I tend to disagree? That might be okay in blocking land where event handling would run on its own thread, but in async land this might steal a full working thread from the runtime, possibly leading to locking up the node entirely? So I'd prefer not to busy-wait without ever yielding in an async context?

Hmm, as long as the user has an async...anything handling the event that's failing we should yield at least once during the BP loop. We could add an explicit yield (in the form of a ~1ms sleep), I guess?

Jun 03 '24 15:06 TheBlueMatt

Rebased to resolve conflicts after #3060 landed, have yet to adjust MultiFuturePoller though.

Jun 05 '24 09:06 tnull

Now added logic to retain failed events in the OnionMessenger event queues to have them replayed upon next invocation. To do this I adjusted MultiFuturePoller to collect and return the actual event handling results.

Two observations:

It's a bit unfortunate that this requires us to clone the queues as we can only remove the events from the queues after they have been successfully processed (but we have the same issue in CM/CM).
While we replay events, we of course still won't persist the OnionMessenger events in any way. While I understand this is a deliberate design choice for DoS protection, it seems also like an API contract violation as events might get lost if the node restarts/crashes between event generation and the user handling it.

The event_persist_notifier flag.

Coming back to this: Upon further inspection it seems we set result = NotifyOption::DoPersist in process_events_body whenever we have pending events anyways? So the event_persist_notifier should already get triggered?

Generally this is still missing some test coverage, but should be ready for another round of review apart from that.

Jul 02 '24 13:07 tnull

Coming back to this: Upon further inspection it seems we set result = NotifyOption::DoPersist in process_events_body whenever we have pending events anyways? So the event_persist_notifier should already get triggered?

Ah, indeed, you're right. We should, however, add something similar in chainmonitor.

Jul 02 '24 20:07 TheBlueMatt

Ah, indeed, you're right. We should, however, add something similar in chainmonitor.

Agreed. Now added a fixup that has ChainMonitor trigger its event_notifier when any of the ChannelMonitor::process_pending_events calls fails.

Jul 03 '24 08:07 tnull

Added a simple BackgroundProcessor test to check events are being replayed and also added a commit documenting the failure-to-handle/persistence behavior of all event variants (i.e., addressed #2097). I now also dropped the serialization logic for OnionMessageIntercepted and OnionMessagePeerConnected, as we don't ever write these events.

Jul 08 '24 13:07 tnull

Addressed the feedback.

Jul 09 '24 07:07 tnull

Rebased on main to resolve minor conflicts after #3138 landed.

Let me know if I can squash the fixups.

Jul 15 '24 07:07 tnull

LGTM. Feel free to squash, IMO.

Squashed without further changes.

Jul 17 '24 17:07 tnull

The Make event handling fallible commit itself doesn't build, so check_commits is failing.

Jul 17 '24 18:07 TheBlueMatt

Squashed fixups without further changes:

> git diff-tree -U2  258853aed e617a394e
>

Jul 18 '24 13:07 tnull