Make event handling fallible
Closes #2490.
Previously, we would require our users to handle all events successfully inline or panic will trying to do so. If they would exit the EventHandler any other way we'd forget about the event and wouldn't replay them after restart.
Here, we implement fallible event handling, allowing the user to return Err(()) which signals to our event providers they should abort event processing and replay any unhandled events later (i.e., in the next invocation).
TODO:
- [ ] Add test coverage for replay behavior on
Err(()).
Previously, we would require our users to handle all events successfully inline or panic will trying to do so
I believe our recommendation was always to simply loop trying to handle the event until they succeed, which is basically what we're doing here for them :)
As to the code here, I think we should make more clear in the interface the event will be replayed, eg by making the error variant a unit struct called ReplayEvent or so. Further, I think we should set the wakeup flag immediately on any failed event-handle to force the BP to go around its loop again without any sleeping. Otherwise concept lgtm.
Codecov Report
Attention: Patch coverage is 63.91753% with 70 lines in your changes missing coverage. Please review.
Project coverage is 89.75%. Comparing base (
0cfe55c) to head (e617a39).
Additional details and impacted files
@@ Coverage Diff @@
## main #2995 +/- ##
==========================================
- Coverage 89.79% 89.75% -0.05%
==========================================
Files 121 121
Lines 100826 100916 +90
Branches 100826 100916 +90
==========================================
+ Hits 90537 90576 +39
- Misses 7614 7658 +44
- Partials 2675 2682 +7
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Rebased and included a fixup to introduce ReplayEvent unit struct error variant. But still ironing out some details regarding the approach.
Further, I think we should set the wakeup flag immediately on any failed event-handle to force the BP to go around its loop again without any sleeping.
Could you clarify which flag you are referring to exactly? Do you mean just continueing the loop in case any of the event processors fails? If we do that, we should probably make sure it only happens once (and then maaaybe with an exponential back-off?) as otherwise the entire run loop might busy-wait if event handling keeps failing, e.g., due to a persistence failure?
Could you clarify which flag you are referring to exactly?
The event_persist_notifier flag.
as otherwise the entire run loop might busy-wait
I think that's okay if the BP loop busy-waits? If we're blocked waiting for something to complete a busy-wait is a perfectly reasonable way to signal to the user that something is horribly wrong (maybe they'll file a bug report asking why their phone is getting hot) :)
The event_persist_notifier flag.
I'm confused: wouldn't this only work for ChannelManager's event handler, not the others, i.e., ChainMonitor, OnionMessenger?
I think that's okay if the BP loop busy-waits? If we're blocked waiting for something to complete a busy-wait is a perfectly reasonable way to signal to the user that something is horribly wrong (maybe they'll file a bug report asking why their phone is getting hot) :)
Hmm, I tend to disagree? That might be okay in blocking land where event handling would run on its own thread, but in async land this might steal a full working thread from the runtime, possibly leading to locking up the node entirely? So I'd prefer not to busy-wait without ever yielding in an async context?
I'm confused: wouldn't this only work for ChannelManager's event handler, not the others, i.e., ChainMonitor, OnionMessenger?
Sure, they should all do a similar thing.
Hmm, I tend to disagree? That might be okay in blocking land where event handling would run on its own thread, but in async land this might steal a full working thread from the runtime, possibly leading to locking up the node entirely? So I'd prefer not to busy-wait without ever yielding in an async context?
Hmm, as long as the user has an async...anything handling the event that's failing we should yield at least once during the BP loop. We could add an explicit yield (in the form of a ~1ms sleep), I guess?
Rebased to resolve conflicts after #3060 landed, have yet to adjust MultiFuturePoller though.
Now added logic to retain failed events in the OnionMessenger event queues to have them replayed upon next invocation. To do this I adjusted MultiFuturePoller to collect and return the actual event handling results.
Two observations:
- It's a bit unfortunate that this requires us to clone the queues as we can only remove the events from the queues after they have been successfully processed (but we have the same issue in
CM/CM). - While we replay events, we of course still won't persist the
OnionMessengerevents in any way. While I understand this is a deliberate design choice for DoS protection, it seems also like an API contract violation as events might get lost if the node restarts/crashes between event generation and the user handling it.
The
event_persist_notifierflag.
Coming back to this: Upon further inspection it seems we set result = NotifyOption::DoPersist in process_events_body whenever we have pending events anyways? So the event_persist_notifier should already get triggered?
Generally this is still missing some test coverage, but should be ready for another round of review apart from that.
Coming back to this: Upon further inspection it seems we set result = NotifyOption::DoPersist in process_events_body whenever we have pending events anyways? So the event_persist_notifier should already get triggered?
Ah, indeed, you're right. We should, however, add something similar in chainmonitor.
Ah, indeed, you're right. We should, however, add something similar in
chainmonitor.
Agreed. Now added a fixup that has ChainMonitor trigger its event_notifier when any of the ChannelMonitor::process_pending_events calls fails.
Added a simple BackgroundProcessor test to check events are being replayed and also added a commit documenting the failure-to-handle/persistence behavior of all event variants (i.e., addressed #2097). I now also dropped the serialization logic for OnionMessageIntercepted and OnionMessagePeerConnected, as we don't ever write these events.
Addressed the feedback.
Rebased on main to resolve minor conflicts after #3138 landed.
Let me know if I can squash the fixups.
LGTM. Feel free to squash, IMO.
Squashed without further changes.
The Make event handling fallible commit itself doesn't build, so check_commits is failing.
Squashed fixups without further changes:
> git diff-tree -U2 258853aed e617a394e
>