ln: om: add option to internally queue forwards to offline peers
While implementing BOLT 12 onion-message forwarding to offline user nodes in our LSP, I found this relatively simple change in OnionMessenger that reduced the integration on our end to +50 loc. Figured I'd create a draft PR here to see if there's any appetite for this change--if so, I can work on making this more upstreamable :)
Commit
Adds an option to OnionMessenger to have it queue all onion message forwards to offline peers.
Context: when someone requests a BOLT 12 invoice from one of our user nodes, our LSP may need to first spin up their node before it can finally forward the onion message.
Why not use intercept_messages_for_offline_peers? We'd have to rebuild almost the exact same message_recipients queue outside LDK. In our case, there's no reason to persist the onion message forwards since our user nodes can run on-demand and should be ready to handle the forward in time. With this change, the logic turned out super simple on our end since we just have to handle the ConnectionNeeded event. All the pending messages then automatically forward once the peer connects. If the connection fails or it's not one of our peers, the pending messages will just naturally churn out in a few timer ticks.
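To make that concrete, here's a rough sketch of what our handler shape looks like. The helpers `is_our_user_node`, `spin_up_user_node`, and `connect_to_peer` are hypothetical stand-ins for our own orchestration code, not anything in LDK:

```rust
use bitcoin::secp256k1::PublicKey;
use lightning::events::Event;
use lightning::ln::msgs::SocketAddress;

// Hypothetical stand-ins for the LSP's own orchestration and peer-connection logic.
fn is_our_user_node(_node_id: &PublicKey) -> bool { true }
fn spin_up_user_node(_node_id: &PublicKey) { /* start the user's node on demand */ }
fn connect_to_peer(_node_id: &PublicKey, _addresses: &[SocketAddress]) { /* open a p2p connection */ }

fn handle_event(event: Event) {
	match event {
		// The only extra work with internally queued forwards: bring the user
		// node online and connect to it. Once the peer connects, OnionMessenger
		// forwards the queued onion messages on its own; if the connection
		// never happens, they age out of the queue after a few timer ticks.
		Event::ConnectionNeeded { node_id, addresses } => {
			if is_our_user_node(&node_id) {
				spin_up_user_node(&node_id);
				connect_to_peer(&node_id, &addresses);
			}
		},
		_ => { /* all other events handled as before */ },
	}
}
```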
Codecov Report
Attention: Patch coverage is 34.48276% with 38 lines in your changes missing coverage. Please review.
Project coverage is 89.31%. Comparing base (8d44e80) to head (6b8b874).
| Files with missing lines | Patch % | Lines |
|---|---|---|
| lightning/src/onion_message/messenger.rs | 34.48% | 38 Missing :warning: |
```diff
@@            Coverage Diff             @@
##             main    #3765      +/-   ##
==========================================
- Coverage   89.37%   89.31%   -0.07%
==========================================
  Files         157      157
  Lines      124095   124150      +55
==========================================
- Hits       110914   110881      -33
- Misses      10470    10547      +77
- Partials    2711     2722       +11
```
I imagine we could take this upstream, but we'd have to figure out an anti-DoS policy that makes sense for any intended use-cases here. For y'all, that probably means dropping messages if they're queued for more than a minute or three, but maybe for others it doesn't? We'll also have to limit the queues, which are otherwise fairly DoS-vulnerable. It's the DoS policy stuff that's why we didn't do this in the first place (well, plus persistence), and I'm not really sure what the right answers are :/
For us at least, I think (1) a max total buffer size, (2) a max per-peer buffer size, and (3) a max queued lifetime (in timer ticks) are probably Good Enough™ DoS/QoS policies for now, and are otherwise easily tunable.
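If it helps, here's roughly the shape I have in mind, just as a sketch; the names and defaults are made up for illustration and aren't part of this PR:

```rust
/// Illustrative knobs for the three limits above; not part of this PR.
pub struct OfflineForwardQueueLimits {
	/// (1) Cap on the total number of forwards buffered across all offline peers.
	pub max_total_buffered: usize,
	/// (2) Cap on the number of forwards buffered for any single offline peer.
	pub max_per_peer_buffered: usize,
	/// (3) Number of timer ticks a buffered forward may wait before being dropped.
	pub max_queued_ticks: u8,
}

impl Default for OfflineForwardQueueLimits {
	fn default() -> Self {
		// Made-up defaults; real values would need tuning per deployment.
		Self { max_total_buffered: 1024, max_per_peer_buffered: 32, max_queued_ticks: 3 }
	}
}
```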
What makes this diff less upstreamable IMO is that other LSPs probably can't make the same assumptions as us; their offline peers can't always come online in time so they need a persistent queue for these messages.
Right, I think that was kinda my point - it's not entirely clear to me what the right DoS limits are here, and we'll have to figure them out in a general sense, not just for y'all. I think at a minimum that means some logic to figure out whether the peer we're told to keep a message for is even a peer we care about (we don't know which channels we have in the OnionMessenger), which would be a bunch more work, I imagine.
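As a strawman for that check, maybe something like a small policy hook the messenger consults before queueing; nothing like this exists today and the names are made up:

```rust
use std::collections::HashSet;

use bitcoin::secp256k1::PublicKey;

/// Strawman hook OnionMessenger could consult before queueing a forward for an
/// offline peer, since it has no view of our channels itself. Purely illustrative.
pub trait OfflinePeerPolicy {
	/// Returns true if forwards to `peer` should be buffered while it's offline,
	/// e.g. because we have (or are opening) a channel with it.
	fn should_queue_for(&self, peer: &PublicKey) -> bool;
}

/// Trivial example policy backed by a set of known "our user node" keys.
pub struct KnownUserNodes {
	pub nodes: HashSet<PublicKey>,
}

impl OfflinePeerPolicy for KnownUserNodes {
	fn should_queue_for(&self, peer: &PublicKey) -> bool {
		self.nodes.contains(peer)
	}
}
```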
> it's not entirely clear to me what the right DoS limits are here, and we'll have to figure them out in a general sense,
FWIW, given that we're starting to introduce multiple anti-DoS limits in different parts of the codebase (e.g., also in lightning-liquidity, PeerManager, etc.), I wonder if we'd need to start looking into a more general 'peer scoring' approach? Otherwise we may end up with limits that look alright by themselves, but an adversary might try to exploit all of them at the same time, and in aggregate they might not hold up?
I'm pretty skeptical of scores generally. Either a peer is wasting your resources or they aren't: if they are, disconnect (and ban); if they aren't, chill. We should definitely consider unifying some of our DoS limits, though, especially as we often care about the "peer has a channel with us" distinction to exempt many peers from some DoS limits.