rust-lightning
Make timer_tick_occurred a second long timer
This PR updates timer_tick_occurred to operate on a second-long timer, providing finer timing control; a minimal sketch of the new tick cadence follows the reasoning below.
Reasoning:
- This PR is the prerequisite of #3010, which introduces the ability to retry the InvoiceRequest message in a finite time if we have not received the corresponding BOLT12 Invoice back.
- The retries need to be faster than a standard minute-long timer.
- The discussion was between introducing a new counter vs changing timer_tick duration (see comment). This PR is an attempt at the second approach.
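For concreteness, here is a minimal sketch of what the new cadence means for someone driving the timer by hand rather than through lightning-background-processor; the drive_timer helper and its closure are illustrative, not part of the API:

```rust
use std::time::Duration;

// Illustrative only: a hand-rolled timer loop. With this PR the sleep between
// calls to timer_tick_occurred shrinks from a minute to a second.
fn drive_timer<F: FnMut()>(mut timer_tick_occurred: F) {
    loop {
        std::thread::sleep(Duration::from_secs(1)); // previously from_secs(60)
        timer_tick_occurred();
    }
}

fn main() {
    // In real usage the closure would call ChannelManager::timer_tick_occurred.
    drive_timer(|| println!("tick"));
}
```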
Codecov Report
All modified and coverable lines are covered by tests ✅
Project coverage is 90.03%. Comparing base (1890e80) to head (9a344b6). Report is 11 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #3065 +/- ##
==========================================
+ Coverage 89.82% 90.03% +0.20%
==========================================
Files 116 116
Lines 96466 98074 +1608
Branches 96466 98074 +1608
==========================================
+ Hits 86655 88305 +1650
+ Misses 7264 7244 -20
+ Partials 2547 2525 -22
This enhancement enhances the system's flexibility and accuracy in timing operations.
This doesn't really say anything. Please clearly state the practical reason for the change.
What functionality do we plan to add that will need sub-minute ticks? Also, how do we ensure users migrate to the new duration on their end? This is basically a "silent" breaking change to the API.
The motivation here is to retry invoice_requests. For non-mobile, in a world where we can route onion messages, it's probably not useful: we can send the request along many paths, so retrying adds little. On mobile it's kinda useful, especially before we can route, where we may have lost connectivity and just need to retry a second later. Sadly, there's not really a great way to do this sans-std.
The alternative options here are basically: (a) do it during message handling, using std to detect time passing; (b) do it during message handling without detecting time passing; (c) this (maybe with renaming the method to break the API). I'm a bit split between (b) and (c), honestly.
I see, (b) is probably still not reliable enough for our needs though if we're not receiving messages (pings don't make it through the ChannelManager)? I guess (c) is fine, but have we explored doing this in OnionMessageHandler::release_pending_messages instead?
Updated from pr3065.01 to pr3065.02 (diff): Addressed @TheBlueMatt's comment.
Changes:
- Introduce a new constant TICKS_PER_MINUTE and use it as the multiplying factor for other constants. This lets us keep track of the multiplying factor in one place and makes maintenance easier if we change the tick duration later (a minimal sketch of this pattern follows below).
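For illustration, a minimal sketch of the constant-scaling pattern, assuming one-second ticks; EXAMPLE_TIMEOUT_TICKS is a made-up name, not an actual rust-lightning constant:

```rust
// The single place the tick-to-minute scaling factor lives.
const TICKS_PER_MINUTE: u64 = 60;

// A duration that used to be written as "2 ticks" (2 minutes) is now a
// multiple of TICKS_PER_MINUTE, so only one constant changes if the tick
// duration changes again.
const EXAMPLE_TIMEOUT_TICKS: u64 = 2 * TICKS_PER_MINUTE;

fn main() {
    assert_eq!(EXAMPLE_TIMEOUT_TICKS, 120);
    println!("timeout = {} one-second ticks (~2 minutes)", EXAMPLE_TIMEOUT_TICKS);
}
```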
@TheBlueMatt I think renaming the function to cause an explicit API failure would be a good idea. However, I haven't implemented it yet; I'd like to confirm it's the right move before proceeding with the name change. Thanks for the discussion, suggestions, and pointers, everyone.
I guess (c) is fine, but have we explored doing this in OnionMessageHandler::release_pending_messages instead?
Hmm... that may work, as ChannelManager can query its PendingOutboundPayments for anything still awaiting an invoice that hasn't timed out. Though when initially sending the message, it will already be in the AwaitingInvoice state. Maybe we only retry if release_pending_messages would return an empty Vec, as a simple way of not retrying before the initial request is sent? Even so, would sufficient time pass between calls to next_onion_message_for_peer before retrying?
It gets called in PeerManager::do_attempt_write_data, which can happen every time a ping needs to be sent out (currently every 10 secs via PeerManager::timer_tick_occurred), and on every call to PeerManager::process_events and PeerManager::write_buffer_space_avail. So, it may actually get called too often, but some simple rate limiting per retry request in release_pending_messages would help.
It gets called in PeerManager::do_attempt_write_data, which can happen every time a ping needs to be sent out (currently every 10 secs via PeerManager::timer_tick_occurred), and on every call to PeerManager::process_events and PeerManager::write_buffer_space_avail. So, it may actually get called too often, but some simple rate limiting per retry request in release_pending_messages would help.
Yeah... though I'm not sure how we'd do that rate limiting. Would an exponential backoff be fine? Also note that OffersMessageHandler::release_pending_messages is called each time next_onion_message_for_peer is called. So if you have a lot of peers, it will be called more frequently than if you have few. I guess ChannelManager knows how many peers it has, so it could take that into account?
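A minimal sketch of what per-request exponential backoff could look like if release_pending_messages tracked a resend schedule; the type and field names are assumptions, and the std::time clock would need a replacement for no-std builds:

```rust
use std::time::{Duration, Instant};

// Illustrative bookkeeping for one outstanding invoice_request.
struct InvoiceRequestRetry {
    next_retry_at: Instant,
    backoff: Duration,
}

impl InvoiceRequestRetry {
    fn new() -> Self {
        let backoff = Duration::from_secs(1);
        Self { next_retry_at: Instant::now() + backoff, backoff }
    }

    // Called on every release_pending_messages pass; returns true only when
    // the backoff has elapsed, then doubles the wait before the next resend.
    fn should_resend(&mut self, now: Instant) -> bool {
        if now < self.next_retry_at {
            return false;
        }
        self.backoff *= 2;
        self.next_retry_at = now + self.backoff;
        true
    }
}

fn main() {
    let mut retry = InvoiceRequestRetry::new();
    let t0 = Instant::now();
    // Two seconds in: past the initial 1s backoff, so resend (and back off to 2s).
    assert!(retry.should_resend(t0 + Duration::from_secs(2)));
    // One second later: still inside the 2s backoff window, so no resend yet.
    assert!(!retry.should_resend(t0 + Duration::from_secs(3)));
}
```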
We also may want to consider using different reply paths on retries (and maybe the initial sends for that matter) in case the reply path used is the issue.
Another alternative could be to extend the onion message handler to signal, when we read a message from the peer, that retries should be queued.
ISTM if we want to avoid this we'll want a new method on ChannelMessageHandler indicating that we've received any message from any peer for any handler, which can then trigger the retry here. This avoids any kind of weird API nonsense where the OnionMessenger gets the same signal and then just passes that info to the ChannelManager via the message fetching (to get a retry). This would need to be handled delicately to avoid performance impact; specifically, we'll want to avoid a Mutex here and gate the message retrying on a relaxed AtomicBool read.
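A minimal sketch of the relaxed-AtomicBool gating, assuming a hypothetical message_received hook; RetryGate and its methods are illustrative, not the actual ChannelManager/ChannelMessageHandler API:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Illustrative flag shared between the message-handling path and the retry path.
struct RetryGate {
    peer_message_seen: AtomicBool,
}

impl RetryGate {
    fn new() -> Self {
        Self { peer_message_seen: AtomicBool::new(false) }
    }

    // Called whenever any message arrives from any peer; no Mutex required.
    fn message_received(&self) {
        self.peer_message_seen.store(true, Ordering::Relaxed);
    }

    // Called from the retry path: cheap relaxed read-and-clear, so the more
    // expensive invoice_request resend only runs after we've heard from a peer.
    fn should_retry(&self) -> bool {
        self.peer_message_seen.swap(false, Ordering::Relaxed)
    }
}

fn main() {
    let gate = RetryGate::new();
    assert!(!gate.should_retry()); // nothing received yet
    gate.message_received();
    assert!(gate.should_retry());  // retry once after hearing from a peer
    assert!(!gate.should_retry()); // flag cleared by the previous check
}
```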
The other option, which @wpaulino suggested offline, is to hook deeply into the PeerManager, provide it some kind of flag saying "hey, I'm sending an important message here to peer A, please send A a ping immediately on the same socket and let me know if they responded, indicating they received the message because they responded to a later one", then use that indicator to gate the retry. I'm not sure it's worth that level of complexity, though.
Hi @TheBlueMatt, I’ve been diving into the approach you suggested and I have a few questions I could use your guidance on:
- When retrying, should we be going through all the PaymentPending messages, or is there a way we would be prioritizing which ones to retry first?
- How can we make sure we’re not retrying too frequently?
- This method seems great for situations where we switch from offline to online (like coming out of the subway), but do you think it’s as effective for other scenarios, like when the receiver is the one going offline?
Thanks a lot for your guidance!
When retrying, should we be going through all the PaymentPending messages, or is there a way we would be prioritizing which ones to retry first?
I don't see why we wouldn't use the normal flow - it tends to get drained very quickly.
How can we make sure we’re not retrying too frequently?
We should probably only retry once.
This method seems great for situations where we switch from offline to online (like coming out of the subway), but do you think it’s as effective for other scenarios, like when the receiver is the one going offline?
Indeed, it won't help in those cases; however, I'm not entirely clear on what we can do for them. If the recipient is often-offline, they probably shouldn't be using BOLT12 to receive (pre-async-payments, or unless they have a fairly robust notification/store+forward mechanism), and if they are at least somewhat robust, trying again in one second seems unlikely to help any more than trying again via a separate path or some small time delta later.
I've updated #3010 with the new approach (see comment).
Since we no longer need a sub-minute timer with this new method, this PR has become redundant. So, I'll go ahead and close it. Thanks a lot, everyone, for all the pointers and discussions!