routing: fix stuck inflight payments
Replaces #8150; this PR attempts to fix #8146.
Timed-out HTLCs are properly handled
The original assumption was that, when the HTLC timed out, the attempt result wasn't sent back to the payment lifecycle, thus causing the payment to be stuck. However, this assumption is not true, as shown in the newly added itest - when an outgoing HTLC has timed out, the payment will be marked as failed. Moreover, if the outgoing HTLC is swept by the remote via the preimage path, the payment will be marked as succeeded. All of these are expected behaviors.
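For reference, a minimal sketch of that expected mapping; the names below (htlcResolution, resolvePayment) are illustrative only, not lnd's actual types:

```go
package main

import "fmt"

// htlcResolution is an illustrative stand-in, not an lnd type.
type htlcResolution int

const (
	htlcTimedOut htlcResolution = iota
	htlcSweptWithPreimage
)

// resolvePayment mirrors the expected mapping the itest verifies: a timed-out
// outgoing HTLC fails the payment, while an HTLC taken by the remote via the
// preimage path settles it.
func resolvePayment(res htlcResolution) string {
	switch res {
	case htlcTimedOut:
		return "payment marked as failed"
	case htlcSweptWithPreimage:
		return "payment marked as succeeded"
	default:
		return "payment stays inflight"
	}
}

func main() {
	fmt.Println(resolvePayment(htlcTimedOut))
	fmt.Println(resolvePayment(htlcSweptWithPreimage))
}
```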
HTLC Attempt result is not saved
If there's already a result in the db, GetAttemptResult will return it, so the result must not be found in the db. Also notice that, since there's no result in the db, the only line we can hit is:
https://github.com/lightningnetwork/lnd/blob/e6fbaafda4a9cfe08ccee5106aa6dc1da62bcebc/htlcswitch/switch.go#L466
as otherwise hitting this line will return the error ErrPaymentIDNotFound:
https://github.com/lightningnetwork/lnd/blob/e6fbaafda4a9cfe08ccee5106aa6dc1da62bcebc/htlcswitch/switch.go#L456
And the method GetAttemptResult is blocking at this line:
https://github.com/lightningnetwork/lnd/blob/e6fbaafda4a9cfe08ccee5106aa6dc1da62bcebc/htlcswitch/switch.go#L456
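To make the blocking behavior concrete, here is a minimal Go sketch of the lookup-or-wait pattern described above; resultStore, networkResult, getResult and storeResult are illustrative names, not lnd's actual htlcswitch internals:

```go
package sketch

import "sync"

// Illustrative stand-ins, not lnd's htlcswitch types.
type networkResult struct {
	settled bool
}

type resultStore struct {
	mu      sync.Mutex
	results map[uint64]*networkResult
	subs    map[uint64][]chan *networkResult
}

func newResultStore() *resultStore {
	return &resultStore{
		results: make(map[uint64]*networkResult),
		subs:    make(map[uint64][]chan *networkResult),
	}
}

// getResult mirrors the lookup-or-wait behavior described above: if a result
// was already persisted it is delivered immediately, otherwise the returned
// channel blocks until storeResult runs for the same attempt ID. If the
// result is never stored, the caller waits forever - the stuck-inflight
// symptom this PR investigates.
func (s *resultStore) getResult(attemptID uint64) <-chan *networkResult {
	s.mu.Lock()
	defer s.mu.Unlock()

	resChan := make(chan *networkResult, 1)
	if res, ok := s.results[attemptID]; ok {
		resChan <- res
		return resChan
	}

	s.subs[attemptID] = append(s.subs[attemptID], resChan)
	return resChan
}

// storeResult persists a result and wakes up any blocked subscribers.
func (s *resultStore) storeResult(attemptID uint64, res *networkResult) {
	s.mu.Lock()
	defer s.mu.Unlock()

	s.results[attemptID] = res
	for _, c := range s.subs[attemptID] {
		c <- res
	}
	delete(s.subs, attemptID)
}
```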
On the other hand, these network results will be cleaned up when ChannelRouter starts here:
https://github.com/lightningnetwork/lnd/blob/e6fbaafda4a9cfe08ccee5106aa6dc1da62bcebc/routing/router.go#L657
But this cleanup is only applied to terminated payments. For inflight payments, we always keep the network results. This rules out the possibility that the HTLC attempt once had a result that was later removed.
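For illustration, a hedged sketch of what that startup cleanup amounts to (illustrative types and function names, not the actual ChannelRouter code):

```go
package sketch

// cleanNetworkResults removes results only for attempts whose payment reached
// a terminal state; inflight payments keep theirs, so an attempt that once
// had a result cannot lose it this way.
func cleanNetworkResults(results map[uint64][]byte,
	isTerminated func(attemptID uint64) bool) {

	for id := range results {
		if isTerminated(id) {
			delete(results, id)
		}
	}
}
```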
This seems to leave only one possibility: the network result was never written to disk, causing the HTLC attempt to wait for its result forever, hence the payment stayed inflight. This leads to the suspicion that the line below is hit, and thus the line go s.handleLocalResponse(packet) is skipped, so the result is not saved:
https://github.com/lightningnetwork/lnd/blob/e02fd39ce6f38fec635325bf6bb5138b92fdfcfe/htlcswitch/switch.go#L1289-L1293
This PR refactors handlePacketForward and makes sure the network result is always saved for locally-initiated packets.
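As a rough sketch of the invariant the refactor is after - not the actual switch code, with htlcPacket and persistLocalResult as stand-in names:

```go
package sketch

// htlcPacket is an illustrative stand-in for the packet handed to the switch.
type htlcPacket struct {
	isLocal bool
}

// persistLocalResult stands in for the handleLocalResponse path that writes
// the network result to disk.
func persistLocalResult(p *htlcPacket) {}

// handlePacketForward sketches the invariant: every locally-initiated packet
// reaches the response handler that persists its result, no matter which
// branch the forwarding logic takes, so the payment lifecycle can always
// observe a terminal outcome for the attempt.
func handlePacketForward(p *htlcPacket, forward func(*htlcPacket) error) error {
	if p.isLocal {
		persistLocalResult(p)
		return nil
	}

	// Remote packets follow the normal forwarding path.
	return forward(p)
}
```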
cc: @Roasbeef for approach ack
@bhandras: review reminder
@yyforyongyu, remember to re-request review from reviewers when ready
lightninglabs-deploy mute 336h
!lightninglabs-deploy mute 336h30m
Why are you guys muting this? This happens in prod and imo serious
Hi @kilrau, the notifications for reviews were muted as YY focused on other PRs, but as you can see the PR is still tagged for 18.3 and will be picked up again soon.
Getting this into 0.18.3 sounds good 🙏
@bitromortac: review reminder
Exchanged with the guys from the BOLTZ team, and it seems like they are suffering from another issue regarding inflight payments, which I am not sure this PR will solve.
There seems to be an edge case where we fail a payment (exit resumePayment) although we still have inflight HTLCs which need to be resolved. This has some side effects:
When a service tries to use trackpayment, the call remains pending although we already canceled all resultCollector goroutines, so it will stay pending until the next restart.
Moreover, this payment will in general remain inflight until the next restart, where we try to re-collect all the inflight HTLCs of a payment.
I think I might have found the culprit based on the logs of a payment attempt which got stuck:
So basically we are in the process of adding a new HTLC attempt to the payment here (code parts taken from this PR):
attempt, err := p.registerAttempt(rt, ps.RemainingAmt)
if err != nil {
return exitWithErr(err)
}
This will however return an error if the payment is already failed, in func (p *controlTower) RegisterAttempt(...):
// Check if registering a new attempt is allowed.
if err := payment.Registrable(); err != nil {
return err
}
hence the Payment Lifecycle will exit with (as seen in the picture):
// If the payment is already failed, we won't allow adding more HTLCs.
if m.State.PaymentFailed {
return ErrPaymentPendingFailed
}
I think the problem is: for MPP, when we hit the MPP timeout error, we would prematurely mark the payment as failed in handleSwitchErr although HTLCs are still pending, because we fail a payment as soon as the final hop reports the failure, which is exactly the case for the MPP timeout.
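A minimal sketch of the behavior being argued for here, assuming hypothetical names (paymentState, handleTerminalFailure) rather than lnd's actual handleSwitchErr:

```go
package sketch

import "errors"

// errMPPTimeout stands in for the terminal failure returned by the final hop
// when an MPP set does not complete in time.
var errMPPTimeout = errors.New("mpp timeout")

// paymentState is an illustrative stand-in for the lifecycle's view of a
// payment.
type paymentState struct {
	numAttemptsInFlight int
	terminalFailure     error
}

// handleTerminalFailure records a terminal failure such as the MPP timeout so
// no further attempts are registered, but only signals that the payment may
// be marked failed once no HTLCs remain inflight. Otherwise the lifecycle
// exits while attempts are still unresolved and trackpayment subscribers hang.
func handleTerminalFailure(ps *paymentState, failure error) (markFailed bool) {
	ps.terminalFailure = failure
	return ps.numAttemptsInFlight == 0
}
```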
I think my analysis describes what's happening in the above log picture, although it's not a full log but just a grep of the payment hash. However, I think we really need to be careful not to prematurely kill the payment lifecycle until all HTLCs have reported back, or, when killing the payment flow, we should at least also kill the trackpayment stream so that the subscriber is not waiting forever.
Here is another log which shows the MPP timeout of one of the HTLC attempts:
Interesting case @ziggie1984! Cool, could you open a new issue to track this instead? I think I know what happened, but don't have a good fix, plus I prefer not to grow the scope of this PR, to stay focused. In short, it's due to async fetching of the same payment, and it happens when an invoice times out exactly during the period while the node is still restarting. It would also be great to have raw logs instead of screenshots, as the timestamps were messed up and difficult to analyze.