lnd icon indicating copy to clipboard operation
lnd copied to clipboard

routing: fix stuck inflight payments

Open yyforyongyu opened this issue 1 year ago • 4 comments

Replaces #8150, this PR attempts to fix #8146.

Timed out HTLCs are properly handle

The original assumption was, when the HTLC timed out, the attempt result wasn't sent back the payment lifecycle, thus causing the payment to be in stuck. However, this assumption is not true, as shown in the newly added itest - when an outgoing HTLC has timed out, the payment will be marked as failed. Moreover, if the outgoing HTLC is swept by the remote via the preimage path, the payment will be marked as succeeded. All these are expected behavior.

HTLC Attempt result is not saved

If there's already a result in db, GetAttemptResult will return it, so the result must not be found in db. Also notice that, since there's no result in db, the only line we can hit is: https://github.com/lightningnetwork/lnd/blob/e6fbaafda4a9cfe08ccee5106aa6dc1da62bcebc/htlcswitch/switch.go#L466 as otherwise hitting this line will return the error ErrPaymentIDNotFound: https://github.com/lightningnetwork/lnd/blob/e6fbaafda4a9cfe08ccee5106aa6dc1da62bcebc/htlcswitch/switch.go#L456 And the method GetAttemptResult is blocking at line: https://github.com/lightningnetwork/lnd/blob/e6fbaafda4a9cfe08ccee5106aa6dc1da62bcebc/htlcswitch/switch.go#L456

On the other hand, these network results will be cleaned when ChannelRouter starts here: https://github.com/lightningnetwork/lnd/blob/e6fbaafda4a9cfe08ccee5106aa6dc1da62bcebc/routing/router.go#L657

But this cleanup only applied to terminated payments. For inflight payments, we always keep the network results. This rules out the possibility that the HTLC attempt once had a result but was later removed.

This seems to leave us only one possibility - that the network result was never written to disk, causing an HTLC attempt waiting for its result forever, hence the payment stayed inflight, which leads to suspicion that this line is hit, thus the line go s.handleLocalResponse(packet) is skipped so the result is not saved: https://github.com/lightningnetwork/lnd/blob/e02fd39ce6f38fec635325bf6bb5138b92fdfcfe/htlcswitch/switch.go#L1289-L1293

This PR refactors handlePacketForward and makes sure the network result is always saved for locally-initiated packet.

yyforyongyu avatar Nov 14 '23 06:11 yyforyongyu

cc: @Roasbeef for approach ack

saubyk avatar Nov 28 '23 17:11 saubyk

@bhandras: review reminder @yyforyongyu, remember to re-request review from reviewers when ready

lightninglabs-deploy avatar May 16 '24 13:05 lightninglabs-deploy

lightninglabs-deploy mute 336h

yyforyongyu avatar Jun 14 '24 06:06 yyforyongyu

!lightninglabs-deploy mute 336h30m

yyforyongyu avatar Jun 26 '24 13:06 yyforyongyu

!lightninglabs-deploy mute 336h30m

yyforyongyu avatar Jul 10 '24 15:07 yyforyongyu

Why are you guys muting this? This happens in prod and imo serious

kilrau avatar Jul 20 '24 10:07 kilrau

Why are you guys muting this? This happens in prod and imo serious

Hi @kilrau the notifications for reviews were muted as YY focused on other prs, but as you can see the pr is still tagged for 18.3 and will be picked up again soon

saubyk avatar Jul 20 '24 16:07 saubyk

Why are you guys muting this? This happens in prod and imo serious

Hi @kilrau the notifications for reviews were muted as YY focused on other prs, but as you can see the pr is still tagged for 18.3 and will be picked up again soon

Getting this into 0.18.3 sounds good 🙏

kilrau avatar Jul 20 '24 18:07 kilrau

[!IMPORTANT]

Review skipped

Auto reviews are limited to specific labels.

Labels to auto review (1)
  • llm-review

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share
Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

coderabbitai[bot] avatar Jul 22 '24 13:07 coderabbitai[bot]

@bitromortac: review reminder

lightninglabs-deploy avatar Jul 29 '24 14:07 lightninglabs-deploy

Exchanged with the guys from the BOLTZ team, and it seems like they are suffering from another issue in regards of Inflight Payments which I am not sure this PR will solve.

There seems to be an edge case where we fail a payment ( exitresumePayment) although we have still Inflight HTLCs which need to be resolved. This has some side effects:

When a service tries to use trackpayment => this call remains pending although we already canceled all resultCollector goroutines, so it will pending until the next restart.

Moreover this payment will be inflight until the next restart in general, where we try to recollect all the Inflight HTLCs of a payment.

I think I might have found the culprit based on a logs of a payment attempt which got stuck:

image_2024-08-01_19-09-16

So basically we are in the process of adding a new HTLC Attempt to the payment here:

Code parts taken from this PR.

attempt, err := p.registerAttempt(rt, ps.RemainingAmt)
		if err != nil {
			return exitWithErr(err)
		}

This will however return an error if the payment is failed already in func (p *controlTower) RegisterAttempt(....

// Check if registering a new attempt is allowed.
		if err := payment.Registrable(); err != nil {
			return err
		}

hence the Payment Lifecycle will exit with (as seen in the picture):


	// If the payment is already failed, we won't allow adding more HTLCs.
	if m.State.PaymentFailed {
		return ErrPaymentPendingFailed
	}

I think the problem is For MPP when we hit the MPP Timeout error we would mark a payment prematurely as failed although HTLCs are still pending in handleSwitchErr

Because we would fail a payment as soon as the final hop reported the failure which is the case for the MPP Timeout for example.

I think my analysis describes whats happening in the above log picture, although it's not a full log but just a grep of the payment hash. However I think we really need to be carefull to not prematurely kill the payment lifecycle until all HTLCs reported back, or when killing the payment flow, we should at least also kill the trackpayment stream so that the subscriber is not waiting forever.

ziggie1984 avatar Aug 01 '24 22:08 ziggie1984

here is another log which shows the MPP-Timeout of one of the HTLCAttempts:

image_2024-08-01_19-24-22

ziggie1984 avatar Aug 01 '24 22:08 ziggie1984

Interesting case @ziggie1984! Cool could you open a new issue to track this instead - think I know what happened, but don't have a good fix, plus I prefer not growing the scope of this PR to stay focused. In short it's due to async fetching of the same payment, and it happens when an invoice times out exactly during the period while the node is still restarting. Would also be great to have raw logs instead of screenshots, as the timestamps were messed up and difficult to analyze.

yyforyongyu avatar Aug 02 '24 23:08 yyforyongyu