
routing+htlcswitch: fix stuck inflight payments

Open yyforyongyu opened this pull request 1 year ago • 1 comment

Fix #8975

This PR aims to fix a scenario where a payment could get stuck. The major changes are:

  • All the DB updates for a given payment now happen in a single goroutine (resumePayment).
  • Minor optimization of sphinx circuit creation - the error decryptor is now created only when needed, i.e., when processing an update_fail_htlc (a rough sketch of the idea follows).
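For illustration, a minimal Go sketch of the lazy-decryptor idea; ErrorDecrypter, NewDecrypter, and handleFailHTLC are hypothetical names, the real lnd types live in the htlcswitch and sphinx packages:

package main

import "fmt"

type ErrorDecrypter struct{ sessionKey string }

// NewDecrypter stands in for building the sphinx error decryptor,
// which re-derives the shared secrets for the route.
func NewDecrypter(sessionKey string) *ErrorDecrypter {
	fmt.Println("building decrypter (the relatively expensive step)")
	return &ErrorDecrypter{sessionKey: sessionKey}
}

type attempt struct {
	sessionKey string
	decrypter  *ErrorDecrypter // nil until an update_fail_htlc arrives
}

// handleFailHTLC constructs the decrypter lazily, so attempts that
// settle never pay the construction cost.
func (a *attempt) handleFailHTLC(encryptedReason []byte) {
	if a.decrypter == nil {
		a.decrypter = NewDecrypter(a.sessionKey)
	}
	_ = encryptedReason // decrypt and map to a forwarding error here
}

func main() {
	a := &attempt{sessionKey: "session-key"}
	a.handleFailHTLC([]byte("opaque failure blob"))
}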

TODOs

  • [x] fix unit test in htlcswitch
  • [x] fix unit test in routing

yyforyongyu avatar Oct 02 '24 13:10 yyforyongyu

[!IMPORTANT]

Review skipped

Auto reviews are limited to specific labels.

:label: Labels to auto review (1)
  • llm-review

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

coderabbitai[bot] avatar Oct 02 '24 13:10 coderabbitai[bot]

if terminal { close(subscriber.queue.ChanIn()) }

Because HTLCs are still in flight, the payment state is not terminal, so we will not close the channel here. However, we will close all result collectors, which makes it impossible for the track_payment stream to report back. A sketch of the mismatch is below.
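A minimal Go sketch of the mismatch being described; payment, terminal, and the channel names are hypothetical stand-ins, not lnd's actual types:

package main

import "fmt"

type payment struct {
	inflightHTLCs int
	failed        bool
}

// terminal mirrors the rule in the quoted snippet: the tracking stream
// is only closed once the payment reaches a terminal state.
func (p *payment) terminal() bool {
	return p.inflightHTLCs == 0 && p.failed
}

func main() {
	p := &payment{inflightHTLCs: 1, failed: true}

	stream := make(chan string, 1)  // track_payment subscriber stream
	results := make(chan string, 1) // fed by the result collectors

	// The lifecycle errors out and tears down the result collectors...
	close(results)

	// ...but the tracking stream is not closed, because an HTLC is
	// still counted as in flight, so subscribers wait forever.
	if p.terminal() {
		close(stream)
	}
	fmt.Printf("terminal=%v; stream left open with no producers\n", p.terminal())
}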

ziggie1984 avatar Oct 28 '24 14:10 ziggie1984

However, we will close all result collectors, which makes it impossible for the track_payment stream to report back.

I think that's no longer the case since we fixed the stuck payment here? Even if the stuckness weren't fixed, the track payment behavior is IMO correct - it shouldn't quit unless the payment is terminated.

yyforyongyu avatar Oct 28 '24 17:10 yyforyongyu

So I analysed our payment behaviour and found that we would fail a payment without waiting for all result collectors. This would also leave the trackPayment subscriber hanging indefinitely. This is now solved by waiting for all results and updating the payment DB. Let me know if this approach is good @yyforyongyu.

Thoughts about the current change.

So the reason why we should wait for all shards to resolve is the following: if we launch an MPP payment, for example, and receive the first error for one of the shards, we would mark the payment as failed, closing all the collectors in the process. This means our payment never has a chance to resolve, and the trackPayment subscribers will wait indefinitely because the payment never reached a terminal condition (there are still inflight HTLCs). A sketch of the wait-for-all approach is below.
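A minimal Go sketch of the wait-for-all-shards idea, using a WaitGroup over hypothetical collector goroutines; lnd's real implementation tracks attempts in the payment DB instead:

package main

import (
	"fmt"
	"sync"
)

type shardResult struct {
	id  int
	err error
}

func main() {
	var wg sync.WaitGroup
	results := make(chan shardResult, 3)

	// One collector per launched shard of the MPP payment.
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			results <- shardResult{id: id} // stand-in for a switch result
		}(i)
	}

	// Even if the first result is a failure, keep draining until every
	// shard has resolved, then persist the terminal state exactly once.
	go func() { wg.Wait(); close(results) }()
	for r := range results {
		fmt.Printf("shard %d resolved (err=%v)\n", r.id, r.err)
	}
	fmt.Println("all shards resolved; safe to mark the payment terminal")
}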

ziggie1984 avatar Nov 03 '24 21:11 ziggie1984

So the reason why we should wait for all shards to resolve is the following

The assessment is correct. However, I don't think we want to add another state to decide whether we should exit the payment lifecycle or not; instead we should stick to p.decideNextStep so there is only a single place that acts on this logic. I pushed a fix for this case.

So I analysed our payment behaviour and found out, that we would fail a payment without waiting for all result collectors.

I did try to write an itest for this case but failed. Could you elaborate on this case?

Also I'd like to limit the scope of this PR to focus on fixing this case only, as stated in the issue description. Given the complexity of the payment lifecycle, I think it's better to focus on one issue at a time, and open PRs for other fixes.

Moving forward, I think handleSwitchErr may need a closer look: https://github.com/lightningnetwork/lnd/blob/22ae3e5ddd2abc7b06302b17a9368b4aa191aec7/routing/payment_lifecycle.go#L897-L903

We need to revisit the errors returned from the link and decide which errors cause an attempt failure and which cause a payment failure. I suspect we are prematurely marking the payment as failed when we should instead mark the attempt as failed. A rough sketch of that distinction is below.
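To make the distinction concrete, here is a hedged Go sketch with made-up error values; lnd's real classification operates on the wire-format failure messages, not these stand-ins:

package main

import (
	"errors"
	"fmt"
)

var (
	errTemporaryChannelFailure = errors.New("temporary channel failure")
	errIncorrectPaymentDetails = errors.New("incorrect payment details")
)

// failsPayment classifies a link error: terminal failures from the
// recipient should end the payment, while transient routing failures
// should only end the attempt so another route can be tried.
func failsPayment(err error) bool {
	return errors.Is(err, errIncorrectPaymentDetails)
}

func main() {
	for _, err := range []error{errTemporaryChannelFailure, errIncorrectPaymentDetails} {
		if failsPayment(err) {
			fmt.Println("fail the payment:", err)
		} else {
			fmt.Println("fail the attempt only, retry:", err)
		}
	}
}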

yyforyongyu avatar Nov 04 '24 06:11 yyforyongyu

Also I'd like to limit the scope of this PR to focus on fixing https://github.com/lightningnetwork/lnd/issues/8975#issuecomment-2270528222 only, as stated in the issue description. Given the complexity of the payment lifecycle, I think it's better to focus on one issue at a time, and open PRs for other fixes.

Looking at your proposal, it looks like it might solve the issue, because we would always wait in decideNextStep. I am more in favour of not adding so many exceptions and special-casing a lot of errors; I would prefer the approach where we error out and just wait for all collectors every time, but I can agree with either solution.

ziggie1984 avatar Nov 04 '24 07:11 ziggie1984

I am more in favour of not adding so many exceptions and special-casing a lot of errors; I would prefer the approach where we error out and just wait for all collectors every time, but I can agree with either solution.

I actually think it's quite the opposite - if we use another counter and read the chan, it creates more questions, e.g., what if multiple results are sent to resultCollected? What if resultCollected is empty and blocking? With the counter approach we not only have more state to deal with, we also dodge the issue in decideNextStep or registerAttempt, which is likely to hide future issues. The commit I put there is a hack, but it doesn't pretend the issue is fixed - we need to go deep into the switch and understand when to fail the payment vs the attempt.

That being said, I'm considering removing it too, as it's hacky. As for the edge case, consider a payment made of three HTLCs:

  1. Two of the HTLCs are sent; the third one is blocking on requestRoute.
  2. One of the already-sent HTLCs fails, causing the payment to be marked as failed.
  3. When the third one is sent, the lifecycle loop errors out because registerAttempt fails due to Registrable returning an error.

If we can confirm that step 2 indeed happens, then we can provide a solid fix to make sure we don't fail the payment prematurely, and we won't need the hack. On the other hand, with the counter approach we wouldn't notice this issue and wouldn't attempt a fix. A sketch of step 3 follows.
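For illustration, a minimal Go sketch of step 3 with hypothetical types; only the method name Registrable is taken from the discussion above:

package main

import (
	"errors"
	"fmt"
)

type paymentState struct{ failed bool }

var errPaymentFailed = errors.New("payment already failed, not registrable")

// Registrable mirrors the check the lifecycle hits in registerAttempt:
// once the payment is marked failed, new attempts are rejected.
func (p *paymentState) Registrable() error {
	if p.failed {
		return errPaymentFailed
	}
	return nil
}

func main() {
	p := &paymentState{}
	p.failed = true // step 2: one sent HTLC failed, payment marked failed

	// Step 3: the third shard finally gets a route and tries to register.
	if err := p.Registrable(); err != nil {
		fmt.Println("lifecycle loop errors out:", err)
	}
}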

yyforyongyu avatar Nov 04 '24 08:11 yyforyongyu

Two of the HTLCs are sent; the third one is blocking on requestRoute.

Could you elaborate on why we would block at requestRoute? IMO we fetch the route there, and if it's not available, we fail. We only block in decideNextStep, no?

Yeah, I think your example might happen - not sure how often, though it seems reasonable to assume.

My current concern - and maybe you can jump in and mitigate this case - is the following:

Our SendPaymentV2, probably the most-used way of sending payments, never listens to the outcome of the payment lifecycle; it subscribes to the payment and listens to the control tower to signal the terminal state (see the sketch below). So in my opinion we should make sure the control tower is always able to resolve the payment - hence waiting for all shards - and not let the payment lifecycle take away the control tower's ability to resolve payments by closing the result collectors.

So I am not sure here; maybe we should really invest the time and find a proper solution rather than rushing this fix in. Maybe my concerns are overrated and we don't run into those edge cases that often anymore, given that we now sync the payment in one loop...
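A rough Go sketch of the flow described above, with hypothetical names; subscribe stands in for the control-tower payment subscription that backs SendPaymentV2:

package main

import "fmt"

type paymentUpdate struct {
	status   string // "in_flight", "succeeded", "failed"
	terminal bool
}

// subscribe mimics how the RPC stream behaves: it only returns to the
// caller once a terminal update is published by the control tower.
func subscribe(updates <-chan paymentUpdate) {
	for u := range updates {
		fmt.Println("update:", u.status)
		if u.terminal {
			return // SendPaymentV2 completes here
		}
	}
	fmt.Println("stream ended without a terminal state - the caller would hang")
}

func main() {
	updates := make(chan paymentUpdate, 2)
	updates <- paymentUpdate{status: "in_flight"}
	updates <- paymentUpdate{status: "succeeded", terminal: true}
	close(updates)
	subscribe(updates)
}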

ziggie1984 avatar Nov 04 '24 09:11 ziggie1984

Could you elaborate on why we would block at requestRoute? IMO we fetch the route there, and if it's not available, we fail. We only block in decideNextStep, no?

I also don't know how it could happen 😂 - just making assumptions. If requestRoute always returns quickly, then I doubt we will ever run into this edge case, because we'd send out all three HTLCs immediately.

it subscribes to the payment and listens to the control tower to signal the terminal state

Now that you mention it, I think the fix is to notify it in the payment lifecycle - something like this.

yyforyongyu avatar Nov 04 '24 11:11 yyforyongyu

Now that you mention it, I think the fix is to notify it in the payment lifecycle - something like this.

I think this can also be a solution. I'm not super happy with it, because we kind of prevent the payment from being marked as failed until a restart, but at least the subscriber does not hang indefinitely.

Maybe this issue should be taken care of in the payment SQL refactor I am doing, and we keep a TODO here? However, @hieblmi might need this feature for his static address functionality.

I think we have all the cards on the table; we just need to decide which direction to go. (cc @saubyk - I think we need to discuss this one after the next standup.)

ziggie1984 avatar Nov 04 '24 11:11 ziggie1984

!lightninglabs-deploy mute 336h

yyforyongyu avatar Nov 11 '24 12:11 yyforyongyu

@yyforyongyu, remember to re-request review from reviewers when ready

lightninglabs-deploy avatar Jan 20 '25 14:01 lightninglabs-deploy

I still think that we should never really abort the payment lifecycle as long as we have pending htlcs we are awaiting the result from.

Yeah, I think that's another direction we can aim at. Atm the canceling logic seems to need more work - if we want to properly cancel a payment, we should also send a fail HTLC msg to the peer, wait for the commitment dance to finish, and only then safely exit and mark the payment as canceled. This way the user can be sure the HTLCs won't ever be settled.

yyforyongyu avatar Jan 27 '25 09:01 yyforyongyu

Atm the canceling logic seems to need more work - if we want to properly cancel a payment, we should also send a fail HTLC msg to the peer, wait for the commitment dance to finish, and only then safely exit and mark the payment as canceled. This way the user can be sure the HTLCs won't ever be settled.

I may be misinterpreting this comment, but it isn't possible for us to send a fail HTLC to the peer for an outgoing payment. We can only wait for them to resolve the payment, or for us to go onchain.

Thinking a bit more about the on-chain case: do we handle the case where a channel with HTLCs for an outgoing payment goes on chain (though IMO we should never actually go onchain ourselves for an outgoing payment: https://github.com/lightningnetwork/lnd/issues/8444)? In terms of payment failure/success, it's possible that we go on chain and all the HTLCs are settled, leading to a payment success (albeit a very slow one).

Roasbeef avatar Feb 06 '25 02:02 Roasbeef

Inspired by @Roasbeef's suggestion on using a mutex, I realized there is a simpler and maybe more "correct" way to handle the attempt result. Previously, when a result was returned from the switch, we would perform two actions:

  • save it to a switch result map, to be processed later during the next lifecycle iteration.
  • send a signal to p.resultCollected so p.decideNextStep can continue.

I realized these two actions respond to the same fact - that the result is ready - and processing the same event in two places is likely to cause inconsistency in the future. So a better approach is to simply send the result to the channel p.resultCollected and handle it there, which is implemented in the last fixup commit. A sketch of the idea is below.
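A minimal Go sketch of the single-path design; lifecycle, collectResult, and attemptResult are hypothetical stand-ins, only p.resultCollected and decideNextStep are names from the discussion:

package main

import "fmt"

type attemptResult struct {
	attemptID uint64
	err       error
}

type lifecycle struct {
	resultCollected chan attemptResult
}

// collectResult runs per attempt and forwards the switch's verdict,
// instead of also writing it to a shared result map.
func (p *lifecycle) collectResult(id uint64, err error) {
	p.resultCollected <- attemptResult{attemptID: id, err: err}
}

// decideNextStep is now the only consumer of attempt results, so the
// "stored in a map but signalled elsewhere" inconsistency cannot arise.
func (p *lifecycle) decideNextStep() {
	r := <-p.resultCollected
	fmt.Printf("handling attempt %d (err=%v) in one place\n", r.attemptID, r.err)
}

func main() {
	p := &lifecycle{resultCollected: make(chan attemptResult, 1)}
	go p.collectResult(7, nil)
	p.decideNextStep()
}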

yyforyongyu avatar Feb 07 '25 13:02 yyforyongyu

Do we have to do anything to unstick these payments? Both of the nodes I manage have payments that have been stuck for years now.

I just upgraded both to v0.18.5-beta and they're still stuck. This is my current record holder, at 1051 days:

{
    "level": "INFO",
    "human_time": "11:46:35.523 Wed 19 Feb",
    "message": "⛱️ Payment:  16397 amount:     99,933 sat dest: 02c561c74e pre_image: 0000000000000000000000000000000000000000000000000000000000000000 in flight: 1051 days, 21 hours 14:11:24 status: IN_FLIGHT 0b8745af34b592bccf9874e310cf284c077e4c163197bbd4ef57909deb38fef0",
    "timestamp": "2025-02-19T11:46:35.523000+00:00",
    "logger": "lnd_monitor_v2",
    "module": "lnd_monitor_v2",
    "function": "payment_report",
    "line": 286,
    "thread_name": "MainThread",
    "payment": {
        "payment_hash": "0b8745af34b592bccf9874e310cf284c077e4c163197bbd4ef57909deb38fef0",
        "value": "99933",
        "creation_date": "1649081484",
        "payment_preimage": "0000000000000000000000000000000000000000000000000000000000000000",
        "value_sat": "99933",
        "value_msat": "99933000",
        "payment_request": "lnbc999330n1p3y4lr4pp5pwr5tte5kkftenucwn33pnegfsrhunqkxxtmh48027gfm6eclmcqdqv23jhxarfdenscqzpgxqrrsssp58rhr63w4fd40ju64n9nd32pp72lfr3zdmwau629dpapfdzlw098s9qyyssqlxstd8n6fjz2h8zts670tccvzj72etcxzt20rga6yfhc6gm92mry6pjz2j3d8wjkrt7ad9cw3vzqvqtj837595r3r6mj4tthedwjuqsq2tum0y",
        "status": "IN_FLIGHT",
        "creation_time_ns": "1649081484390479855",
        "htlcs": [
            {
                "route": {
                    "total_time_lock": 730447,
                    "total_amt": "99933",
                    "hops": [
                        {
                            "chan_id": "772300265944121345",
                            "chan_capacity": "99933",
                            "amt_to_forward": "99933",
                            "expiry": 730447,
                            "amt_to_forward_msat": "99933000",
                            "pub_key": "02c561c74e33ff91a8d6e85f6ad53217166f17a65cfbdbbb176ec89eb2967f3c19",
                            "tlv_payload": true,
                            "mpp_record": {
                                "total_amt_msat": "99933000",
                                "payment_addr": "OO49RdVLavlzVZlm2Kgh8r6RxE3bu80orQ9ClovueU8="
                            }
                        }
                    ],
                    "total_amt_msat": "99933000"
                },
                "attempt_time_ns": "1649081484441813502",
                "attempt_id": "33007"
            }
        ],
        "payment_index": "16397"
    }
}

brianoflondon avatar Feb 19 '25 15:02 brianoflondon

This PR wasn't part of 0.18.5. It's in the 0.19 milestone.

guggero avatar Feb 19 '25 15:02 guggero

Ahhh thanks, sorry missed that.

brianoflondon avatar Feb 19 '25 15:02 brianoflondon