
[bug]: Payment Flow: Don't abort payment while HTLC attempts are still in flight

Open ziggie1984 opened this issue 1 year ago • 4 comments

Remarked in https://github.com/lightningnetwork/lnd/pull/8174

Analysed some of the behavior here:

https://github.com/lightningnetwork/lnd/pull/8174#issuecomment-2264125375

What I think might be worth changing (based on my understanding of the codebase):

When sending a payment, LND will by default attempt an MPP payment with up to 16 shards per payment. We launch all of them asynchronously; however, when one of the HTLC attempts fails during the process for specific reasons, e.g. requesting a route fails or we cannot register a payment attempt:

https://github.com/lightningnetwork/lnd/blob/7e1e0545113d7361ed1651e01003c9d30dcc42c9/routing/payment_lifecycle.go#L258-L280

we abort the payment cycle without waiting for the INFLIGHT HTLC attempts to be resolved; they are only reconsidered during a restart.
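A rough sketch of that pattern (placeholder names like `requestRoute` and `sendShard`, not lnd's actual code) showing how the early return leaves already-launched shards unresolved:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// attemptResult is a stand-in for the outcome of one HTLC attempt.
type attemptResult struct {
	id  int
	err error
}

// requestRoute stands in for the router's RequestRoute call; here it simply
// fails for one shard to trigger the early abort.
func requestRoute(shard int) error {
	if shard == 3 {
		return errors.New("unable to find route")
	}
	return nil
}

// sendShard stands in for dispatching the HTLC and collecting its result
// asynchronously.
func sendShard(id int, results chan<- attemptResult, wg *sync.WaitGroup) {
	defer wg.Done()
	results <- attemptResult{id: id}
}

func main() {
	results := make(chan attemptResult, 16)
	var wg sync.WaitGroup

	for shard := 0; shard < 16; shard++ {
		if err := requestRoute(shard); err != nil {
			// The problematic early return: the lifecycle aborts
			// even though earlier shards are still in flight, so
			// their results are only reconsidered after a restart.
			fmt.Println("aborting lifecycle:", err)
			return
		}

		wg.Add(1)
		go sendShard(shard, results, &wg)
	}

	wg.Wait()
	close(results)
	for r := range results {
		fmt.Println("resolved attempt", r.id)
	}
}
```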

One of the examples described in https://github.com/lightningnetwork/lnd/pull/8174#issuecomment-2264125375 is that as soon as the MPP times out on the receiver side while we are in the process of sending another HTLC, we fail the payment lifecycle without resolving all the INFLIGHT HTLCs.

This has downsides:

  1. On the one hand, the payment will stay inflight forever, and service providers using the API might not be aware that this payment has no chance of resolving successfully, for example because the MPPTimeout has already been reached.

  2. Moreover, when subscribed to via the TrackPayment API, the payment stays INFLIGHT/PENDING forever.

Possible solutions, interested in feedback here:

I think it makes sense to store the HTLC attempts in memory during the execution of the payment so that the payment lifecycle can always be aware of how many HTLCs have been sent? I store them in the DB, but I think it might be worthwhile to also have them easily accessible in the lifecycle struct, maybe? (Currently we only have one golang channel which listens for the potential results of all the result collectors.)
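Something along these lines is what I have in mind, as a hypothetical sketch (the `inFlightTracker` type and its methods are made up, not existing lnd code):

```go
package routing

import "sync"

// inFlightTracker is a hypothetical in-memory view of outstanding HTLC
// attempts that the lifecycle struct could hold next to the DB records.
type inFlightTracker struct {
	mu       sync.Mutex
	attempts map[uint64]struct{} // attempt IDs currently in flight
}

func newInFlightTracker() *inFlightTracker {
	return &inFlightTracker{attempts: make(map[uint64]struct{})}
}

func (t *inFlightTracker) add(id uint64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.attempts[id] = struct{}{}
}

func (t *inFlightTracker) resolve(id uint64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.attempts, id)
}

// numInFlight lets the lifecycle check, at any point, whether it still has
// results to wait for before it may exit.
func (t *inFlightTracker) numInFlight() int {
	t.mu.Lock()
	defer t.mu.Unlock()
	return len(t.attempts)
}
```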

ziggie1984 avatar Aug 05 '24 10:08 ziggie1984

I think the question is, should we ever abort the payment lifecycle even when there are inflight HTLCs? The answer is yes, but only if we get a non-MPP-related error, such as a DB-related error, or other lower-level critical errors.
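As an illustration only (the error values and the `shouldAbortNow` helper are hypothetical, not lnd's actual API), the rule could look like:

```go
package routing

import "errors"

// Hypothetical error classes; lnd's real errors differ.
var (
	errCritical   = errors.New("db or other lower-level failure")
	errMPPTimeout = errors.New("payment-level failure, e.g. MPP timeout")
)

// shouldAbortNow encodes the rule above: a lower-level critical error aborts
// the lifecycle immediately, while a payment-level failure only stops new
// attempts and lets the loop keep draining in-flight HTLCs.
func shouldAbortNow(err error, numInFlight int) bool {
	if errors.Is(err, errCritical) {
		return true
	}

	return numInFlight == 0
}
```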

The previous assumption is, we should only rely on this func to decide if we want to abort the loop, https://github.com/lightningnetwork/lnd/blob/ab96de3f074e008a019e731701c9cca934fa1b11/routing/payment_lifecycle.go#L233-L237

which works most of the time given the setup that,

  1. the invoice has a timeout of 60s
  2. finding a route via RequestRoute is fast (milliseconds)
  3. creating and sending the HTLC attempt is fast.

but it can break if,

  1. one or more HTLCs are in flight.
  2. the sender node shuts down before the invoice times out and restarts after the invoice times out.
  3. we call decideNextStep before the payment is marked as failed, and proceed to register another attempt.
  4. we call RegisterAttempt, now it fails as we cannot register attempts on a failed payment.

The root cause is we have two different views of the payment in two goroutines,

  • G1: in resumePayment, we always fetch the latest payment state for each iteration, and make decisions based on this state.
  • G2: in collectResultAsync, we will alter the payment state based on the collected result from htlcswitch.

The simple fix is to always read and write to the DB in the same goroutine, which should happen in resumePayment, since the only slow action is collecting results from htlcswitch via https://github.com/lightningnetwork/lnd/blob/ab96de3f074e008a019e731701c9cca934fa1b11/routing/payment_lifecycle.go#L509-L511
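A minimal sketch of that single-writer idea, assuming made-up stand-ins (`switchResult`, `paymentDB`) rather than the real control tower types: the collector goroutines only forward the raw switch outcome, and resumePayment alone touches the DB.

```go
package routing

// switchResult is a hypothetical raw outcome forwarded from htlcswitch.
type switchResult struct {
	attemptID uint64
	err       error // nil means the HTLC settled
}

// paymentDB is a stand-in for the payment control tower / channeldb layer.
type paymentDB struct{}

func (db *paymentDB) settleAttempt(id uint64) error { return nil }
func (db *paymentDB) failAttempt(id uint64) error   { return nil }

// collectResultAsync only forwards the switch's result; it performs no DB
// writes of its own.
func collectResultAsync(id uint64, fromSwitch <-chan error, out chan<- switchResult) {
	go func() {
		out <- switchResult{attemptID: id, err: <-fromSwitch}
	}()
}

// resumePayment is then the only goroutine that mutates payment state, so
// its view of the payment can never race with the collectors'.
func resumePayment(db *paymentDB, results <-chan switchResult, numAttempts int) error {
	for i := 0; i < numAttempts; i++ {
		r := <-results

		if r.err != nil {
			if err := db.failAttempt(r.attemptID); err != nil {
				return err
			}
			continue
		}

		if err := db.settleAttempt(r.attemptID); err != nil {
			return err
		}
	}

	return nil
}
```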

A better fix is to refactor deeper to make the DB actions clearer, which I think can be accomplished in the upcoming native payment SQL PRs.

I think it makes sense to store the HTLC attempts in memory during the execution of the payment so that the payment lifecycle can always be aware of how many HTLCs have been sent?

The more we rely on ephemeral state, the less fault tolerant it becomes. Suppose a power outage happens: all the results will be lost - unless we can reconstruct them, say, if htlcswitch has stored them on disk.

yyforyongyu avatar Aug 06 '24 07:08 yyforyongyu

I think the question is, should we ever abort the payment lifecycle even when there are inflight HTLCs? The answer is yes, but only if we get a non-MPP-related error, such as a DB-related error, or other lower-level critical errors.

Hmm, I would rather make sure we resolve all INFLIGHT HTLCs even if we have a lower-level critical error, so that we mark this payment and all its HTLCs as somehow unrecoverable and at least the trackpayment cmd signals the exit of the payment correctly.

but it can break if,

  1. one or more HTLCs are in flight.
  2. the sender node shuts down before the invoice times out and restarts after the invoice times out.
  3. we call decideNextStep before the payment is marked as failed, and proceed to register another attempt.
  4. we call RegisterAttempt, now it fails as we cannot register attempts on a failed payment.

Do you mean those steps all must happen in sequential order for the payment lifecycle to break, or does every point represent a case where the payment lifecycle fails? I guess it's the former, but I don't think point 2 is necessary for it to break?

The simple fix is to always read and write to the DB in the same goroutine, which should happen in resumePayment. Since the only slow action is to collect results from htlcswitch via,

Depending on when the payment SQL refactor is planned, I would prefer going with a simple fix first and then proceeding from there. Your recommendation of writing to the DB in resumePayment is definitely a good way to solve it, and it seems not super complicated.

While looking through the code I am wondering why we mutate the state of the payment in a fetchPayment method:

https://github.com/lightningnetwork/lnd/blob/ab96de3f074e008a019e731701c9cca934fa1b11/channeldb/payments.go#L274

https://github.com/lightningnetwork/lnd/blob/ab96de3f074e008a019e731701c9cca934fa1b11/channeldb/payments.go#L314-L319

I think we should not change the state in a fetch method, or we should call this fetch method an update, wdyt?
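For illustration, the separation could look roughly like this (hypothetical signatures, not the actual channeldb API):

```go
package channeldb

// payment is a placeholder for the record loaded from the DB.
type payment struct {
	status string
}

// fetchPayment is read-only: it loads the record and leaves any state
// transition to an explicit update call.
func fetchPayment(id uint64) (*payment, error) {
	// ... load from the bucket, no writes ...
	return &payment{status: "IN_FLIGHT"}, nil
}

// updatePaymentStatus makes the mutation a separate, clearly named step.
func updatePaymentStatus(id uint64, status string) error {
	// ... write the new status back to the bucket ...
	return nil
}
```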

ziggie1984 avatar Aug 06 '24 10:08 ziggie1984

This keeps happening for us on a regular basis on mainnet. There is no way to prevent it apart from not using MPP and even that is just an educated guess.

michael1011 avatar Aug 27 '24 11:08 michael1011

Following this issue at somewhat of a distance at the moment, but very curious to see the native payment SQL PRs and any continued development on @yyforyongyu's comment:

unless we can reconstruct them, say, if htlcswitch has stored it on disk.

Coming from the perspective of running the ChannelRouter remotely from the Switch and allowing communication via RPC, I have found myself getting the feeling that the Switch store might need some upgrades. In such a scenario, you have the ChannelRouter driving state transitions for the status of a locally initiated payment (e.g. non-forward) using state maintained by a remote Switch. The existing networkResultStore may not be quite sufficient for this.
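To make the thought concrete, a loose sketch of the kind of abstraction the ChannelRouter could depend on (entirely hypothetical names, not the existing networkResultStore API):

```go
package htlcswitch

// htlcResult is a hypothetical final outcome of a locally initiated HTLC.
type htlcResult struct {
	attemptID uint64
	settled   bool
	preimage  [32]byte
}

// resultStore is the kind of interface the ChannelRouter could use whether
// the Switch lives in-process or is reached over RPC; the Switch side would
// persist results so they survive restarts of either component.
type resultStore interface {
	// SubscribeResult returns a channel that delivers the final outcome
	// of the given attempt once it is known.
	SubscribeResult(attemptID uint64) (<-chan htlcResult, error)
}
```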

calvinrzachman avatar Oct 04 '24 16:10 calvinrzachman

This keeps happening for us on a regular basis on mainnet. There is no way to prevent it apart from not using MPP and even that is just an educated guess.

BTW we still have several cases per week, we can't wait to have this fix in. If you need anything from us, don't be shy!

kilrau avatar Nov 21 '24 15:11 kilrau

#9150 should fix this issue.

@kilrau maybe you could provide some relevant debug logs? Just to make sure #9150 covers it.

yyforyongyu avatar Nov 21 '24 17:11 yyforyongyu

Seems like we are affected by this problem as well. We have seen some payments using MPP get stuck as InFlight on our nodes. The pattern so far is that there are failed HTLCs and one or more HTLCs that are stuck as InFlight, even after they expire. Restarting the nodes resolved these payments to a Failed state.

Is there any other reasonable workaround, since restarting the nodes can be painful?

nordbjorn avatar Nov 22 '24 10:11 nordbjorn

Also experiencing this issue

lnproxy avatar Jan 16 '25 17:01 lnproxy