Payment status: IN_FLIGHT (stuck) - (updated) -> v0.13.1-beta.rc2
Background
We have payments that are stuck in "IN_FLIGHT" even after upgrading from v0.13.0rc3-beta to v0.13.0-beta.
Here we will analyze just one of these payments, with payment hash 41abd340f698c0db087b0d9f15cc0a514affd142108eed71d16a12c65d223b30.
Your environment
- version of lnd: v0.13.0-beta
- which operating system (uname -a on *Nix): Ubuntu 20.04
- version of btcd, bitcoind, or other backend: bitcoind (zmq)
- any other relevant environment details: docker
Steps to reproduce
Unsure; supposedly this should have auto-resolved. The expiration is 5000+ blocks behind the current block height.
Expected behaviour
Auto-resolve to a failed payment, I would assume?
Actual behaviour
Still shows IN_FLIGHT using lncli trackpayment
Please see pastebin for details: https://pastebin.com/Ep8Sr2Sn
^ The above contains the output of lncli trackpayment and, after some pagination, the output of lncli listpayments.
Thanks for the details @miketwenty1!
The specific issue we thought this could be, which was patched in 0.13, occurs when a single htlc shard fails. That is not the case here, so this looks like something else.
Could you look up the following for me:
- Is there an htlc with this hash on the outgoing channel 742948803069739015?
- Do you have any logs with this payment hash?
I'm carla on the lnd slack if you'd rather DM me this information than post it here. Feel free to xxx out any identifiable info; we just need to see the payment logs.
I messaged you the information on slack; we can discuss there and post our conclusions/findings here afterwards.
Some updates from discussion on slack:
- Stuck payment has a single htlc whose timelock expired at block 682566, which was mined on 08 May
- The htlc is no longer present on the outgoing channel of its route
- Logs had been rotated, so we restarted lnd to get fresh payment resume logs:
2021-06-22 15:22:31.266 [INF] CRTR: Resuming payment shard 389058 for payment 41abd340f698c0db087b0d9f15cc0a514affd142108eed71d16a12c65d223b30
2021-06-22 15:22:31.327 [DBG] HSWC: Subscribing to result for paymentID=389058
This doesn't look like the stuck-in-flight issue we've previously seen, because the payment has a single shard and the htlc does not have any failures (both characteristics of the previous bug). Since the payment remains stuck in this state, it seems we never get a result from the switch.
Update: Talked to @halseth, who was involved in the original stuck-HTLC fix (#5214). Though we agree that the payments don't look exactly like those caused by the fixed bug, there are still some similarities (the HTLC isn't around anymore, no update from the switch). That's why we think this might have been caused by the same bug (the payments were made before the first RC of v0.13.0 was released).
Therefore we probably need to look into cleaning up the payment state manually and then observe whether this indeed no longer happens with v0.13.0.
@yyforyongyu is going to take a look as well, thanks a lot!
Yeh, agreed @guggero; unless we have a clean slate, it's hard to determine whether this is new, or was fixed but was created using an older version that allowed this scenario to arise.
Read the logs and found the following lines, which show the payment is still stuck in flight under the new version of lnd.
2021-06-22 15:21:36.886 [INF] LTND: Version: 0.13.0-beta commit=v0.13.0-beta, build=production, logging=default, debuglevel=CNCT=debug,HSWC=debug
2021-06-22 15:22:31.176 [INF] CRTR: Resuming payment 41abd340f698c0db087b0d9f15cc0a514affd142108eed71d16a12c65d223b30
2021-06-22 15:22:31.266 [INF] CRTR: Resuming payment shard 389058 for payment 41abd340f698c0db087b0d9f15cc0a514affd142108eed71d16a12c65d223b30
2021-06-22 15:22:31.327 [DBG] HSWC: Subscribing to result for paymentID=389058
2021-06-22 15:20:55.252 [ERR] CRTR: Resuming payment 41abd340f698c0db087b0d9f15cc0a514affd142108eed71d16a12c65d223b30 failed: router shutting down.
As @carlaKC mentioned above, from the logs the collectResult goroutine seems to be stuck at the line case result, ok = <-resultChan:, where resultChan is created here:
```go
// Now ask the switch to return the result of the payment when
// available.
resultChan, err := p.router.cfg.Payer.GetPaymentResult(
	attempt.AttemptID, p.identifier, errorDecryptor,
)

switch {
// If this attempt ID is unknown to the Switch, it means it was never
// checkpointed and forwarded by the switch before a restart. In this
// case we can safely send a new payment attempt, and wait for its
// result to be available.
case err == htlcswitch.ErrPaymentIDNotFound:
	log.Debugf("Attempt ID %v for payment %v not found in "+
		"the Switch, retrying.", attempt.AttemptID,
		p.identifier)

	attempt, cErr := p.failAttempt(attempt, err)
	if cErr != nil {
		return nil, cErr
	}

	return &shardResult{
		attempt: attempt,
		err:     err,
	}, nil
```
The resultChan blocks until a *PaymentResult is received.
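For illustration, here is a minimal, self-contained sketch (not lnd's actual code; waitForResult and paymentResult are stand-ins) of how blocking on such a result channel behaves. If the switch never delivers a result, the only way out of the select is the quit channel, which matches the "router shutting down" error seen in the logs above.

```go
package main

import (
	"errors"
	"fmt"
)

// paymentResult is a stand-in for the switch's *PaymentResult.
type paymentResult struct {
	err error
}

// waitForResult blocks on resultChan the same way collectResult does. If
// nothing is ever sent on resultChan, the only way out is the quit channel.
func waitForResult(resultChan <-chan *paymentResult,
	quit <-chan struct{}) (*paymentResult, error) {

	select {
	case result, ok := <-resultChan:
		if !ok {
			return nil, errors.New("result channel closed")
		}
		return result, nil

	case <-quit:
		return nil, errors.New("router shutting down")
	}
}

func main() {
	resultChan := make(chan *paymentResult, 1)
	quit := make(chan struct{})

	// Simulate the switch delivering a result; without this send, the
	// call below would block forever.
	resultChan <- &paymentResult{}

	res, err := waitForResult(resultChan, quit)
	fmt.Println(res, err)
}
```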
Meanwhile, inside GetPaymentResult we have two paths to get the result:
- If the payment attempt is not found in the circuit map, we call getResult, which returns the error ErrPaymentIDNotFound if that attempt is not found.
- If the attempt is found in the circuit map, we call subscribeResult. Even if the attempt's result is not found, we won't return ErrPaymentIDNotFound, and the returned channel resultChan will wait forever.
This could be a problem if the attempt's result cannot be found (maybe cleaned up via htlcswitch.CleanStore) yet we take the second path inside GetPaymentResult. The error ErrPaymentIDNotFound won't be returned, so we never reach the line case err == htlcswitch.ErrPaymentIDNotFound: above, and this scenario would produce logs like the ones quoted above.
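To make the two paths concrete, here is a simplified, self-contained illustration of the logic described above. This is not the actual htlcswitch code; the maps standing in for the circuit map and result store, and the function getPaymentResult, are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// paymentResult is a stand-in for the switch's *PaymentResult.
type paymentResult struct{ err error }

var errPaymentIDNotFound = errors.New("payment ID not found")

// Hypothetical in-memory stand-ins for the circuit map and result store.
var (
	circuits      = map[uint64]bool{}
	storedResults = map[uint64]*paymentResult{}
)

// getPaymentResult mirrors the two paths described above.
func getPaymentResult(attemptID uint64) (<-chan *paymentResult, error) {
	// Path 1: the attempt is not in the circuit map. Return the stored
	// result, or errPaymentIDNotFound if there is none, so the router
	// can fail and retry the attempt.
	if !circuits[attemptID] {
		result, ok := storedResults[attemptID]
		if !ok {
			return nil, errPaymentIDNotFound
		}

		ch := make(chan *paymentResult, 1)
		ch <- result
		return ch, nil
	}

	// Path 2: the attempt is in the circuit map, so subscribe for the
	// result. If the stored result was already cleaned up and never
	// arrives, nothing is ever sent on this channel and the caller
	// blocks forever.
	return make(chan *paymentResult, 1), nil
}

func main() {
	circuits[389058] = true // attempt still present in the circuit map

	ch, err := getPaymentResult(389058)
	fmt.Println(ch != nil, err) // a channel is returned; the caller will wait on it
}
```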
Remaining questions (notes for myself):
- How could we land on the first path in GetPaymentResult?
- How/when do we clean the circuit map?
- How/when is the status of the HTLC attempt modified?
A possible mitigation is to put a timeout on the router's wait on resultChan. Meanwhile, yeah, we could just mark the old in-flight payments as failed, since their HTLCs are far past their deadlines.
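A rough sketch of what that timeout could look like, reusing the same stand-in paymentResult type; the duration and names are purely illustrative, not a proposed lnd API.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// paymentResult is a stand-in for the switch's *PaymentResult.
type paymentResult struct{ err error }

// waitForResultWithTimeout gives up after the timeout instead of waiting on
// the switch forever, so the payment can be failed rather than staying
// IN_FLIGHT indefinitely.
func waitForResultWithTimeout(resultChan <-chan *paymentResult,
	timeout time.Duration) (*paymentResult, error) {

	select {
	case result, ok := <-resultChan:
		if !ok {
			return nil, errors.New("result channel closed")
		}
		return result, nil

	case <-time.After(timeout):
		return nil, errors.New("timed out waiting for payment result")
	}
}

func main() {
	resultChan := make(chan *paymentResult) // nothing is ever sent

	_, err := waitForResultWithTimeout(resultChan, 100*time.Millisecond)
	fmt.Println(err) // "timed out waiting for payment result"
}
```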
What remains unclear is that we should have seen the log printed from this line, since the call to CollectResultAsync is non-blocking, unless payment, currentState, err := p.fetchPaymentState() or shardHandler.checkShards() blocks. I've messaged @miketwenty1 on slack hoping to get more recent logs, thanks!
I did respond to @yyforyongyu; I look forward to working with you. I provided the information you asked for on slack.
The above was a fat finger.
Yeh, agreed @guggero; unless we have a clean slate, it's hard to determine whether this is new, or was fixed but was created using an older version that allowed this scenario to arise.
I know there was some concern about whether new ones were coming in or whether this is an artifact of a previous version. It seems that for every stuck IN_FLIGHT payment, the "creation_date" unix timestamp is May 7th 2021 or earlier: https://pastebin.com/Y92t3Cc0 We currently have 676 IN_FLIGHT payments for this node.
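For reference, a hypothetical Go helper (not part of lnd; connection setup with the TLS cert and macaroon is omitted, and countInFlightByDay is an illustrative name) for tallying IN_FLIGHT payments by creation date over lnd's gRPC API could look like this:

```go
package stuckpayments

import (
	"context"
	"time"

	"github.com/lightningnetwork/lnd/lnrpc"
)

// countInFlightByDay tallies IN_FLIGHT payments by their creation date.
func countInFlightByDay(ctx context.Context,
	client lnrpc.LightningClient) (map[string]int, error) {

	resp, err := client.ListPayments(ctx, &lnrpc.ListPaymentsRequest{
		// Stuck payments are incomplete, so they are only returned
		// when this flag is set.
		IncludeIncomplete: true,
	})
	if err != nil {
		return nil, err
	}

	counts := make(map[string]int)
	for _, p := range resp.Payments {
		if p.Status != lnrpc.Payment_IN_FLIGHT {
			continue
		}

		day := time.Unix(p.CreationDate, 0).UTC().Format("2006-01-02")
		counts[day]++
	}

	return counts, nil
}
```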
Chatted with @michael1011; all the stuck payments happened before the fix. In particular, the stats are:
2020-10-07 00:00:00 : 1
2020-12-20 00:00:00 : 1
2020-12-26 00:00:00 : 1
2021-02-17 00:00:00 : 1
2021-02-20 00:00:00 : 1
2021-03-03 00:00:00 : 1
2021-05-06 00:00:00 : 1
2020-09-13 00:00:00 : 2
2021-01-25 00:00:00 : 2
2021-02-07 00:00:00 : 2
2021-02-14 00:00:00 : 2
2021-03-17 00:00:00 : 2
2021-04-14 00:00:00 : 2
2020-12-01 00:00:00 : 3
2021-02-16 00:00:00 : 3
2021-04-15 00:00:00 : 3
2021-04-16 00:00:00 : 3
2020-11-15 00:00:00 : 4
2021-05-05 00:00:00 : 4
2021-02-09 00:00:00 : 6
2021-04-20 00:00:00 : 8
2021-04-27 00:00:00 : 8
2021-04-17 00:00:00 : 16
2021-03-07 00:00:00 : 33
2021-02-28 00:00:00 : 67
2021-04-28 00:00:00 : 142
2021-05-07 00:00:00 : 144
2021-04-21 00:00:00 : 213
I think it's been fixed by #5214; however, it seems we didn't clean up the already-stuck payments that were left dangling from before the fix. I'll draft a PR to address this issue.
We currently have 676 IN_FLIGHT payments for this node.
@michael1011 Just to be sure, all these payments should have been marked as failed, right?
@yyforyongyu, just FYI, I'm @miketwenty1. You tagged the wrong Michael there. 😂
We currently have 676 IN_FLIGHT payments for this node.
@miketwenty1 Just to be sure, all these payments should have been marked as failed, right?
Every "failure_reason" is "FAILURE_REASON_NONE"
Running v0.13.1-beta.rc2 did not resolve this issue, @yyforyongyu (updating the title of the issue).
Thanks for the update! This will be fixed in two steps:
- clean the circuit maps for closed channels (#4364)
- time out payments when loading them from disk (#5461); a rough sketch of that idea follows below
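Purely as an illustration of the second step, and not the actual implementation in #5461, failing dangling payments on startup could look roughly like this (all names here are hypothetical):

```go
package stuckpayments

// inFlightPayment is a hypothetical view of a payment resumed from disk.
type inFlightPayment struct {
	htlcExpiry uint32       // absolute block height of the HTLC timelock
	markFailed func() error // persist the payment as failed
}

// failExpiredOnLoad fails any resumed payment whose HTLC expired before the
// current height, instead of waiting on the switch forever.
func failExpiredOnLoad(payments []inFlightPayment, currentHeight uint32) error {
	for _, p := range payments {
		if p.htlcExpiry >= currentHeight {
			continue
		}

		if err := p.markFailed(); err != nil {
			return err
		}
	}

	return nil
}
```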
Just adding a quick note that this is affecting one of my withdrawal attempts from an LN exchange (southxchange).
Quick note: 2 withdrawal attempts with this issue in Bitcoin Beach Wallet: https://github.com/lightningnetwork/lnd/issues/4788#issuecomment-1001669692
Can someone confirm if it is safe to return the funds? (one of the txs is for more than 1 million sats so we need to be extra sure)
What do you mean by "return the funds"?
The issue is with a payment made by one of the users of Bitcoin Beach Wallet... so we have the tx as pending, i.e., the amount was deducted from the user's balance. We need to know whether it is safe to void the transaction (the user tx).