Payment status: IN_FLIGHT (stuck) - (updated) -> v0.13.1-beta.rc2
Background
We have payments that are stuck in "IN_FLIGHT" even after upgrading from v0.13.0rc3-beta to v0.13.0-beta.
Here we will analyze just one of these payments, with payment hash 41abd340f698c0db087b0d9f15cc0a514affd142108eed71d16a12c65d223b30.
Your environment
- version of lnd: v0.13.0-beta
- which operating system (uname -a on *Nix): Ubuntu 20.04
- version of btcd, bitcoind, or other backend: bitcoind (zmq)
- any other relevant environment details: docker
Steps to reproduce
Unsure; supposedly this should have auto-resolved. The expiration is 5000+ blocks behind the current block height.
Expected behaviour
Auto-resolve to a failed payment, I would assume?
Actual behaviour
Still shows IN_FLIGHT using lncli trackpayment
Please see pastebin for details: https://pastebin.com/Ep8Sr2Sn
^ The above contains the output of lncli trackpayment and, after some pagination, the output of lncli listpayments.
Thanks for the details @miketwenty1!
The specific issue we thought this could be, which was patched in 0.13, occurs when a single htlc shard fails. That is not the case here, so this looks like something else.
Could you look up the following for me:
- Is there an htlc with this hash on the outgoing channel 742948803069739015?
- Do you have any logs with this payment hash?
I'm carla on the lnd slack if you'd rather DM me this information than post it here. Feel free to xxx out any identifiable info; we just need to see the payment logs.
I messaged you the information on slack; we can discuss there and post our conclusions/findings here afterwards.
Some updates from discussion on slack:
- Stuck payment has a single htlc whose timelock expired at block 682566, which was mined on 08 May
- The htlc is no longer present on the outgoing channel of its route
- Logs had been rotated, so we restarted lnd to get fresh payment resume logs:
2021-06-22 15:22:31.266 [INF] CRTR: Resuming payment shard 389058 for payment 41abd340f698c0db087b0d9f15cc0a514affd142108eed71d16a12c65d223b30
2021-06-22 15:22:31.327 [DBG] HSWC: Subscribing to result for paymentID=389058
This doesn't look like the stuck-in-flight issue we've previously seen, because the payment has a single shard and the htlc does not have any failures (both characteristics of the previous bug). Since the payment remains stuck in this state, it seems we never get a result from the switch.
Update: Talked to @halseth, who was involved in the original stuck-HTLC fix (#5214). Though we agree that the payments don't look exactly like those caused by the fixed bug, there are still some similarities (the HTLC isn't around anymore, no update from the switch). That's why we think this might have been caused by the same bug (the payments were made before the first RC of v0.13.0 was released).
Therefore we probably need to look into cleaning up the payment state manually and then observe whether this indeed no longer happens with v0.13.0.
@yyforyongyu is going to take a look as well, thanks a lot!
Yeh, agreed @guggero; unless we have a clean slate, it's hard to determine whether this is new, or was fixed but was created using an older version that allowed this scenario to arise.
Read the logs and found the following lines, which show the payment is still stuck in flight under the new version of lnd.
2021-06-22 15:21:36.886 [INF] LTND: Version: 0.13.0-beta commit=v0.13.0-beta, build=production, logging=default, debuglevel=CNCT=debug,HSWC=debug
2021-06-22 15:22:31.176 [INF] CRTR: Resuming payment 41abd340f698c0db087b0d9f15cc0a514affd142108eed71d16a12c65d223b30
2021-06-22 15:22:31.266 [INF] CRTR: Resuming payment shard 389058 for payment 41abd340f698c0db087b0d9f15cc0a514affd142108eed71d16a12c65d223b30
2021-06-22 15:22:31.327 [DBG] HSWC: Subscribing to result for paymentID=389058
2021-06-22 15:20:55.252 [ERR] CRTR: Resuming payment 41abd340f698c0db087b0d9f15cc0a514affd142108eed71d16a12c65d223b30 failed: router shutting down.
As @carlaKC mentioned above, from the logs the collectResult goroutine seems to be stuck at the line case result, ok = <-resultChan:, where resultChan is created here:
```go
// Now ask the switch to return the result of the payment when
// available.
resultChan, err := p.router.cfg.Payer.GetPaymentResult(
	attempt.AttemptID, p.identifier, errorDecryptor,
)

switch {
// If this attempt ID is unknown to the Switch, it means it was never
// checkpointed and forwarded by the switch before a restart. In this
// case we can safely send a new payment attempt, and wait for its
// result to be available.
case err == htlcswitch.ErrPaymentIDNotFound:
	log.Debugf("Attempt ID %v for payment %v not found in "+
		"the Switch, retrying.", attempt.AttemptID,
		p.identifier)

	attempt, cErr := p.failAttempt(attempt, err)
	if cErr != nil {
		return nil, cErr
	}

	return &shardResult{
		attempt: attempt,
		err:     err,
	}, nil
```
The resultChan blocks until a *PaymentResult is received.
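For illustration, here is a minimal, self-contained sketch (not lnd's actual code; waitForResult and paymentResult are stand-ins) of how blocking on such a result channel behaves. If the switch never delivers a result, the only way out of the select is the quit channel, which matches the "router shutting down" error seen in the logs above.

```go
package main

import (
	"errors"
	"fmt"
)

// paymentResult is a stand-in for the switch's *PaymentResult.
type paymentResult struct {
	err error
}

// waitForResult blocks on resultChan the same way collectResult does. If
// nothing is ever sent on resultChan, the only way out is the quit channel.
func waitForResult(resultChan <-chan *paymentResult,
	quit <-chan struct{}) (*paymentResult, error) {

	select {
	case result, ok := <-resultChan:
		if !ok {
			return nil, errors.New("result channel closed")
		}
		return result, nil

	case <-quit:
		return nil, errors.New("router shutting down")
	}
}

func main() {
	resultChan := make(chan *paymentResult, 1)
	quit := make(chan struct{})

	// Simulate the switch delivering a result; without this send, the
	// call below would block forever.
	resultChan <- &paymentResult{}

	res, err := waitForResult(resultChan, quit)
	fmt.Println(res, err)
}
```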
Meanwhile, inside GetPaymentResult we have two paths to get the result:
- If the payment attempt is not found in the circuit map, we call getResult, which returns the error ErrPaymentIDNotFound if that attempt is not found.
- If the attempt is found in the circuit map, we call subscribeResult. Even if the attempt's result is not found, we won't return ErrPaymentIDNotFound, and the returned channel resultChan will wait forever.
This could be a problem if the attempt's result cannot be found (maybe cleaned up via htlcswitch.CleanStore) yet we take the second path inside GetPaymentResult. The error ErrPaymentIDNotFound won't be returned, so we never reach the line case err == htlcswitch.ErrPaymentIDNotFound: above, and this scenario would produce logs like the ones quoted above.
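To make the two paths concrete, here is a simplified, self-contained illustration of the logic described above. This is not the actual htlcswitch code; the maps standing in for the circuit map and result store, and the function getPaymentResult, are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// paymentResult is a stand-in for the switch's *PaymentResult.
type paymentResult struct{ err error }

var errPaymentIDNotFound = errors.New("payment ID not found")

// Hypothetical in-memory stand-ins for the circuit map and result store.
var (
	circuits      = map[uint64]bool{}
	storedResults = map[uint64]*paymentResult{}
)

// getPaymentResult mirrors the two paths described above.
func getPaymentResult(attemptID uint64) (<-chan *paymentResult, error) {
	// Path 1: the attempt is not in the circuit map. Return the stored
	// result, or errPaymentIDNotFound if there is none, so the router
	// can fail and retry the attempt.
	if !circuits[attemptID] {
		result, ok := storedResults[attemptID]
		if !ok {
			return nil, errPaymentIDNotFound
		}

		ch := make(chan *paymentResult, 1)
		ch <- result
		return ch, nil
	}

	// Path 2: the attempt is in the circuit map, so subscribe for the
	// result. If the stored result was already cleaned up and never
	// arrives, nothing is ever sent on this channel and the caller
	// blocks forever.
	return make(chan *paymentResult, 1), nil
}

func main() {
	circuits[389058] = true // attempt still present in the circuit map

	ch, err := getPaymentResult(389058)
	fmt.Println(ch != nil, err) // a channel is returned; the caller will wait on it
}
```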
Remaining questions (notes for myself):
- How could we land on the first path in GetPaymentResult?
- How/when do we clean the circuit map?
- How/when is the status of the HTLC attempt modified?
A possible mitigation is to put a timeout on the router's wait on resultChan. Meanwhile, yeah, we could just mark the old in-flight payments as failed, since their HTLCs are far past their deadlines.
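A rough sketch of what that timeout could look like, reusing the same stand-in paymentResult type; the duration and names are purely illustrative, not a proposed lnd API.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// paymentResult is a stand-in for the switch's *PaymentResult.
type paymentResult struct{ err error }

// waitForResultWithTimeout gives up after the timeout instead of waiting on
// the switch forever, so the payment can be failed rather than staying
// IN_FLIGHT indefinitely.
func waitForResultWithTimeout(resultChan <-chan *paymentResult,
	timeout time.Duration) (*paymentResult, error) {

	select {
	case result, ok := <-resultChan:
		if !ok {
			return nil, errors.New("result channel closed")
		}
		return result, nil

	case <-time.After(timeout):
		return nil, errors.New("timed out waiting for payment result")
	}
}

func main() {
	resultChan := make(chan *paymentResult) // nothing is ever sent

	_, err := waitForResultWithTimeout(resultChan, 100*time.Millisecond)
	fmt.Println(err) // "timed out waiting for payment result"
}
```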
What remains unclear is that we should have seen the log printed from this line, since the call to CollectResultAsync is non-blocking, unless payment, currentState, err := p.fetchPaymentState() or shardHandler.checkShards() blocks. I've messaged @miketwenty1 on slack hoping to get more recent logs, thanks!
I did respond to @yyforyongyu; I look forward to working with you. I provided the information you asked for on slack.
The above was a fat finger.
Yeh, agreed @guggero; unless we have a clean slate, it's hard to determine whether this is new, or was fixed but was created using an older version that allowed this scenario to arise.
I know there was some concern about whether new ones were coming in or whether this is an artifact of a previous version. It seems that for every stuck IN_FLIGHT payment, the "creation_date" unix timestamp is May 7th 2021 or earlier: https://pastebin.com/Y92t3Cc0 We currently have 676 IN_FLIGHT payments for this node.
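For reference, a hypothetical Go helper (not part of lnd; connection setup with the TLS cert and macaroon is omitted, and countInFlightByDay is an illustrative name) for tallying IN_FLIGHT payments by creation date over lnd's gRPC API could look like this:

```go
package stuckpayments

import (
	"context"
	"time"

	"github.com/lightningnetwork/lnd/lnrpc"
)

// countInFlightByDay tallies IN_FLIGHT payments by their creation date.
func countInFlightByDay(ctx context.Context,
	client lnrpc.LightningClient) (map[string]int, error) {

	resp, err := client.ListPayments(ctx, &lnrpc.ListPaymentsRequest{
		// Stuck payments are incomplete, so they are only returned
		// when this flag is set.
		IncludeIncomplete: true,
	})
	if err != nil {
		return nil, err
	}

	counts := make(map[string]int)
	for _, p := range resp.Payments {
		if p.Status != lnrpc.Payment_IN_FLIGHT {
			continue
		}

		day := time.Unix(p.CreationDate, 0).UTC().Format("2006-01-02")
		counts[day]++
	}

	return counts, nil
}
```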
Chatted with @michael1011; all the stuck payments happened before the fix. In particular, the stats are:
2020-10-07 00:00:00 : 1
2020-12-20 00:00:00 : 1
2020-12-26 00:00:00 : 1
2021-02-17 00:00:00 : 1
2021-02-20 00:00:00 : 1
2021-03-03 00:00:00 : 1
2021-05-06 00:00:00 : 1
2020-09-13 00:00:00 : 2
2021-01-25 00:00:00 : 2
2021-02-07 00:00:00 : 2
2021-02-14 00:00:00 : 2
2021-03-17 00:00:00 : 2
2021-04-14 00:00:00 : 2
2020-12-01 00:00:00 : 3
2021-02-16 00:00:00 : 3
2021-04-15 00:00:00 : 3
2021-04-16 00:00:00 : 3
2020-11-15 00:00:00 : 4
2021-05-05 00:00:00 : 4
2021-02-09 00:00:00 : 6
2021-04-20 00:00:00 : 8
2021-04-27 00:00:00 : 8
2021-04-17 00:00:00 : 16
2021-03-07 00:00:00 : 33
2021-02-28 00:00:00 : 67
2021-04-28 00:00:00 : 142
2021-05-07 00:00:00 : 144
2021-04-21 00:00:00 : 213
I think it's been fixed by #5214; however, it seems we didn't clean up the already-stuck payments that were left dangling from before the fix. I'll draft a PR to address this issue.
We currently have 676 IN_FLIGHT payments for this node.
@michael1011 Just to be sure, all these payments should have been marked as failed, right?
@yyforyongyu, just FYI, I'm @miketwenty1. You tagged the wrong Michael there. 😂
We currently have 676 IN_FLIGHT payments for this node.
@miketwenty1 Just to be sure, all these payments should have been marked as failed, right?
Every "failure_reason" is "FAILURE_REASON_NONE"
Running v0.13.1-beta.rc2 did not resolve this issue, @yyforyongyu (updating the title of the issue).
Thanks for the update! This will be fixed in two steps:
- clean the circuit maps for closed channels (#4364)
- time out payments when loading them from disk (#5461); a rough sketch of that idea follows below
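Purely as an illustration of the second step, and not the actual implementation in #5461, failing dangling payments on startup could look roughly like this (all names here are hypothetical):

```go
package stuckpayments

// inFlightPayment is a hypothetical view of a payment resumed from disk.
type inFlightPayment struct {
	htlcExpiry uint32       // absolute block height of the HTLC timelock
	markFailed func() error // persist the payment as failed
}

// failExpiredOnLoad fails any resumed payment whose HTLC expired before the
// current height, instead of waiting on the switch forever.
func failExpiredOnLoad(payments []inFlightPayment, currentHeight uint32) error {
	for _, p := range payments {
		if p.htlcExpiry >= currentHeight {
			continue
		}

		if err := p.markFailed(); err != nil {
			return err
		}
	}

	return nil
}
```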
Just adding a quick note that this is affecting one of my withdrawal attempts from an LN exchange (southxchange).
Quick note: 2 withdrawal attempts with this issue in Bitcoin Beach Wallet: https://github.com/lightningnetwork/lnd/issues/4788#issuecomment-1001669692
Can someone confirm if it is safe to return the funds? (one of the txs is for more than 1 million sats so we need to be extra sure)
What do you mean by "return the funds"?
The issue is with a payment made by one of the users of Bitcoin Beach Wallet... so we have the tx as pending, i.e., the amount was deducted from the user's balance. We need to know whether it is safe to void the transaction (the user tx).