lightning icon indicating copy to clipboard operation
lightning copied to clipboard

pay: most payments fail

Open kilrau opened this issue 2 years ago • 18 comments

Issue and Steps to Reproduce

Most payments (as small as 10000 sats) we are trying to send via CLN's pay gRPC are failing with Reason: Ran out of routes to try after 18 attempts: see paystatus which is why we are not using CLN for sending payments in production right now, only receiving. Only opening this issue now because we wanted to make sure it's not sth incredibly stupid on our side. Couldn't find anything obvious.

Observations:

  • Quite often we see a large amount of mpp shards (like 15) where 1 or 2 fail that make the entire payment fail. This is the most evident thing we are observing - why so many shards?
  • Via gRPC CLN gives up trying very fast. Like within seconds.
  • The number of attempts sending the same amount to the same destination varies greatly:
PaymentPending
PaymentFailed

Reason: Ran out of routes to try after 48 attempts: see `paystatus`
---
PaymentPending
PaymentFailed
Reason: Ran out of routes to try after 61 attempts: see `paystatus`
---
PaymentPending
PaymentFailed
Reason: Ran out of routes to try after 18 attempts: see `paystatus`
  • Smaller manual payments via lightning-cli sometimes work fine, whereas they mostly fail on gRPC (just a hunch)
  • Not all gRPC payments fail, some go through, but it's sth like >90% failure rate
  • Routing and all other things work fine
  • Our CLN node definitely is large and well connected enough by now to reliably send the payments we are trying to send
  • Nothing helpful in the logs

We are absolutely clueless on what's going on and if it's simply the amount of mpp shards that make things fail. Any hints appreciated.

getinfo output

{
   "id": "02d96eadea3d780104449aca5c93461ce67c1564e2e1d73225fa67dd3b997a6018",
   "alias": "Boltz|CLN",
   "color": "ff9800",
   "num_peers": 88,
   "num_pending_channels": 0,
   "num_active_channels": 80,
   "num_inactive_channels": 1,
   "address": [
      {
         "type": "ipv4",
         "address": "45.86.229.190",
         "port": 9736
      },
      {
         "type": "ipv6",
         "address": "2a10:1fc0:3::270:a9dc",
         "port": 9736
      },
      {
         "type": "torv3",
         "address": "oo5tkbbpgnqjopdjxepyfavx3yemtylgzul67s7zzzxfeeqpde6yr7yd.onion",
         "port": 9736
      }
   ],
   "binding": [
      {
         "type": "ipv4",
         "address": "0.0.0.0",
         "port": 9736
      },
      {
         "type": "ipv6",
         "address": "::1",
         "port": 9736
      }
   ],
   "version": "v23.08.1",
   "blockheight": 812772,
   "network": "bitcoin",
   "fees_collected_msat": 80795360,
   "lightning-dir": "/home/bitcoin/.lightning/bitcoin",
   "our_features": {
      "init": "08a0000a8a69a2",
      "node": "88a0000a8a69a2",
      "channel": "",
      "invoice": "02000002024100"
   }
}

https://amboss.space/node/02d96eadea3d780104449aca5c93461ce67c1564e2e1d73225fa67dd3b997a6018

kilrau avatar Oct 18 '23 16:10 kilrau

Adding questions @michael1011 previously asked on discord:

  • is there a way to force it to make it split less aggressively? Like eg only to at most 2 parts
  • I am seeing pay terminate and fail wayyy before the limit set by retry_for. Ran out of routes to try after 54 attempts: see paystatus would that error mean that all possible routes have been tried and failed? I can't imagine that to be true, because different attempts of paying the same invoice show different numbers

kilrau avatar Oct 18 '23 16:10 kilrau

Are you sending to nodes with private channels?

kingonly avatar Oct 19 '23 12:10 kingonly

It's pretty big nodes we tried sending to so far so high probability that these have private channels, yes.

kilrau avatar Oct 19 '23 14:10 kilrau

Does the invoice has routing hints is basically what I'm asking.

kingonly avatar Oct 19 '23 16:10 kingonly

Yes almost all do. Will do some test using invoices without routing hints.

kilrau avatar Oct 20 '23 10:10 kilrau

Update: no difference when paying 100k sats invoices without routing hints. image

Again the destination is a big node and we are definitely well connected enough to pay 100k sats invoices.

kilrau avatar Oct 20 '23 12:10 kilrau

So this is very interesting:

  • paying 100k sats invoice - failed
  • paying 50k sats invoice - failed
  • paying 10k sats invoices worked flawlessly for 50x in a row = 500k sats. It always used the exact same route. So there was a route that could have easily handled the larger payments - how can it be that the first two payments didn't use this route? Also CLN didn't have to try much - it knew the route already and used it at first shot for the 10k sats payments.

Btw for these tests I am always using (for my standards) a very high routing fee limit of 0.5% / 5000ppm to make sure this is not causing any failures and obviously always the same destination.

kilrau avatar Oct 20 '23 12:10 kilrau

Another test that simply confirms that there is no difference between lightning-cli and grpc (as expected):

  • lightning-cli pay <100k sats invoice> - failed after 4 attempts.
{
   "code": 210,
   "message": "Ran out of routes to try after 4 attempts: see `paystatus`",
   "attempts": [
      {
         "status": "failed",
         "failreason": "failed: WIRE_TEMPORARY_CHANNEL_FAILURE (reply from remote)",
         "partid": 0,
         "amount_msat": 100000000
      },
      {
         "status": "failed",
         "failreason": "failed: WIRE_TEMPORARY_CHANNEL_FAILURE (reply from remote)",
         "partid": 1,
         "amount_msat": 100000000,
         "parent_partid": 0
      },
      {
         "status": "failed",
         "failreason": "failed: WIRE_TEMPORARY_CHANNEL_FAILURE (reply from remote)",
         "partid": 2,
         "amount_msat": 100000000,
         "parent_partid": 1
      },
      {
         "status": "failed",
         "failreason": "failed: WIRE_TEMPORARY_CHANNEL_FAILURE (reply from remote)",
         "partid": 3,
         "amount_msat": 100000000,
         "parent_partid": 2
      }
   ]
}

paystatus also reveals that CLN wanted to try 10 times but it only tried 4 times. I don't know why.

  • paying 10x 10k sats invoices via grpc worked flawlessly again.

Again same destination, same 5000 ppm routing fee limit for both.

kilrau avatar Oct 20 '23 12:10 kilrau

Test if routing hints play a role: they don't

Same behavior with invoices that contain routing hints:

  • paying 100k sats invoice - failed
  • paying 10x10k sats invoice - success and again used the same route for all 10 payments

Again same destination, same 5000 ppm routing fee limit for both.

kilrau avatar Oct 20 '23 14:10 kilrau

Ran similar tests today and .. it doesn't reproduce as reliably as before. Sometimes the first 100k payment or even larger goes through :/

Same destination as before, same 5000 ppm routing fee limit as before.

Any ideas what else I could try?

kilrau avatar Oct 24 '23 13:10 kilrau

Any ideas what else I could try?

When the payment fails it is possible that we choose the wrong path and we keep choosing the wrong one over and over again.

Or is it possible that some channels as disabled? e.g: On chain fee disagreement?

From your messages, I am not able to track down what is happening, but I think it is also difficult from the pay command to track what is going for you due that we hide the complexity under the hood

vincenzopalazzo avatar Oct 26 '23 10:10 vincenzopalazzo

choose the wrong path and we keep choosing the wrong one over and over again

I would love to follow exactly which routes are being tried and see with which errors they fail but I simply couldn't find how. How can I?

kilrau avatar Oct 26 '23 10:10 kilrau

How one should be able to follow the routes pay is trying, quoting @cdecker from https://discord.com/channels/899980449231814676/899989708065632276/1167497746538840154:

The pay plugin should be logging the path (sequence of scids) before attempting it. Grep for the partid and you can follow the payment relatively easily ( and then you can dice into the adjacent lines when you see something that you'd like to investigate further)

But we decided that for us pay doesn't behave close enough to how we'd expect it to and started mpay, our own payment plugin that simplifies stuff quite a bit (doesn't split payments for now) and remembers routes that worked well for past payments (similar to lnd's mission control). I can see other's struggling with pay too, but I'm happy to close here since I won't be debugging this any further.

kilrau avatar Dec 04 '23 13:12 kilrau

Just to leave this final comment for anyone from the CLN team picking this up: pay in its current form is absolutely unreliable, either give this a proper look or switch it out for something else that ships by default with CLN. Happy to help and share experiences we made with mpay :v:

kilrau avatar Dec 11 '23 09:12 kilrau

@kilrau maybe the reny-pay will be a good target for this, maybe @Lagrang3 is the correct person to explore this path

vincenzopalazzo avatar Dec 11 '23 11:12 vincenzopalazzo

Sure, happy to help and test but we will not be leading this effort. We'll work on our own mpay. The comment was just meant to get attention from the team that pay has serious issues. We even switched our regtest away from pay since it doesn't reliably use a direct channel :/

kilrau avatar Dec 11 '23 11:12 kilrau

@kilrau maybe the reny-pay will be a good target for this, maybe @Lagrang3 is the correct person to explore this path

Thanks for bringing this up. @kilrau, I am working on renepay a new payment plugin that doesn't produce arbitrary MPPs but instead uses @renepickhardt's algorithm; there is a short description of how this works in the documentation and the original PR #6376. Unfortunately, the plugin right now is somewhat unstable. But I am working on it to fix the most evident bugs before release 24.02.

Lagrang3 avatar Dec 11 '23 13:12 Lagrang3

I just installed Boltz Client to test it out today and ran into this. Here is some information in case it's helpful. It is worth noting that my node has a channel open to 026165850492521f4ac8abd9bd8088123446d126f648ca35e60f88177dc149ceb2. @kilrau has informed on Discord that this target node is LND and is usually setup to handle swaps > 1m sats, while there is a CLN node that handles swaps for smaller amounts. My swap is for 100k sats, so it should target the smaller node.

Observations

  1. In the attached paystatus JSON log, the erring_node is this LND node in the last 3/4 attempts
  2. erring_channel is different each time
  3. Only four attempts where made, although the strategy message on the last attempt indicates "Still have 7 attempts left"

Let me know if I can provide more information to help diagnose.

boltz@minibolt:/data/boltz$ boltzcli createreverseswap --currency LBTC 100000
You will receive the withdrawal to the specified onchain address
The fees for this service are:
  - Service fee: 0.2%
  - Boltz lockup fee: 276 satoshis
  - Claim fee: 143 satoshis

? Do you want to continue? Yes
Swap Status: swap.created
Swap ID: 7TSb3cusMVrs

Swap Status: swap.created
Error: rpc error: code = Unknown desc = Error calling method Pay: RpcError { code: Some(210), message: "Ran out of routes to try after 4 attempts: see `paystatus`", data: None }

boltz-paystatus-failed.json

s373nZ avatar Mar 15 '24 15:03 s373nZ

Thanks for keeping the information flowing. Debugging these is always hard because you wont hear back once you fixed it, so that's a big help.

Did you guys have a chance to test again now that #7188 is merged in master to see if it (partially) addresses the issue. @JssDWt found a really well hidden bug that my accounting code had that would result in the channel capacity estimates getting lower and lower as attempts go on.

cdecker avatar Apr 08 '24 09:04 cdecker

The patch from @JssDWt is included inside the point release? https://github.com/ElementsProject/lightning/releases/tag/v24.02.2

I did not check the diff from the previous one, but if it is included we should advertize inside the release description, right?

vincenzopalazzo avatar Apr 08 '24 09:04 vincenzopalazzo

@cdecker FWIW I upgraded to CLN v24.02.2 and boltzd v2.0.2-0b4c946 then ran the exact same command as my previous example, but with the same result. Still an issue for me, if indeed my use case applies to this ticket. :(

s373nZ avatar Apr 08 '24 17:04 s373nZ

Some more anecdotal evidence from me. Default pay failing lately a lot for me. Installed mpay (after fighting with python a bit:) and can confirm mpay can pay what the default pay can't.

urza avatar Jun 05 '24 21:06 urza

mpay is awesome, can we have it for bolt12 invoices as well? 😅

https://github.com/BoltzExchange/boltz-backend/issues/565

urza avatar Jun 10 '24 09:06 urza

Closing in favor of #7180

cdecker avatar Jun 25 '24 10:06 cdecker