Poor route selection

Open rkfg opened this issue 3 years ago • 52 comments

Just had a situation yesterday when Blixt couldn't send a payment because of no liquidity on the route (and I repeated it 3 times after timeouts) while SBW did it in just two attempts (≈2 seconds). Background:

  • I have Blixt connected to my own node via a private channel
  • the payment was to WalletOfSatoshi node using a static LNURL QR
  • I monitored the attempts on my node with lntop
  • the amount was quite small, 10k sats

The attempts went fine through my node (no link failure), they failed somewhere later on the route. After about 10 attempts the payment fails due to timeout, I tried again and then again without success. The total number of attempts should be close to 30. Then I started SBW (also connected to my own node via a private channel), scanned the same code, entered the same amount and from the lntop log I saw this:

  • SBW tried to pay using my direct channel to WoS which was drained down to the reserved limit, it failed
  • SBW then tried to pay through ACINQ with higher fee and succeeded

I know that the LN implementation in SBW is different and doesn't use lnd unlike Blixt. Probably it has different priorities for fee selection so it can sacrifice a few sats to make the payment faster and Blixt tries to save on fees but it results in very poor UX. There's a time_pref parameter in lnd 0.15.0, maybe it can help with this after your version is updated. Or something else can be tweaked to prefer more reliable paths instead of the cheapest ones. In my case the difference was just 5 sats when paid by SBW and Blixt tried to get 0 or 2 at most. I don't think 5 sats are worth the trouble when the wallet doesn't do its primary job.
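
The fee-versus-reliability trade-off described above can be illustrated with a toy model. This is a minimal sketch, not lnd's or SBW's actual pathfinding: the `Route` type, the success probabilities and the `attempt_cost_sat` knob are all assumptions made up for illustration. Only the 0, 2 and 5 sat fees come from the report above.

```python
from dataclasses import dataclass


@dataclass
class Route:
    name: str
    fee_sat: int                # total routing fee for the route
    success_probability: float  # assumed chance that an attempt succeeds, in (0, 1]


def cheapest(routes):
    """Pick purely by fee, ignoring reliability."""
    return min(routes, key=lambda r: r.fee_sat)


def reliability_weighted(routes, attempt_cost_sat):
    """Pick by fee plus a penalty that grows as the success probability drops."""
    return min(
        routes,
        key=lambda r: r.fee_sat + attempt_cost_sat / r.success_probability,
    )


# Hypothetical candidates; the probabilities are invented for this example.
routes = [
    Route("direct channel (drained)", fee_sat=0, success_probability=0.05),
    Route("via cheap peer", fee_sat=2, success_probability=0.2),
    Route("via ACINQ", fee_sat=5, success_probability=0.9),
]

print(cheapest(routes).name)
print(reliability_weighted(routes, attempt_cost_sat=10).name)
```

With a penalty of zero the two selectors agree. Raising the penalty makes the selector accept a few extra sats in exchange for a route that is more likely to settle on the first attempt.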

rkfg avatar Sep 16 '22 13:09 rkfg

If you have only ONE channel with your node, and your node does not find enough paths to forward that tx, where is the issue with Blixt? Open more channels on your Blixt, as we suggest: one with your node, one with the Blixt node, one with ZFR or any other well-positioned node. Then use MPP.

Using just ONE channel with your node, you are ALWAYS dependent on your own node's liquidity and paths. You are concentrating all the flow into just one pipe. Blixt has nothing to do with this; it is your node that couldn't forward the tx correctly. First learn how LN works and how to manage your node correctly. Yes, there are some issues with LN pathfinding in general, but that has nothing to do with Blixt, and especially not with your particular case.

Darth-Coin avatar Oct 08 '22 10:10 Darth-Coin

If you have only ONE channel with your node, and your node do not find enough paths to forward that tx, where is the issue with Blixt?

That's how most mobile wallets operate, no? Why would I need more than one channel to my well connected node that itself has 50+ channels? I'm interested in using my own node, not 3rd parties. And as I said it works perfectly with SBW but they use their own LN implementation. Also just one channel to my node.

Blixt have nothing to do with this, it is your node that couldn't forward correctly the tx.

Incorrect. Blixt does the route search, not my node. It's one of its value propositions compared to trampoline routing in Electrum, for example.

First learn how LN works and manage correctly your node.

Thanks, I guess I have enough knowledge of it, been learning since 2021 down to the commitment structure, HTLC stages and onion routing. I think I can tell what component is responsible for the failures. In my case Blixt was trying to find the cheapest route using my channels which is suboptimal imo. The failures were not on my node, it's not a liquidity problem. The payments failed further down the route and I have nothing to do with it. What Blixt can do, however, is to raise the fee limit and time preference (but it's a newer option as I said so might require lnd update). Then it would choose a more expensive but working route instead of getting stuck in cheap but failing ones.

rkfg avatar Oct 08 '22 11:10 rkfg

You are wrong in your assumptions. Yes, on a mobile node you don't need 50 channels, but at least 3 and up to 5-6 are recommended for reliability. Never depend on only one single pipe. Also, with more channels you can take advantage of MPP. Yes, Blixt is doing the route search, but if it ONLY ever has your node as the 1st hop, all routes will depend on YOUR node, not on Blixt's search. It is your node that is giving the rest of the route, not Blixt. It is exactly like a network of pipes.

You need to learn more about how to manage your routing node. That is the problem. Choose your peers wisely if you want to use it as an LSP for your Blixt mobile.

So again, this is NOT a Blixt issue.

Darth-Coin avatar Oct 08 '22 12:10 Darth-Coin

Have you read my message at all? I told you I don't have these issues with another wallet, SBW, which is connected to my own node in exactly the same way. I did several tests: Blixt struggles to find a route, while SBW does it instantly even though that route might be slightly more expensive. But it DOES work. So yes, this IS a Blixt issue. Why disregard it if it clearly can be improved? My node is only responsible for those 50+ channels but the rest is out of my reach. And the failure was not on those channels, they have plenty of liquidity as I already said several times. Also, MPP should work even with one channel: I did payments with lncli and MPP and it sent multiple shards along the same route in parallel (all hops were identical).

Please don't derail the problem discussion and put the blame on me, it's unprofessional.

rkfg avatar Oct 08 '22 12:10 rkfg

Any ideas, @hsjoberg? It's still a big issue for those who want to use their own nodes. I suppose the core issue is lnd doing a poor job. What I see is that Blixt picks 1-2 peers next to my node and tries to pay via them, completely ignoring the rest and failing many times despite there being a lot of liquidity in some more expensive channels. It looks like this:

Blixt => my full node => peer A => some other nodes
Blixt => my full node => peer B => some other nodes

I set the max fee to 50% and it still didn't help. It appears to me that lnd uses the depth-first Dijkstra search algorithm instead of a more appropriate breadth-first, imo. It's more likely that a peer has low outbound liquidity further down the path if it failed a couple of times so it's probably better to switch to another peer instead of bashing this one for a minute and timing out.
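
The "switch to another peer after a couple of failures" idea can be sketched as a small retry loop. This is a hypothetical illustration only and not how lnd schedules attempts: `pay_with_peer_rotation`, `try_route` and the peer names are invented for the example.

```python
def pay_with_peer_rotation(peers, try_route, max_failures_per_peer=2):
    """Try each peer at most `max_failures_per_peer` times, then move on.

    `try_route(peer)` is a caller-supplied function that returns True when
    a payment attempt through that peer succeeds.
    """
    attempts = []
    for peer in peers:
        for _ in range(max_failures_per_peer):
            attempts.append(peer)
            if try_route(peer):
                return peer, attempts
    return None, attempts


# Hypothetical setup: only "peer C" has enough liquidity further down the path.
liquid = {"peer C"}
winner, attempts = pay_with_peer_rotation(
    ["peer A", "peer B", "peer C"],
    try_route=lambda peer: peer in liquid,
)
print(winner, attempts)
```

Capping the attempts per peer bounds the time spent on a failing branch, so the payment reaches the working peer instead of timing out on the first one.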

Tried bimodal pathfinding, same issue. I'm just trying a small amount like 19-250 sats at https://satoshis.place and Blixt never succeeds. SBW (that uses their own lightning implementation and as such, different pathfinding) pays very quickly after a couple of failed attempts. Both connected to my own full node that, as before, has plenty of liquidity in both directions, reasonable fees and even provides rough liquidity hints with max_htlc.

Also tried resetting the graph and syncing it again from scratch, it finally went through after a couple of timeouts. After the reset Blixt keeps harassing the Boltz route (which is known to have very poor outbound liquidity as they don't give any hints), doesn't even look at the other nodes/channels. Before the reset it was trying with a different node on every try.

rkfg avatar Aug 09 '23 11:08 rkfg

@rkfg try opening a channel to my node from your Blixt Wallet: https://amboss.space/node/028c589131fae8c7e2103326542d568373019b50a9eb376a139a330c8545efb79a

djkazic avatar Oct 04 '23 18:10 djkazic

Thanks, but the point is to use my own node and it should work regardless of other channels, not trying just one channel to a node that fails often. Otherwise I'd rather open to a big well connected node so that I always have both inbound and outbound through it.

rkfg avatar Oct 04 '23 21:10 rkfg

Have you tried using the rapid LN sync feature?

djkazic avatar Oct 04 '23 21:10 djkazic

It sounds like you maybe have an incomplete graph. Show a screenshot of your get network info output.

djkazic avatar Oct 04 '23 21:10 djkazic

[Screenshot: Screenshot_20231005-010322]

rapid LN sync feature

Couldn't find such a setting. Where is it?

rkfg avatar Oct 04 '23 22:10 rkfg

Yeah that's your problem. You got tons of zombies. Update to the latest version. The feature is called scheduled LN channel sync.

djkazic avatar Oct 04 '23 22:10 djkazic

See the changelog for https://github.com/hsjoberg/blixt-wallet/releases/tag/v0.6.8 for more info.

hsjoberg avatar Oct 04 '23 22:10 hsjoberg

I have version 0.6.8 from Google Play, looks like it's the latest one. In "Show node data" both synced to chain and graph are true. Is that a bug then that it accumulates these zombie channels? I remember resetting the graph before and it loaded and worked fine for some time. I enabled scheduled LN channel sync but honestly this shouldn't be such a hard requirement that disabling this setting renders the wallet unusable and this state seems to be non-recoverable automatically, unless I reset the graph again.

rkfg avatar Oct 04 '23 22:10 rkfg

Ah, I remember why I turned this setting off: the startup time became even longer than usual, with a "Syncing lightning network" spinner, which I see now that I restarted the app. I'm not sure what exactly it syncs, because after opening the main activity there's still a sync icon and the channel isn't active until it disappears. Honestly, the startup time is enormously long.

Even after this sync I still have zombies, even more now: 74592.

rkfg avatar Oct 04 '23 22:10 rkfg

That means it's failing to do the gossip sync.

It's a critical part of Blixt that is supposed to stop LND from flagging a majority of the graph as unusable (zombies). I recommend fully stopping the app, clearing the cache, and then starting Blixt. After the sync you should have ~2k zombies.

djkazic avatar Oct 04 '23 22:10 djkazic

The cache was very small in my case, just 1.5 MB. The storage is 2.1 GB though, but clearing it would nuke everything. I cleared the cache and restarted the app, but I don't see any sync progress, as the network info is the same.

rkfg avatar Oct 04 '23 22:10 rkfg

Sounds like something didn't work correctly. You are correct to not clear the data, please do not do that.

Please do the following procedure:

  1. First, force stop Blixt app.
  2. Secondly, clear cache.
  3. Third, launch Blixt.
  4. The scheduled sync page should load.

Also, make sure you are on the latest version of the wallet.

djkazic avatar Oct 04 '23 22:10 djkazic

I did exactly that, I see the "Syncing Lightning Network" page, after that the app starts, the network info is still the same. The version is 0.6.8 as I stated above.

rkfg avatar Oct 04 '23 22:10 rkfg

OK, hold on. I'm doing a server-side refresh.

djkazic avatar Oct 04 '23 22:10 djkazic

Please try the procedure again.

djkazic avatar Oct 04 '23 22:10 djkazic

Now the number of zombies is zero and it looks like the graph is re-syncing. What changed exactly? Is Blixt dependent on some server I don't know about? I set my own node in the settings, I'd rather have zero dependencies on other servers if possible.

rkfg avatar Oct 05 '23 06:10 rkfg

What changed exactly? Is Blixt dependent on some server I don't know about?

@rkfg Yes, there is a server involved in speedloader. See more info here (and also the 0.6.8 changelog as aforementioned): https://twitter.com/BlixtWallet/status/1674029478115266560

There is no hard dependency, no. However, unless your Blixt is always online* in order to receive gossip data, your channel graph database will degrade over time. This is a consequence of gossip messages being missed, which causes channels to be marked as unusable ("zombies") by lnd.
Lnd and Lightning were designed with the assumption that nodes would always be online.

*) We will introduce a persistent app mode in the next version 0.6.9 on Android. This is an alternative to speedloader as Blixt/lnd will always be online in order to receive gossip data. Some Android phone vendors also include an app pinning functionality that lets you achieve something akin to this.
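
The graph degradation described above can be illustrated with a toy staleness check. This is a sketch under stated assumptions and not lnd's code: the 14-day threshold, the `classify_channels` helper and the channel names are hypothetical and only show why a node that misses updates ends up flagging channels.

```python
from datetime import datetime, timedelta

# Assumed staleness threshold, chosen for illustration only.
STALE_AFTER = timedelta(days=14)


def classify_channels(last_update_by_channel, now, stale_after=STALE_AFTER):
    """Split channels into (live, zombies) by the age of their last seen update."""
    live, zombies = [], []
    for chan_id, last_update in last_update_by_channel.items():
        if now - last_update > stale_after:
            zombies.append(chan_id)
        else:
            live.append(chan_id)
    return live, zombies


# A phone that was offline for weeks never saw the newer update for chan-b.
now = datetime(2023, 10, 5)
graph = {
    "chan-a": datetime(2023, 10, 4),
    "chan-b": datetime(2023, 9, 1),
}
live, zombies = classify_channels(graph, now)
print(live, zombies)
```

In this model the channel itself may be perfectly healthy. It is flagged only because the local node missed the gossip that would have refreshed it.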

I set my own node in the settings, I'd rather have zero dependencies on other servers if possible.

Understandable. We launched this initial version of speedloader server hardcoded in Blixt to use https://maps.eldamar.icu/mainnet/graph/graph-001d.db, https://maps.eldamar.icu/mainnet/graph/ and https://maps.eldamar.icu/mainnet/graph/MD5SUMS.
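
Since the server publishes an MD5SUMS file next to the graph database, a downloaded file can in principle be checked against it. The sketch below assumes that MD5SUMS uses the usual `md5sum` line format ("<hex digest>  <filename>"), which is not confirmed in this thread. The helper names are hypothetical and the data is built in memory rather than fetched from the server.

```python
import hashlib


def parse_md5sums(text):
    """Parse md5sum-style lines ("<hex digest>  <filename>") into a dict."""
    sums = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            digest, name = parts
            # md5sum marks binary mode with a leading "*" before the name.
            sums[name.lstrip("*").strip()] = digest.lower()
    return sums


def verify(data, filename, md5sums_text):
    """Return True if the MD5 of `data` matches the listed digest for `filename`."""
    expected = parse_md5sums(md5sums_text).get(filename)
    return expected is not None and hashlib.md5(data).hexdigest() == expected


# Stand-in payload and listing; a real check would use the downloaded bytes.
payload = b"example graph bytes"
listing = f"{hashlib.md5(payload).hexdigest()}  graph-001d.db\n"

print(verify(payload, "graph-001d.db", listing))
print(verify(b"corrupted bytes", "graph-001d.db", listing))
```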

In the next version 0.6.9, we will have a setting for changing the speedloader server and also instructions on how to set it up yourself.

hsjoberg avatar Oct 05 '23 08:10 hsjoberg

Thank you for the clarification, there's no information about how this speedloader thing actually works except in your message. I get that it's a technical quirk but I think it should be explained explicitly because it's a centralizing feature. I hope there will be a way to run such a server on my own infra and specify it in the settings. It's also weird that SBW (the version that still supported LN) doesn't suffer from this.

rkfg avatar Oct 05 '23 08:10 rkfg

Anyone can run this, it's open source software:

  • https://github.com/djkazic/speedloader
  • https://github.com/djkazic/primer

You can even run your own in the future, we're just running the default one because you need a powerful CPU to calculate differential transfer deltas.

It's also an optional feature, using persistent mode also helps prevent graph degradation on Blixt.

Also, SBW does not use LND. So it isn't valid to compare the two directly on this. SBW very likely has a different approach to graph management / pruning.

djkazic avatar Oct 05 '23 14:10 djkazic

Just as an example of it being extensible, Zeus with embedded lnd runs their own primer server for speedloader capabilities. So rest assured it's not a black box

djkazic avatar Oct 05 '23 15:10 djkazic

Thank you! I think this method is better than the persistent mode because it would probably drain the battery if the phone can't sleep. I previously assumed that lnd already downloads these channel announcement deltas because it's in the log (applying gossipFilter start=...) so it looked similar to how bitcoind operates, querying the announcements it missed when it was offline. Turns out it's more fragile. Is it possible in theory to fix it in lnd itself so no external daemon is needed? Or does the LN spec in its current form prevent it?

rkfg avatar Oct 05 '23 15:10 rkfg

The problem isn't lnd per se, it's that lnd was designed for servers and we're running it inconsistently on a phone.

Gossip sync is eventually consistent, so while you could technically restore your zombies to zero organically by having it on all the time that's impractical for most users.

Therefore we compress the set of changes to the channel graph using speedloader: instead of your lnd needing to download all the deltas, you can now download one big delta and apply it.
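
The idea of replacing many small deltas with one big delta can be sketched as collapsing an update log to the newest entry per channel and direction. This is an illustration of the general technique, not speedloader's actual implementation. The `squash_updates` function and the field names are hypothetical.

```python
def squash_updates(updates):
    """Collapse an update log to the newest update per (channel, direction)."""
    latest = {}
    for upd in updates:
        key = (upd["chan_id"], upd["direction"])
        # Keep the entry with the highest timestamp; on ties the later one wins.
        if key not in latest or upd["timestamp"] >= latest[key]["timestamp"]:
            latest[key] = upd
    return list(latest.values())


# Three policy changes for channel 1 and one for channel 2.
log = [
    {"chan_id": 1, "direction": 0, "timestamp": 100, "fee_ppm": 50},
    {"chan_id": 1, "direction": 0, "timestamp": 200, "fee_ppm": 80},
    {"chan_id": 2, "direction": 1, "timestamp": 150, "fee_ppm": 10},
    {"chan_id": 1, "direction": 0, "timestamp": 300, "fee_ppm": 20},
]
delta = squash_updates(log)
print(delta)
```

A client that applies the squashed delta ends up with the same final policies as one that replayed every update, while downloading fewer records.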

djkazic avatar Oct 05 '23 15:10 djkazic

IIRC CLN has a similar piece of software that does this set compression, and Breez wallet bootstrapping is the basis for speedloader tech. It's become a common primitive for mobile lightning environments

djkazic avatar Oct 05 '23 15:10 djkazic

Ah, got it. Yeah, lnd usually sends updates at least once a day, or more often if the channel policy changes, so I'd need to run it for at least 24 hours straight for all active channels to be updated, and even longer to catch up on the inactive ones that become active in the future. Interesting, I've been tinkering with LN for about 2 years now and didn't know about this problem. I expected the node to catch up right after connecting to some peers and querying for what it missed.

rkfg avatar Oct 05 '23 15:10 rkfg

That is exactly what it does. When you see that sync has stopped, that doesn't actually mean you have the full graph

It just means you don't have any additional nodes or edges that your peers know about (but your node doesn't). As a result you can still have an incomplete graph.

The way that zombie marking works is that you need to see an announcement for it to come back, and many times that can be missed with a delta of up to a day IME.

djkazic avatar Oct 05 '23 15:10 djkazic