
[bug]: SCB created after desync bug is not usable

mutatrum opened this issue on Nov 02 '22

Background

After an HDD failure, a user reported problems when recovering from a static channel backup (SCB) that was created on 0.15.3, after the desync described in #7096. The user then tried recovery with a slightly older backup, created before the desync, and that succeeded.

Your environment

LND 0.15.3 / 0.15.4. Other information: TBD

Steps to reproduce

Scenario was as follows:

  • On LND 0.15.3, create an SCB after the node desyncs due to #7096.
  • Start with a clean LND 0.15.4 install.
  • Apply the static channel backup (see the sketch after this list).
  • LND reports 0 channels and 0 balance.
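
A minimal sketch of the restore step on a stock lnd install (the RaspiBlitz recovery wizard wraps these calls; the backup path is illustrative):

```shell
# Recreate the wallet from the 24-word seed (interactive prompts):
lncli create

# Feed the multi-channel SCB file to the fresh node:
lncli restorechanbackup --multi_file /path/to/channel.backup

# Check what lnd thinks it recovered:
lncli pendingchannels
lncli listchannels
lncli walletbalance
```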

Expected behaviour

Even though the node was desynced, the SCB should be usable at all times, especially since recovering with an older SCB file is strongly advised against.

Actual behaviour

LND reported 0 channels, 0 balance.

The broken backup file is apparently about 1 kB larger in size.

The user's report is in this Twitter thread: https://twitter.com/Printer_Gobrrr/status/1587719434592047105
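
A quick way to compare the two encrypted files on disk (file names are placeholders):

```shell
# Compare the two encrypted backup files by size and hash:
ls -l channel.backup.old channel.backup.new
sha256sum channel.backup.old channel.backup.new
```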

mutatrum avatar Nov 02 '22 14:11 mutatrum

Comment from @guggero on Slack:

Would be great to get a diff of both files when running them through chantools dumpbackup (which requires the seed to decrypt them). https://github.com/guggero/chantools/blob/master/doc/chantools_dumpbackup.md
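
Something like the following, assuming chantools prints the decoded dump to stdout (file names are placeholders):

```shell
# Decrypt and dump both backups (chantools prompts for the seed), then diff them:
chantools dumpbackup --multi_file /path/to/channel.backup.working > dump_working.txt
chantools dumpbackup --multi_file /path/to/channel.backup.broken  > dump_broken.txt
diff dump_working.txt dump_broken.txt
```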

mutatrum avatar Nov 02 '22 14:11 mutatrum

After an HDD failure, the SCB should be usable at all times. Especially since recovering with an older SCB file is strongly advised against.

This may depend on the file system itself: if the file system had corruption and lost information, then we wouldn't be able to read the SCB, or the file could have been garbled itself.

Roasbeef avatar Nov 02 '22 17:11 Roasbeef

The broken backup file is apparently about 1 kB larger in size.

If the user is willing, do you think they can share the backup itself? If the file was garbled to the point of not being readable, then it's probably mostly a file system thing.

The nice thing about SCBs is that you can safely replicate them elsewhere: https://github.com/lightningnetwork/lnd/blob/master/docs/safety.md#keeping-static-channel-backups-scb-safe
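
A rough sketch of such replication, assuming the default mainnet path and a reachable backup host (host and paths are placeholders):

```shell
# channel.backup is encrypted with a key derived from the seed, so copying it
# off the node is safe. Default mainnet location shown; adjust to your setup.
SRC=~/.lnd/data/chain/bitcoin/mainnet/channel.backup
rsync -a "$SRC" user@fileserver:/backups/lnd/channel.backup
# e.g. run this from cron every few minutes, or trigger it with inotifywait.
```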

Roasbeef avatar Nov 02 '22 17:11 Roasbeef

Comment from @guggero on Slack:

Would be great to get a diff of both files when running them through chantools dumpbackup (which requires the seed to decrypt them). https://github.com/guggero/chantools/blob/master/doc/chantools_dumpbackup.md

I tried to decrypt them on 2 separate machines, to no avail. The broken backup file cannot be decrypted by chantools. Chantools claims it's not a valid backup file.

SomeBTChomer avatar Nov 03 '22 08:11 SomeBTChomer

If the user is willing, do you think they can share the backup itself? If the file was garbled to the point of not being readable, then it's probably mostly a file system thing.

I'm not sure if I can safely share this file without compromising security. If so, does it help you at all without the corresponding seed?

SomeBTChomer avatar Nov 03 '22 08:11 SomeBTChomer

Thanks for the response. So this does sound like file system or disk data corruption then, as speculated by @Roasbeef. What exact error message did you get with chantools? And I assume you are sure it's the correct seed? Using the wrong seed for decrypting can lead to the same error message as a corrupt file.

If you can't decrypt the file with the seed, then there isn't much we can look at. That's the thing with encrypted files: they are supposed to look like random data. Or at least I wouldn't know what to look at exactly.

guggero avatar Nov 03 '22 09:11 guggero

The HDD of the node failed, but the backup was saved in a different location.

The exact error message from chantools is "Error: backup file is required", which it only prints after I input the correct seed. It managed to decrypt another channel backup with the same seed without problems.

SomeBTChomer avatar Nov 03 '22 09:11 SomeBTChomer

Was the backup written to the failing disk first? Or was it written directly to the other location?

mutatrum avatar Nov 03 '22 10:11 mutatrum

Was the backup written to the failing disk first? Or was it written directly to the other location?

It was written to the backup location, my file server in this case. Other files on this server are intact. I can't say for sure if RaspiBlitz writes the file to the HDD first and then copies it to the redundant location, though.

SomeBTChomer avatar Nov 03 '22 10:11 SomeBTChomer

"Error: backup file is required"

That sounds like the command was incorrect. Did you specify the --multi_file flag?

It was written to the backup location,

Do you mean you have the --backupfilepath= flag (or backupfilepath= config option) set on your node?

guggero avatar Nov 03 '22 11:11 guggero

That sounds like the command was incorrect. Did you specify the --multi_file flag?

I did, but there was a typo in the filename, my bad. See attached file for the output.

Do you mean you have the --backupfilepath= flag (or backupfilepath= config option) set on your node?

I used RaspiBlitz's inbuilt backup script; I admit I don't know how it works under the hood.

SomeBTChomer avatar Nov 03 '22 11:11 SomeBTChomer

Thanks for the file. So this is the larger one, the one not working for recovery? Would you mind sending me the other one (the one that worked) in this format as well please, so I can compare the two? You can also send it to me on Slack if you want.

I used RaspiBlitz's inbuilt backup script

I took a quick look and it seems the RaspiBlitz setup leaves the channel.backup file in its original location, which is on the hard disk. So if the disk fails, then the backup can become corrupted as well. That's what lnd's backupfilepath= option is for: writing the backup to a different location that's not on the same physical disk. That might be worth a feature request on their end (maybe one already exists?).
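
Sketch of what that could look like; the target path is just an example of a location on a different physical disk:

```shell
# Command-line flag variant:
lnd --backupfilepath=/mnt/backup-server/channel.backup

# Or the equivalent line in lnd.conf:
#   backupfilepath=/mnt/backup-server/channel.backup
```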

guggero avatar Nov 03 '22 11:11 guggero

Thanks for the file. So this is the larger one, the one not working for recovery? Would you mind sending me the other one (the one that worked) in this format as well please, so I can compare the two? You can also send it to me on Slack if you want.

The one above was the one that did result in a 0 balance restore. Here's the other output from the backup that worked.

I took a quick look and it seems the RaspiBlitz setup leaves the channel.backup file in its original location, which is on the hard disk. So if the disk fails, then the backup can become corrupted as well.

Thanks for checking. I guess that could explain why the older backup was the one that could be successfully restored.

SomeBTChomer avatar Nov 03 '22 12:11 SomeBTChomer

Thanks for the file. I compared the two and they are identical, also in their size. How big is the file size difference between the encrypted ones? Or did you dump the same file twice by accident?

Maybe the reason for the 0 balance was something else? The node just taking longer to sync to the chain? Or Tor not being able to connect out? Or just not waiting long enough in general (restoring the backup can sometimes take quite a while, as the chain backend needs to do a lot of re-scanning, which especially on a Pi can take literally hours to days).

guggero avatar Nov 03 '22 12:11 guggero

Interesting. The encrypted files differ by 1.2 kB in size: 14.8 kB and 16 kB.

I followed the exact same process to restore the node: set up a fresh RaspiBlitz > connect the HDD with the blockchain on it > choose restore from seed and channel backup > Ctrl+C in the terminal, stop lnd > update lnd to 0.15.4 > restart lnd > wait.

The only difference was that the first "recoverscan" lnd did took about 6 hours and resulted in the 0-balance restore. The second try, with the slightly older file, took ~20 minutes, and it worked.

Edit: The only thing I can think of is that I had to kill the lnd service because "systemctl stop lnd" timed out and did nothing. After killing the lnd service, the update to 0.15.4 went smoothly.

SomeBTChomer avatar Nov 03 '22 13:11 SomeBTChomer

Ah, I think that's the problem. If you stop lnd while it's re-scanning the chain in the initial wallet-recovery phase, it won't continue on the next restart. That's also why the stop did not work: lnd does not allow you to shut down during that period (using lncli stop would've told you that as an error), precisely to prevent this known problem. That's why it's always good to check the logs as well when you think lnd is stuck. So I'm pretty sure there was nothing wrong with the backup itself. I think there's an issue for the problem of not being able to resume a wallet rescan if it's aborted. Looking for it now.
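
A rough way to check this instead of assuming lnd is stuck (commands are illustrative; RaspiBlitz runs lnd as a systemd service):

```shell
# Shows recovery_mode, recovery_finished and a progress value while the
# initial wallet-recovery rescan is running:
lncli getrecoveryinfo

# Follow the logs during the rescan:
sudo journalctl -u lnd -f

# Graceful shutdown; during the recovery rescan this returns an error
# instead of silently doing nothing:
lncli stop
```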

guggero avatar Nov 03 '22 14:11 guggero

That's why it's always good to check the logs as well when you think lnd is stuck.

I did check the logs, but there was nothing, not one line. I supposed that was because of the bug.

If you stop lnd while it's re-scanning the chain in the initial wallet-recovery phase, it won't continue on the next restart

Good to know. What I don't understand is that I did that both times. And after the restart following the lnd update, it did scan again.

SomeBTChomer avatar Nov 03 '22 14:11 SomeBTChomer

I supposed that was because of the bug.

Yes, that's my assumption as well.

And after the restart following the lnd update, it did scan again.

Do you have any logs left from that? Because my suspicion is that it did scan the latest blocks that came in, but didn't do the wallet recovery (re-scan for specific wallet addresses) anymore.

I was looking for the old issue that tracks the ability to resume an address-rescan if it is aborted. But I couldn't find it (maybe I remember incorrectly). So I'm going to create a new one. But since we were able to confirm the SCB was fine, I'm going to close this one.

guggero avatar Nov 03 '22 14:11 guggero

Do you have any logs left from that?

Unfortunately not.

Either way, I'm just happy it worked out in the end. Thanks for digging into this and helping me understand why it happened.

SomeBTChomer avatar Nov 03 '22 15:11 SomeBTChomer