
[bug]: garbage collect `batch-replay` bucket in `sphinxreplay.db`

Open C-Otto opened this issue 3 years ago • 28 comments

The sphinxreplay.db on my node is larger than my channel.db (2.2 GByte vs. 1.8 GByte), which seems odd. Please add a feature that reduces the size of sphinxreplay.db.

(And, while we're at it, please document what the file is actually used for: #6584)

  • lnd v0.15.4-beta
  • bitcoind v23

C-Otto avatar Nov 02 '22 20:11 C-Otto

The Sphinx replay DB is lnd's implementation of the ReplayLog defined in the sphinx package. It keeps track of onion packets to make sure they aren't used in a replay attack. Each packet is stored with its CLTV and is garbage collected as soon as the CLTV has expired. That means the DB is subject to a lot of reads and writes and therefore benefits from regular compaction.
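
Conceptually it works something like this (a simplified sketch of the idea only, not the actual decayedlog.go code; all names here are made up):

```go
// Simplified sketch of the replay-log idea described above (illustrative
// only, not the actual decayedlog.go implementation): every onion hash is
// stored with the absolute CLTV height of the incoming HTLC and removed
// again once that height has passed.
package replaysketch

import (
	"errors"
	"sync"
)

type replayLog struct {
	mu     sync.Mutex
	hashes map[[32]byte]uint32 // onion hash -> absolute CLTV expiry height
}

func newReplayLog() *replayLog {
	return &replayLog{hashes: make(map[[32]byte]uint32)}
}

// put records a hash; seeing the same hash twice means a replay.
func (l *replayLog) put(hash [32]byte, cltv uint32) error {
	l.mu.Lock()
	defer l.mu.Unlock()
	if _, ok := l.hashes[hash]; ok {
		return errors.New("replay detected")
	}
	l.hashes[hash] = cltv
	return nil
}

// gc drops every entry whose CLTV height has expired; lnd does roughly
// this for the shared-hash bucket whenever a new block arrives.
func (l *replayLog) gc(currentHeight uint32) {
	l.mu.Lock()
	defer l.mu.Unlock()
	for hash, cltv := range l.hashes {
		if cltv <= currentHeight {
			delete(l.hashes, hash)
		}
	}
}
```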

Can you try compacting it and see if it still stays as large as it currently is?

Maybe another option (after compaction) would be to take a look at it with boltbrowser (shut down lnd first, create a copy of the DB file, start lnd again, inspect the copy) and check whether there are expired hashes in there? Or maybe there's some data stored in there, in addition to the hashes themselves, that isn't properly garbage collected? I'll take a look at my DB as well to see if I can identify anything.

guggero avatar Nov 03 '22 13:11 guggero

This was the previous compaction: 2022-10-26 19:57:03.444 [INF] CHDB: DB compaction of .../sphinxreplay.db successful, 2405040128 -> 2255822848 bytes (gain=1.07x)

The one before: 2022-08-16 22:48:33.767 [INF] CHDB: DB compaction of /home/bitcoin/.lnd/data/graph/mainnet/sphinxreplay.db successful, 1969864704 -> 1900961792 bytes (gain=1.04x)

I don't think that having 2 GByte of data is OK?

C-Otto avatar Nov 03 '22 13:11 C-Otto

Yeah, you're right. That's a lot of data. And I think I found the bug. We write to the batch-replay bucket here but we apparently never delete from it. That needs garbage collection as well. Though I'm not familiar with it enough to say whether that's easy to fix or not.

guggero avatar Nov 03 '22 13:11 guggero

I'm going to change the title to indicate this is a bug.

guggero avatar Nov 03 '22 13:11 guggero

I have 612,489 pairs in batch-replay, 31,389 pairs in shared-hash.

C-Otto avatar Nov 03 '22 13:11 C-Otto

I had a look at the code and I can't see how the information persisted in batch-replay is used. The data is only read in a situation where we'd write the same data again. ~This doesn't affect the control flow, i.e. the code triggering this doesn't notice that the data already existed~ (wrong). Aside from this I don't see any read access to batch-replay.

PutBatch was added in 2018, without any test (EDIT: testing this function directly, in the sense of a unit test). The code related to batch-replay never changed, i.e. what we see today should match the original intent.

A comment shortly before the bucket put operation indicates that the persisted data can be used during recovery:

// Write the replay set under the batch identifier to the batch
// replays bucket. This can be used during recovery to test (1)
// that a particular batch was successfully processed and (2)
// recover the indexes of the adds that were rejected as
// replays.

Given the fact that this was introduced 4 years ago, we should have a better idea of how helpful/necessary this is. Sadly, I lack these insights. @guggero, as someone who lives and breathes bbolt (I think, sorry), what's your position on this?

NB: The information in "shared-hash" is used. When storing new data (Put, PutBatch), replays are detected and treated as an error. I guess this is the main reason for this code to exist.

C-Otto avatar Nov 04 '22 12:11 C-Otto

The idempotency idea is mentioned in the definition:

// batchReplayBucket is a bucket that maps batch identifiers to
// serialized ReplaySets. This is used to give idempotency in the event
// that a batch is processed more than once.

However, I don't see how idempotency helps. If the code invoking PutBatch always provides the same batch information, we do not need batch-replay persistence to have idempotency (I think).

C-Otto avatar Nov 04 '22 12:11 C-Otto

I took a closer look again as well. And to me it looks like the data persisted in batch-replay is used in PutBatch(). When storing a batch, we first check if one is already there. If it is, we return that. If not, we store the batch. So basically we use that to find out if we ever processed a batch like this before. But we never clean that up, that is quite obvious now.
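
In rough pseudocode, that check-then-store looks something like this (simplified sketch, not the real lightning-onion code; putBatchSketch and processBatch are made-up names, with processBatch standing in for the per-packet checks against shared-hash):

```go
// Sketch of the check-then-store behaviour of PutBatch (illustrative only,
// not the real lightning-onion code).
package batchsketch

import "go.etcd.io/bbolt"

// processBatch is a hypothetical stand-in for the real per-packet checks
// against the shared-hash bucket; it returns the serialized replay set.
func processBatch(tx *bbolt.Tx, packets [][]byte) []byte {
	return []byte("<serialized replay set>")
}

func putBatchSketch(db *bbolt.DB, batchID []byte, packets [][]byte) ([]byte, error) {
	var replaySet []byte
	err := db.Update(func(tx *bbolt.Tx) error {
		bkt, err := tx.CreateBucketIfNotExists([]byte("batch-replay"))
		if err != nil {
			return err
		}

		// If this batch was processed before, hand back the stored
		// replay set so a second invocation gives the same answer.
		if existing := bkt.Get(batchID); existing != nil {
			replaySet = existing
			return nil
		}

		// First time we see this batch: evaluate the packets, then
		// persist the resulting replay set under the batch ID. Note
		// that nothing ever deletes this entry again, which is the
		// bug discussed in this issue.
		replaySet = processBatch(tx, packets)
		return bkt.Put(batchID, replaySet)
	})
	return replaySet, err
}
```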

It looks like the batch ID that is used as the storage key of a batch is set here: https://github.com/lightningnetwork/lnd/blob/master/htlcswitch/link.go#L2914 which contains the short channel ID of the source channel of the HTLC: https://github.com/lightningnetwork/lnd/blob/master/channeldb/forwarding_package.go#L288

So a simplistic/naive approach at a cleanup would be to pass in all open/pending channels on startup and remove all batch IDs that are for channels that no longer exist. A more long-term solution would be to also store the max CLTV of a batch so we could clean up a batch after its on-chain expiry as well.

@Roasbeef do you perhaps have a better idea of how to clean up that bucket?

guggero avatar Nov 04 '22 12:11 guggero

What about deleting the bucket entirely? I don't see the downside of that (see draft PR #7116).

C-Otto avatar Nov 04 '22 12:11 C-Otto

Yes, I saw your PR. But I don't think we can delete the bucket, we need it to make sure a batch of packets wasn't replayed before.

guggero avatar Nov 04 '22 13:11 guggero

OK, I think I got it. If we call PutBatch with two packets a, b twice (for whatever reason) where only b is a replay, for both invocations we need to only return b as a replay. Without the bucket we'd check a and b individually and, for the second invocation, would return both as replays, as now also a was already seen (in the first invocation).

C-Otto avatar Nov 04 '22 13:11 C-Otto

How about this? https://github.com/lightningnetwork/lnd/pull/7120

Idea (a rough sketch follows below the list):

  • for old batches where we do not have any expiry information, use maximum known CLTV (from all entries in the shared-hash bucket) to initialize garbage collector
  • for new entries in batch-replay bucket also persist the maximum CLTV for that batch
  • garbage collect entries in batch-replay based on these CLTV values
  • limit "set expiry for batch-replay sets" and "garbage-collect batch-replay sets" operations to 500,000 sets per block to avoid overloading systems (Boltz has around 300 million sets)
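
Rough sketch of the GC step (illustrative only, not the actual code in the PR; that new values start with a 4-byte max CLTV is just an assumption for the sketch):

```go
// Illustrative sketch of the per-block GC step from the idea above (not
// the actual code in the PR). Assumption: new batch-replay values start
// with a 4-byte big-endian max CLTV, legacy values do not.
package gcsketch

import (
	"encoding/binary"

	"go.etcd.io/bbolt"
)

const gcLimitPerBlock = 500_000

func gcBatchReplay(tx *bbolt.Tx, currentHeight, legacyExpiry uint32) error {
	bkt := tx.Bucket([]byte("batch-replay"))
	if bkt == nil {
		return nil
	}

	// Collect expired keys first, then delete, to avoid mutating the
	// bucket while iterating over it. Stop once the per-block limit is
	// reached so one block never triggers a huge amount of work.
	var expired [][]byte
	c := bkt.Cursor()
	for k, v := c.First(); k != nil && len(expired) < gcLimitPerBlock; k, v = c.Next() {
		// Legacy entries carry no expiry info, so they all share the
		// fallback threshold derived from the shared-hash bucket.
		expiry := legacyExpiry
		if len(v) >= 4 {
			expiry = binary.BigEndian.Uint32(v[:4])
		}
		if expiry < currentHeight {
			expired = append(expired, append([]byte(nil), k...))
		}
	}

	for _, k := range expired {
		if err := bkt.Delete(k); err != nil {
			return err
		}
	}
	return nil
}
```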

C-Otto avatar Nov 06 '22 10:11 C-Otto

As migrating/GCing the old data seems rather expensive to me, would it be viable to just delete the sphinxreplay.db or empty the offending batch-replay bucket once?

C-Otto avatar Nov 06 '22 14:11 C-Otto

As migrating/GCing the old data seems rather expensive to me, would it be viable to just delete the sphinxreplay.db or empty the offending batch-replay bucket once?

I would not recommend that as I'm not 100% sure of the security implications. I guess you could stop accepting new HTLCs for the number of blocks that corresponds to your max channel's CLTV delta value, and then you'd have some confidence that a replay would fail because of wrong CLTV values. But again, I would not recommend just deleting the file.

guggero avatar Nov 07 '22 10:11 guggero

I just realized that the existing code also persists replay sets of size 0 for empty batches (without ever removing this data). Argh.

C-Otto avatar Nov 07 '22 17:11 C-Otto

Catching up with this issue a bit...

PutBatch was added in 2018, without any test. The code related to batch-replay never changed, i.e. what we see today should match the original intent.

Nope, the tests were here in the commit that moved things over: https://github.com/lightningnetwork/lightning-onion/pull/22/files#diff-fe7ee5401125cf4bc9ac1938ae87474843b150586a145838d2fd1577c12592ef.

So PutBatch isn't actually called in lnd itself. Its purpose is to allow us to do the onion decode/process for a batch of HTLCs, given each processed HTLC needs to write an entry to the replay log to ensure we don't replay the HTLC. If we replay the HTLC, then this is a privacy leak, as a party can send the HTLC thru our node multiple times (captured packets or w/e) and attempt to see where they go in the network.

The call site where it's used is here: https://github.com/lightningnetwork/lightning-onion/blob/b62f49fcadfc013d81c639e196a7f09b428a4dc1/sphinx.go#L774-L784

Today this goroutine handles cleaning up the expired hashes based on CLTV values: https://github.com/lightningnetwork/lnd/blob/master/htlcswitch/decayedlog.go#L204-L254. As mentioned above, what's missing is that we don't remove items from the batchReplayBkt. It's safe to remove items from this bucket once all the stored CLTV heights have expired.

Ultimately, I don't think there's a great way to handle the deletion other than just iterating through the batchReplayBkt itself. Similar to the old revocation migration, I think we can make this migration optional, but then from there forward, update gcExpiredHashes to also scan the replay bucket and remove information there. We can lift the resumable optional migration code from the revocation migration, since that lets the migration pick up where it left off last.

I think it should actually be safe to remove the items in the batchReplayBucket since replays of the nature we care about are detected here: https://github.com/lightningnetwork/lnd/blob/master/htlcswitch/decayedlog.go#L315-L320. Though there's a related issue about handling the error here (though this is kinda nice as it exposed some bugs in other implementations).

So to delete the items:

  • Scan thru for the ID, which is actually the ID of the incoming channel and then the commitment height:
    • If we don't have that local channel anymore, then delete the replay or collect the ID to delete later.
    • You can scan thru w/ the height prefix to delete all other offending items.
    • This covers all the closed channels.
  • Next we can scan through for open channels:
    • For each open channel, we can do a reverse seek to grab the largest height associated with that channel in the replay bucket.
    • Then we can use that height in the revocation bucket to find the corresponding block height based on the CLTV expiry of the HTLCs in the bucket.
    • Mark for deletion if it's below the current known height.

During the scan, if an empty batch is found (shouldn't be the case any longer since we don't allow empty commitments), then it can be removed.

Then we'd follow up w/ what @C-Otto mentioned above re also storing the height alongside the entry, so it can be safely deleted.
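
For illustration, the closed-channel part of that scan could look roughly like this (sketch only; the key layout of 8-byte incoming short channel ID followed by the height follows the links above, everything else, including the helper name, is made up):

```go
// Rough sketch of the closed-channel part of the scan described above
// (illustrative only). Assumption: each batch-replay key starts with the
// 8-byte short channel ID of the incoming channel, followed by the height.
package migrationsketch

import (
	"encoding/binary"

	"go.etcd.io/bbolt"
)

func deleteClosedChannelBatches(tx *bbolt.Tx, openChannels map[uint64]struct{}) error {
	bkt := tx.Bucket([]byte("batch-replay"))
	if bkt == nil {
		return nil
	}

	// Collect first, delete afterwards, so we don't mutate the bucket
	// while iterating over it.
	var stale [][]byte
	err := bkt.ForEach(func(k, _ []byte) error {
		if len(k) < 8 {
			return nil
		}
		scid := binary.BigEndian.Uint64(k[:8])

		// If we no longer have this channel, its forwarding packages
		// are gone as well and the batch can never be queried again.
		if _, stillOpen := openChannels[scid]; !stillOpen {
			stale = append(stale, append([]byte(nil), k...))
		}
		return nil
	})
	if err != nil {
		return err
	}

	for _, k := range stale {
		if err := bkt.Delete(k); err != nil {
			return err
		}
	}
	return nil
}
```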

Roasbeef avatar Nov 10 '22 01:11 Roasbeef

Thanks for looking into this!

Regarding the tests: There wasn't any test when the code was added, which is what I stated. But yeah, some tests were added after the fact, that's something. However, there still isn't a single test related to the PutBatch feature, which is my main criticism.

Regarding empty batches: It happens with 0.15.4, quite a lot. I see several PutBatch invocations for empty batches, which naturally don't have any replay. I don't know where/how those are forbidden, but whatever that code does, it doesn't do it right (or it's not in 0.15.4, yet).

Regarding closed channels: From an architectural point of view, I'd rather not carry this information into DecayedLog, that seems very intrusive. I also think it doesn't belong here, as the ID is meant to be generic, and it just happens to be related to the channel ID. The DecayedLog doesn't know this itself, and as such it's the responsibility of code coming up with this ID to also check for IDs that can be removed. However, that's my "clean code OOP Java" inner self talking, which doesn't know how things are done in the Go world.

Regarding the batch GC: I don't see the advantage of computing different heights for old data of different channels. By not making this distinction, we can simplify the code (see my draft PR) and handle open & closed channels in the same way. With 300 million old entries in some databases, I don't think a few blocks/days make a noticeable difference. Aside from that, if you don't store the deletion height for the old entries, for a very active channel we'd never delete old entries (as the maximum CLTV would increase before the current block height reaches the old maximum). By just taking one maximum value and using it as the deletion threshold for all old entries, we also don't need to think about empty batches. These are deleted along the way (but, of course, we should not store new empty batches, see my draft PR).

C-Otto avatar Nov 10 '22 07:11 C-Otto

Regarding the tests:

If you look at this commit, tests are indeed there. tx.ProcessOnionPacket calls PutBatch. Sure, you can argue there should be a lower-level test there, but we're talking about a commit that's 5 years old. If you wish, you can certainly add a lower-level test here yourself (just grasping at straws here tbh).

Regarding empty batches:

Looks like the issue is that when you revoke, you don't always have new HTLCs to forward; however, this is called unconditionally: https://github.com/lightningnetwork/lnd/blob/f6ae63fd7b53e366f50a96216fb056dd8fb8f4a7/htlcswitch/link.go#L1990-L2049

So that should either just exit early in the func, or only be called if the number of adds is non-zero. We can then add an assertion (that we never try to write an empty batch) to catch regressions in the future.
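
Something along these lines (sketch only, not an actual diff against link.go; the function shape is made up):

```go
// Sketch of the early-exit guard (not an actual diff against link.go; the
// function shape here is made up for illustration).
package linksketch

func processRemoteAddsSketch(adds [][]byte) {
	// No new HTLCs in this revocation: nothing to decode, so we never
	// reach PutBatch and never write an empty replay set.
	if len(adds) == 0 {
		return
	}

	// ... decode the hop iterators (which calls PutBatch under the
	// hood) and forward the HTLCs ...
}
```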

Roasbeef avatar Nov 12 '22 00:11 Roasbeef

tx.ProcessOnionPacket calls PutBatch

Ah, yes, thank you for clarifying the root of this confusion! I was looking for tests invoking PutBatch directly. Others might be happy with the existing tests, I'm not - which is my opinion, based on my experiences in other areas of software engineering (I believe a developer used to writing more specific tests would have found the "empty batch" and "no GC" issues). I edited my initial criticism accordingly and, aside from the resulting confusion and misunderstanding, I think it still stands.

C-Otto avatar Nov 12 '22 13:11 C-Otto

Are you able to use something like boltbrowser and scan through your batchReplayBkt and see how many of the entries are empty (0-length values) and then how many are non-empty and report back?

Crypt-iQ avatar Nov 15 '22 21:11 Crypt-iQ

I was able to tweak boltbrowser to count for me (and I verified this manually):

All 69 million entries in batch-replay have an empty value. Note that I applied the attached draft-PR's fixes a while ago, so that no more empty batches are added. This PR also GCs newer batches from the bucket.
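
For anyone who wants to check their own node, a standalone snippet along these lines should produce the same counts (sketch only, not the boltbrowser tweak I used; run it against a copy of the DB while lnd is stopped):

```go
// Counts empty vs. non-empty values in the batch-replay bucket of a copy
// of sphinxreplay.db (sketch only; run against a copy while lnd is down).
package main

import (
	"fmt"
	"log"

	"go.etcd.io/bbolt"
)

func main() {
	db, err := bbolt.Open("sphinxreplay.db", 0600, &bbolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var empty, nonEmpty int
	err = db.View(func(tx *bbolt.Tx) error {
		bkt := tx.Bucket([]byte("batch-replay"))
		if bkt == nil {
			return fmt.Errorf("batch-replay bucket not found")
		}
		return bkt.ForEach(func(k, v []byte) error {
			if len(v) == 0 {
				empty++
			} else {
				nonEmpty++
			}
			return nil
		})
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("empty=%d non-empty=%d\n", empty, nonEmpty)
}
```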

C-Otto avatar Nov 16 '22 11:11 C-Otto

Since replays are unlikely, I don't think they need to be gc'd. A better fix is to add a migration that deletes empty batches and then add the bugfix to not persist empty batches anymore.

Crypt-iQ avatar Nov 16 '22 12:11 Crypt-iQ

Are you sure this is correct? The replay-set bucket only stores the replays for a given batch/set. The set itself might be non-empty. If it does not contain a replay, we need to remember this. Otherwise, if we process the same set/batch again, we'd encounter a replay, as the individual onions have been persisted in the shared-hash bucket. I think this is what @guggero and I discussed above.

C-Otto avatar Nov 16 '22 12:11 C-Otto

you're right it needs to be persisted

Crypt-iQ avatar Nov 16 '22 13:11 Crypt-iQ

Background on replays and how LND handles them:

  • What is a replay and why does it matter?

    • A replay is where the same onion packet traverses the same outgoing path again. We don't want to relay these packets in case somebody is attempting to determine the path of the HTLC.
    • We can receive a replayed packet on a different channel than the original one it was received from.
  • When do replays no longer matter in lnd?

    • When the CLTV of the incoming HTLC has expired. This is because lnd stores the incoming timeout in the sharedHashes bucket.
  • Lifetime of a batch:

    • A batch stores the set of replays that have occurred in a particular forwarding package. The key is the ID of the forwarding package. The value is the set of replays that occurred (possibly empty).
    • A batch is stored with DecayedLog.PutBatch. This happens when DecodeHopIterators is called in processRemoteAdds. This may or may not be the first time the function has run.
    • If DecodeHopIterators is called and the batch already exists, the set of replays are fetched and returned. The main purpose of the batch is so that if the link runs DecodeHopIterators again, the set of replays for that forwarding package / batch can be retrieved. This allows the link to fail these HTLCs backwards.
    • The above point means that the batch only matters for the lifetime of the corresponding forwarding package. After that, DecodeHopIterators will never run for the batch ID.
  • Forwarding packages are removed when:

    • The link starts up and the state is FwdStateCompleted in resolveFwdPkgs. This might not occur, depending on whether other links are up and able to receive the add/settle/fail.
    • Periodically during link operation in fwdPkgGarbager, if in state FwdStateCompleted. This might not occur, depending on whether other links are up and able to receive the add/settle/fail.
    • When the channel is closed via Packager.Wipe().

The issue can be solved with a migration and with new non-migration code:

  • Migration:
    • For closed channels, delete all batches when the channel doesn't exist
    • For open channels, delete batches where the forwarding package doesn't exist
  • New code:
    • For closed channels, delete all batches for the channel in the same database transaction. This happens in CloseChannel
    • For open channels, delete the batch when the forwarding package is removed. This should be done in the same database transaction.

One thing to note about the above approach is that it's possible that the batches linger until the channel is closed due to forwarding packages not getting removed. However, there is nothing we can do about this because the link may call DecodeHopIterators again and the batch must exist or else things will break.

Crypt-iQ avatar Dec 07 '22 16:12 Crypt-iQ

Is this planned anytime soon? My sphinxreplay.db reached 1 GB, nearly 30% of my channel.db size.

BhaagBoseDK avatar Aug 20 '23 10:08 BhaagBoseDK

Is this planned anytime soon? My sphinxreplay.db reached 1 GB, nearly 30% of my channel.db size.

Is this planned anytime soon? The sphinxreplay.db is now 1.6 GB, having grown by 600 MB in 5 months.

BhaagBoseDK avatar Dec 03 '23 17:12 BhaagBoseDK

Any news or merged PRs to reduce the size of the sphinxreplay database by deleting old entries? My sphinxreplay.db is almost 3 GB. Compacting has minimal effects.

M1ch43lV avatar Apr 15 '24 11:04 M1ch43lV