
Peer backup

Open Chinwendu20 opened this issue 1 year ago • 35 comments

Change Description

This PR implements the feature bits and messages in the peer backup proposal: https://github.com/lightning/bolts/pull/1110

Read comment for more info: https://github.com/lightningnetwork/lnd/pull/8490#issuecomment-1967832342

This PR adds support for storing other peers' backup data. It introduces a storage layer which persists this data and resends it to peers on reconnection.

Start reviewing from here: a9388032baa72a044adc5256d2633151b8012798

Steps to Test

Interoperability test with cln: https://github.com/Chinwendu20/Miscellanous/blob/master/peerstoragetest/README.md

Pull Request Checklist

Testing

  • [ ] Your PR passes all CI checks.
  • [ ] Tests covering the positive and negative (error paths) are included.
  • [ ] Bug fixes contain tests triggering the bug to prevent regressions.

Code Style and Documentation

📝 Please see our Contribution Guidelines for further guidance.

Chinwendu20 avatar Feb 19 '24 16:02 Chinwendu20

!! This is not gofmt'ed yet and there might be formatting nits.

I have converted this to a draft PR. I would like to get initial feedback on the design before I submit a proper one.

TODO:

  • Add a prunepeerbackup command (that would be how we get rid of irrelevant peer backups).
  • Add restorefrompeer to the lnd config. That is, on startup we attempt to restore channel backups from peers. Format: restorefrompeer = 1 means restore from peers, 0 means do not.

Chinwendu20 avatar Feb 25 '24 14:02 Chinwendu20

Unfortunately I have to Approach NACK this one. I think the purpose of this has been misunderstood. I suggest we get on a call ASAP to clear up the misunderstanding so you can proceed with better clarity on what needs to be done.

The short version here is that there are a handful of things that need to be handled:

  1. We need to implement a storage layer that we use to load and store data that our peer tells us to store. This should not be done via the chanbackup package.

That is where the swapper comes in.

  2. We need to implement an API that can load and store data that we choose to store with our peer on a per-peer basis.
  3. We would like to implement a more advanced striping system on top of the basic single message blob.
  4. We need to think about what data needs to go into that backup layer so that we can recover in-flight HTLCs when the SCB is restored.
  5. We need to think about how we want to structure our striping in a way that is both robust to peers going offline, and is incentive aligned.

There is a lot to be done here and it won't all be done in a single PR. I would like to see the first PR only implement 1 and 2. From there we will add follow up PRs that address later bullets here. Let's just start with the minimum viable protocol implementation.

Thanks for the review and there is nothing unfortunate about it. I agree we need to get on the call and understand both our perspectives on this.

Chinwendu20 avatar Feb 27 '24 11:02 Chinwendu20

Here is what needs to happen in this project overall. We will focus on each step individually. Do not do more than the required scope for each PR. Each step should be a separate PR. We will focus on merging the earlier ones before reviewing the later ones.

Preliminaries

This peer backup protocol outlined in bolts#1110 really has two different things going on.

First, there is the service provider part of the protocol, where we accept requests to store blobs for our peers, and retrieve them later and give them back upon reconnection.

Second, there is the service consumer part of the protocol, where our node requests that our peer store data for us.

These two parts of the protocol basically have no overlap. The only code sharing they should have is the wire message format. Everything else is going to be different.

Step 1: Allow our peers to store data with us.

I request that you reduce the scope of this PR to only handle this case. The key components we need are as follows:

  1. An interface that manages this data on disk. It should be a simple key-value store, making use of the kvdb package. The key should be our peer's public key, and the value should be the blob (see the sketch after this list).
  2. A message handler on the Brontide that handles the peer_storage message. This should make use of the above interface to store the data on disk.
  3. In the Start method of the Brontide, add logic after loadActiveChannels so that if there are any active channels, we send the peer your_peer_storage if we have one on file.
  4. Inside the layer that manages the disk storage, we should include a goroutine that periodically runs and garbage-collects blobs that are no longer in use (what you refer to as pruning in this PR).
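
A minimal Go sketch of what items 1–3 could look like. All names here (PeerStorageDB, handlePeerStorage, sendStoredBlob) are illustrative and not from the PR; the real implementation would sit behind the kvdb package:

```go
package peerstorage

// PeerStorageDB is a hypothetical key-value abstraction for the blobs
// our peers ask us to store: the key is the peer's public key and the
// value is the opaque blob carried in the peer_storage message.
type PeerStorageDB interface {
	// Store persists (or overwrites) the blob for the given peer.
	Store(peerPub [33]byte, blob []byte) error

	// Retrieve returns the blob we have on file for the peer, or nil
	// if none exists.
	Retrieve(peerPub [33]byte) ([]byte, error)

	// Delete removes the blob, e.g. once it is garbage-collected.
	Delete(peerPub [33]byte) error
}

// handlePeerStorage shows how a Brontide message handler could persist
// an incoming peer_storage blob via the interface above.
func handlePeerStorage(db PeerStorageDB, peerPub [33]byte, blob []byte) error {
	return db.Store(peerPub, blob)
}

// sendStoredBlob mirrors item 3: after loadActiveChannels, if we have a
// blob on file for the peer, send it back as your_peer_storage.
func sendStoredBlob(db PeerStorageDB, peerPub [33]byte,
	send func([]byte) error) error {

	blob, err := db.Retrieve(peerPub)
	if err != nil || blob == nil {
		return err
	}
	return send(blob)
}
```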

Step 2: Allow our node to store data with our peer. (NOT THIS PR!)

  1. Add to the Brontide a method that takes an arbitrary blob and then creates a peer_storage message and sends it to the peer.
  2. Upon initial send we need to track it and await an initial your_peer_storage message. When we receive this message we will mark it as fully committed.
  3. If we receive a your_peer_storage message (without an outstanding verification step) then we will verify that blob (more on that below) and if it passes verification, then we will (only for now) attach it to the Brontide struct. This will change later as we design more advanced ways to make use of this API.
  4. The API that sends out our peer_storage message should verify that the requested message is sufficiently small, should append a fixed-length checksum to the end, and then encrypt it (a sketch follows this list).
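
For illustration only, a sketch of item 4 assuming a SHA-256 checksum and a ChaCha20-Poly1305 AEAD; the size limit, key derivation, and function name are placeholders, not the actual lnd design:

```go
package peerstorage

import (
	"crypto/rand"
	"crypto/sha256"
	"errors"

	"golang.org/x/crypto/chacha20poly1305"
)

// maxBlobSize is a placeholder bound; the spec limits how large the
// stored blob may be.
const maxBlobSize = 65000

// prepareStorageBlob appends a fixed-length SHA-256 checksum to the
// plaintext and encrypts the result, producing the payload we would
// place in an outgoing peer_storage message. Key derivation is out of
// scope here; key must be 32 bytes for ChaCha20-Poly1305.
func prepareStorageBlob(key, plaintext []byte) ([]byte, error) {
	if len(plaintext) > maxBlobSize {
		return nil, errors.New("blob too large")
	}

	// Fixed-length checksum so we can verify integrity when the blob
	// is echoed back to us in your_peer_storage.
	sum := sha256.Sum256(plaintext)
	payload := append(append([]byte{}, plaintext...), sum[:]...)

	aead, err := chacha20poly1305.New(key)
	if err != nil {
		return nil, err
	}

	// Random nonce, prepended so the blob is self-contained.
	nonce := make([]byte, aead.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}

	return append(nonce, aead.Seal(nil, nonce, payload, nil)...), nil
}
```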

Commentary

Notice how none of this really mentions backups yet. That's because how we choose to use this peer storage layer is a decision that happens once we have the capability. We will get there in due time, but before we can we have to make sure these parts are right. LND is mission critical software and so it's important that we make all of these changes to it methodically and not take any shortcuts.

ProofOfKeags avatar Feb 27 '24 22:02 ProofOfKeags

Chiming in here to co-sign @ProofOfKeags's comment above. We need to zoom out a bit to do some more upfront planning to make sure each PR that progressively implements and integrates this feature is well scoped.

In the first phase (this PR), we just want to be able to read and write the blobs we store with the remote peer (and them for us) that we have/had channels open with. This can be shipped by itself, as it implements the generic protocol feature, but doesn't add any lnd-specific logic yet. He's described this phase above, and it can be broken down into two PRs as noted.

In the second phase, we'll start to actually store our lnd-specific data amongst our supporting peers. We need to deploy the first phase before this, as otherwise there'll be no nodes out there we can use to implement and test this PR in the wild. In this phase, we'll need to do some upfront design to determine how we select a peer for storage (maybe we only want high uptime peers as an example), and also how we select which SCBs to store with which peers (see the RAID striping idea in this issue).

In the third phase, we'd combine the work in the prior phases to implement the peer-storage-aware SCB recovery. At this point, we're storing blobs for recovery with our peers. If we assume a user only has a seed, then we can use the channel graph to try to find our old channel peers, crawling each peer until we think we have a complete SCB picture. These blobs can then be used to bootstrap the SCB protocol as we know it today.

In a fourth, later phase, we can implement the ideas in #8472 to start to store HTLC state on a best-effort basis. This state changes more frequently (potentially several times a second, or even per minute), so we need something to aggregate the updates, upload them, and then later add sweeper awareness to start sweeping the HTLCs.

Roasbeef avatar Feb 27 '24 23:02 Roasbeef

Okay thank you @Roasbeef and @ProofOfKeags

In the Start method of the Brontide add logic after loadActiveChannels that if there are any active channels, we yield to them your_peer_storage if we have one on file.

What about if we still have their blob on file but no active channels with the peer? I think we should still send it since it has not been cleared yet.

That means we would only advertise one feature bit for this PR, the one that advertises that we are storing other peers' data, right?

I would also like to understand why we would prefer using kvdb over files to store the data.

Thank you.

Chinwendu20 avatar Feb 28 '24 18:02 Chinwendu20

What about if we still have their blob on file but no active channels with the peer? I think we should still send it since it has not been cleared yet.

Yeah this is perfectly alright to include in this PR. We can provide it if we have it on file and then it can be up to us to clear that data when we deem appropriate (taking into account the 2016 block delay recommended in the spec).
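
For concreteness, a tiny sketch of that clearing check; the 2016-block figure is the spec's recommendation, while the names and the channel-close height lookup are made up:

```go
package peerstorage

// blobRetentionBlocks mirrors the spec's "at least 2016 blocks"
// recommendation for keeping a peer's blob after the last channel with
// them has closed.
const blobRetentionBlocks = 2016

// shouldPruneBlob is a hypothetical check a garbage collector could
// use: only prune once all channels with the peer are closed and the
// retention window has elapsed.
func shouldPruneBlob(hasActiveChans bool, lastCloseHeight, bestHeight uint32) bool {
	if hasActiveChans {
		return false
	}
	return bestHeight >= lastCloseHeight+blobRetentionBlocks
}
```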

I would also like to understand why we would prefer using kvdb over files to store the data.

We use kvdb for just about all of our persistence in LND. The SCB file is an exception and that exception is made due to its use context. In the case of SCB we want to partition it away from other node state so that users can easily save it to another storage medium. We do not need that as a property of this system here because users need not back up the data our peers send us here.

That said, if you do your interface design well we should be able to easily interchange between a file-based and a kvdb-based scheme with minimal consequences to the contact surface between this code and the rest of the codebase. 🙂
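
To make that point concrete, a file-backed variant of the hypothetical PeerStorageDB interface sketched earlier could be dropped in without touching callers (type and path layout made up for illustration):

```go
package peerstorage

import (
	"encoding/hex"
	"os"
	"path/filepath"
)

// FilePeerStorage is a file-backed implementation of the same
// hypothetical PeerStorageDB interface, one file per peer, showing how
// a kvdb-backed store could be swapped out behind the interface.
type FilePeerStorage struct {
	dir string
}

func (f *FilePeerStorage) path(peerPub [33]byte) string {
	return filepath.Join(f.dir, hex.EncodeToString(peerPub[:]))
}

func (f *FilePeerStorage) Store(peerPub [33]byte, blob []byte) error {
	return os.WriteFile(f.path(peerPub), blob, 0600)
}

func (f *FilePeerStorage) Retrieve(peerPub [33]byte) ([]byte, error) {
	blob, err := os.ReadFile(f.path(peerPub))
	if os.IsNotExist(err) {
		return nil, nil
	}
	return blob, err
}

func (f *FilePeerStorage) Delete(peerPub [33]byte) error {
	return os.Remove(f.path(peerPub))
}
```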

ProofOfKeags avatar Feb 28 '24 23:02 ProofOfKeags

I do not know if we should use a garbageCollector or let node operators themselves decide to release their peer storage by issuing an lncli command. The spec does say "at least", so maybe we can leave it for the node operator to decide.

Chinwendu20 avatar Feb 29 '24 17:02 Chinwendu20

Should this be made optional? Like an lnd config option indicating whether a node wants to store backups for other peers?

Chinwendu20 avatar Mar 09 '24 02:03 Chinwendu20

[screenshot of the CI error] I have no idea why there is this error; this is actually defined.

Chinwendu20 avatar Mar 16 '24 17:03 Chinwendu20

@Chinwendu20 - so the way to tell what is going on here (with the failing CI checks) is: notice that the unit tests pass but the integration tests do not. So it will have something to do with the integration tests. Then, if you go to the thing it says is not defined, i.e. the ProtocolOptions struct, and go to the top of that file, you can see that it only compiles when the integration build flag is not set. That indicates that there should then also be one for when the integration flag is set. So if you search the code base for type ProtocolOptions struct, you will see there are actually 2 of these. The other is in protocol_integration.go -> you need to define PeerStorage there too.
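
For readers unfamiliar with that pattern, it is driven by Go build tags. Roughly what the integration-build variant could look like (trimmed and illustrative, with only the new field shown):

```go
//go:build integration

// protocol_integration.go: compiled only when the "integration" build
// tag is set. A mirror of this struct lives behind //go:build
// !integration, so any new field (such as PeerStorage) has to be added
// to both files or one build variant fails to compile.

package lncfg

// ProtocolOptions here is trimmed for illustration; the real struct in
// lnd carries many more fields.
type ProtocolOptions struct {
	// PeerStorage is the new option referenced in this PR; it must be
	// defined in both build variants of the struct.
	PeerStorage bool
}
```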

ellemouton avatar Mar 17 '24 10:03 ellemouton

Thank you all for the review. I will implement your suggestions and also work to adapt the code to the new changes being proposed in the spec, which is currently under discussion. For the interop testing, I think only CLN has implemented this, and they might have to update their implementation as certain things have changed, so I guess we would do that when there is a compatible implementation. I will incorporate your suggestions and the new spec changes, and fix merge conflicts, all at once (when the spec doc has been updated to reflect the recent suggestions and requested changes).

Chinwendu20 avatar Apr 04 '24 14:04 Chinwendu20

In the meantime I would create a separate PR for the changes I made to the brontide test, so that that can be merged. Code health+?

Chinwendu20 avatar Apr 04 '24 14:04 Chinwendu20

In the meantime I would create a separate PR for the changes I made to the brontide test, so that that can be merged. Code health+?

Yeah if you wanna extract those commits out (I think there are 5 of them?) and make them into a separate PR and then rebase this one off of that, that would work. Do you know how to do that?

ProofOfKeags avatar Apr 04 '24 19:04 ProofOfKeags

In the meantime I would create a separate PR for the changes I made to the brontide test, so that that can be merged. Code health+?

Yeah if you wanna extract those commits out (I think there are 5 of them?) and make them into a separate PR and then rebase this one off of that, that would work. Do you know how to do that?

I created a different branch and dropped the irrelevant commits. I do not know if there is a better way to do this. I have tagged you on the PR.

Chinwendu20 avatar Apr 08 '24 11:04 Chinwendu20

I created a different branch and dropped the irrelevant commits. I do not know if there is a better way to do this. I have tagged you on the PR.

You'll need to rebase this branch on top of that one and then drop the duplicate commits.

ProofOfKeags avatar Apr 08 '24 18:04 ProofOfKeags

Steps to Test

Interoperability test with cln: https://github.com/Chinwendu20/Miscellanous/blob/master/peerstoragetest/README.md

I wrote code for the interop testing with cln so it can easily be reproduced. I have updated the PR description with that.

This PR is dependent on this: https://github.com/lightningnetwork/lnd/pull/8631

Chinwendu20 avatar Apr 29 '24 12:04 Chinwendu20

How would delaying storage prevent a DoS attack? I think rate limiting only makes sense if the peer is requesting a resource; then we can control how we respond over time. I don't think there is anything we can do from our end to stop a peer from sending us an update. Does it wait till the previously sent one is stored? How does it know the previously sent one is stored?

Chinwendu20 avatar May 09 '24 12:05 Chinwendu20

How would delaying storage prevent a DoS attack? I think rate limiting only makes sense if the peer is requesting a resource; then we can control how we respond over time. I don't think there is anything we can do from our end to stop a peer from sending us an update. Does it wait till the previously sent one is stored? How does it know the previously sent one is stored?

As mentioned in the comment I left, currently the peer has full control over the number of DB write transactions we start. So I just mean: keep it in memory a bit and flush on a ticker.

How does it know the previous sent one is stored?

It doesn't, and it also should not depend on us persisting each update, as mentioned in the spec. If the peer is sending two updates in quick succession, then it also should not care that we never persisted the first one, since the second will replace it anyway.

ellemouton avatar May 09 '24 12:05 ellemouton

As mentioned in the comment I left, currently the peer has full control over the number of DB write transactions we start. So I just mean: keep it in memory a bit and flush on a ticker.

I think we might not be talking about the same thing. Is this the comment that you are referring to? It did not mention the above:

I think we may still want to do the rate limiting suggestion here. Since with this, the peer can DoS us pretty easily and cause us to do many DB writes one after the other.

So I suggest adding a simple subsystem with flush ticker that handles this. https://github.com/lightningnetwork/lnd/pull/8490#discussion_r1595251032

My earlier comment was responding to this and to what is written in the spec: [screenshot of the spec excerpt] I was trying to understand the connection between delaying storage and avoiding DoS attacks.

Chinwendu20 avatar May 09 '24 13:05 Chinwendu20

Yep, this is all talking about the same thing. I think we can delay storage for a bit to rate limit the peer so that they can't single-handedly cause us to keep opening write transactions.

ellemouton avatar May 09 '24 13:05 ellemouton

Ahh I now realize you talked about DB writes in your previous comment.

But with this they can keep sending us as many messages as they want; we would just be reducing the frequency of the write transactions. I do not think that helps much? Isn't this what DoS is? https://www.britannica.com/technology/denial-of-service-attack

Chinwendu20 avatar May 09 '24 16:05 Chinwendu20

I think here we also just use DoS to mean anything that makes our node run significantly slower. So if they can cause us to constantly create write transactions, that is one way.

ellemouton avatar May 09 '24 17:05 ellemouton

Doesn't have to be in this PR, but we'll need to update our fuzzing harness for the new message type.

Crypt-iQ avatar May 09 '24 17:05 Crypt-iQ

I do not know if we should use terms like "rate limit" and DoS to describe this then. It does not convey the true intent. We can just say: we delay storage (or better still, batch storage?) to optimize writes. So this does not really "rate limit" or ensure that one particular peer cannot stop us from responding to the others. Optimizing writes is not just important when we have one peer sending us multiple messages, but maybe when we have multiple peers sending us a ton of messages?

Chinwendu20 avatar May 09 '24 17:05 Chinwendu20

I do not know if we should use terms like "rate limit" and DoS to describe this then. It does not convey the true intent. We can just say: we delay storage (or better still, batch storage?) to optimize writes. So this does not really "rate limit" or ensure that one particular peer cannot stop us from responding to the others. Optimizing writes is not just important when we have one peer sending us multiple messages, but maybe when we have multiple peers sending us a ton of messages?

Yeah the term you're looking for is "debouncing". The spec honestly shouldn't be saying anything about this. It isn't really a protocol requirement. It could say "MUST NOT send more than one update per second" which would give us license to do whatever we want in the case that they do.

So I suggest adding a simple subsystem with flush ticker that handles this.

The trouble with this approach is that it is more likely to lose state on protocol-valid messages. There's probably a way to use a cooldown ticker or something where we commit state immediately if it's been longer than a second since the last one. A naive ticker approach leaves reliability on the table. While the peer isn't supposed to depend on it, we should try to commit results immediately if possible and only resist doing so in cases of abuse.

ProofOfKeags avatar May 09 '24 19:05 ProofOfKeags

Hello @ellemouton, please can I get your response on @ProofOfKeags' comment so that I can know how to proceed?

Chinwendu20 avatar May 10 '24 05:05 Chinwendu20

A naive ticker approach leaves reliability on the table

Sure yeah, wasn't being specific or prescriptive about implementation details 😊 The cool-down method falls into the umbrella I was referring to 👍 Seems like a simple enough thing to implement "just because & just in case".

I do not know if we should use terms like "rate limit" and DOS to describe this then. It does not convey the true intent.

Sure. Semantics that can be discussed on the spec PR side if you want. Here we just need to focus on the code that does the job. In my mind, still a DoS vector since a network message is mapped directly to a write transaction.

Main point:

But anyways - at the end of the day, the fact that we only allow peers who have channels with us to send us this message perhaps makes things "expensive" enough, since there is an upfront cost to the peer. If we ever do add a "peer allow-list" for storing data for peers who do not have channels to us, then I think we definitely should implement the persisting rate limit.

ellemouton avatar May 10 '24 08:05 ellemouton

Is it like the same thing that you did for batch-writing filters? But then we do not batch write; we just take the most recent update from a particular peer after one second? @ellemouton

Chinwendu20 avatar May 10 '24 09:05 Chinwendu20

The approach you're looking for is to require a channel read from a Ticker, but rather than writing every time the ticker fires, you let the tick buffer and sit there until a new write comes in. Every time a write comes in you pull from the ticker (which in the vast majority of cases should be available) and then write immediately after that channel read.
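
A minimal Go sketch of that pattern. The names are hypothetical and a plain time.Ticker is used for brevity; lnd has its own ticker abstractions that a real implementation would use:

```go
package peerstorage

import "time"

// rateLimitedWriter persists at most one blob per cooldown interval,
// but in the common case commits immediately: a tick is usually
// already buffered in the ticker's channel, so the read below returns
// at once. Only under a flood of updates does it actually throttle.
func rateLimitedWriter(updates <-chan []byte, cooldown time.Duration,
	persist func([]byte) error) {

	ticker := time.NewTicker(cooldown)
	defer ticker.Stop()

	for blob := range updates {
		// Pull a tick before writing; this is the "channel read from
		// a Ticker" described above.
		<-ticker.C

		// A real implementation would handle persistence errors
		// (log, retry); the sketch ignores them.
		_ = persist(blob)
	}
}
```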

ProofOfKeags avatar May 10 '24 18:05 ProofOfKeags