Peer backup
Change Description
This PR implements the feature bits and messages in the peer backup proposal: https://github.com/lightning/bolts/pull/1110
Read this comment for more info: https://github.com/lightningnetwork/lnd/pull/8490#issuecomment-1967832342
This PR adds support for storing other peers' backup data. It introduces a storage layer which persists this data and resends it to peers on reconnection.
Start reviewing from here: a9388032baa72a044adc5256d2633151b8012798
Steps to Test
Interoperability test with cln: https://github.com/Chinwendu20/Miscellanous/blob/master/peerstoragetest/README.md
Pull Request Checklist
Testing
- [ ] Your PR passes all CI checks.
- [ ] Tests covering the positive and negative (error paths) are included.
- [ ] Bug fixes contain tests triggering the bug to prevent regressions.
Code Style and Documentation
- [ ] The change obeys the Code Documentation and Commenting guidelines, and lines wrap at 80.
- [ ] Commits follow the Ideal Git Commit Structure.
- [ ] Any new logging statements use an appropriate subsystem and logging level.
- [ ] Any new lncli commands have appropriate tags in the comments for the rpc in the proto file.
- [ ] There is a change description in the release notes, or `[skip ci]` in the commit message for small changes.
📝 Please see our Contribution Guidelines for further guidance.
!! This is not gofmt'ed yet and there might be formatting nits.
I have converted this to a draft PR; I would like to get initial feedback on the design before I submit a proper one.
TODO:
- Add a `prunepeerbackup` command (that would be the way we get rid of irrelevant peer backups).
- Add `restorefrompeer` to the lnd config, i.e. on start-up we attempt to restore channel backups from peers. Format: `restorefrompeer=1` means restore from peers and `0` means do not.
Unfortunately I have to Approach NACK this one. I think the purpose of this has been misunderstood. I suggest we get on a call ASAP to clear up the misunderstanding so you can proceed with better clarity on what needs to be done.
The short version here is that there are a handful of things that need to be handled:
- We need to implement a storage layer that we use to load and store data that our peer tells us to store. This should not be done via the `chanbackup` package.
That is where the swapper comes in.
- We need to implement an API that can load and store data that we choose to store with our peer on a per peer basis.
- We would like to implement a more advanced striping system on top of the basic single message blob.
- We need to think about what data needs to go into that backup layer so that we can recover in-flight htlcs when the SCB is restored
- We need to think about how we want to structure our striping in a way that is both robust to peers going offline, and is incentive aligned.
There is a lot to be done here and it won't all be done in a single PR. I would like to see the first PR only implement 1 and 2. From there we will add follow up PRs that address later bullets here. Let's just start with the minimum viable protocol implementation.
Thanks for the review and there is nothing unfortunate about it. I agree we need to get on the call and understand both our perspectives on this.
Here is what needs to happen in this project overall. We will focus on each step individually. Do not do more than the required scope for each PR. Each step should be a separate PR. We will focus on merging the earlier ones before reviewing the later ones.
Preliminaries
The peer backup protocol outlined in bolts#1110 really has two different things going on.
First, there is the service provider part of the protocol, where we accept requests to store blobs for our peers, and retrieve them later and give them back upon reconnection.
Second, there is the service consumer where our node requests that our peer store data for us.
These two parts of the protocol basically have no overlap. The only code sharing they should have is the wire message format. Everything else is going to be different.
Step 1: Allow our peers to store data with us.
I request that you reduce the scope of this PR to only handle this case. The key components we need are as follows:
- An interface that manages this data on disk. It should be a simple key-value store, making use of the `kvdb` package. The key should be our peer's public key, and the value should be the blob (see the sketch after this list).
- A message handler on the `Brontide` that handles the `peer_storage` message. This should make use of the above interface to store the data on disk.
- In the `Start` method of the `Brontide`, add logic after `loadActiveChannels` so that if there are any active channels, we yield to them `your_peer_storage` if we have one on file.
- Inside the layer that manages the disk storage, we should include a coroutine that periodically runs and garbage collects blobs that are no longer in use (what you refer to as pruning in this PR).
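For concreteness, here is a minimal sketch of what that interface and a `kvdb`-backed implementation might look like. All names here (`Storage`, `KVStore`, the bucket key) are illustrative assumptions rather than the actual design:

```go
package peerstorage

import (
	"github.com/btcsuite/btcd/btcec/v2"
	"github.com/lightningnetwork/lnd/kvdb"
)

// peerStorageBucket is a hypothetical top-level bucket holding one blob per
// peer, keyed by the peer's serialized public key.
var peerStorageBucket = []byte("peer-storage")

// Storage is an illustrative interface for persisting the blobs our peers
// ask us to store. Keeping it this small makes it easy to swap the backing
// store later.
type Storage interface {
	// Store overwrites the blob we hold for the given peer.
	Store(peer *btcec.PublicKey, blob []byte) error

	// Retrieve returns the blob we hold for the given peer, or nil if
	// we have nothing on file.
	Retrieve(peer *btcec.PublicKey) ([]byte, error)
}

// KVStore is a kvdb-backed implementation of Storage.
type KVStore struct {
	db kvdb.Backend
}

// Store persists the peer's blob, replacing any previous one.
func (k *KVStore) Store(peer *btcec.PublicKey, blob []byte) error {
	return kvdb.Update(k.db, func(tx kvdb.RwTx) error {
		bucket, err := tx.CreateTopLevelBucket(peerStorageBucket)
		if err != nil {
			return err
		}

		return bucket.Put(peer.SerializeCompressed(), blob)
	}, func() {})
}

// Retrieve fetches the peer's blob, returning nil if none is stored.
func (k *KVStore) Retrieve(peer *btcec.PublicKey) ([]byte, error) {
	var blob []byte
	err := kvdb.View(k.db, func(tx kvdb.RTx) error {
		bucket := tx.ReadBucket(peerStorageBucket)
		if bucket == nil {
			return nil
		}

		if b := bucket.Get(peer.SerializeCompressed()); b != nil {
			blob = append([]byte(nil), b...)
		}

		return nil
	}, func() {
		blob = nil
	})

	return blob, err
}
```

The garbage-collection coroutine mentioned in the last bullet would then live behind this same interface (or alongside it), so callers never need to know how or when blobs are pruned.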
Step 2: Allow our node to store data with our peer. (NOT THIS PR!)
- Add to the `Brontide` a method that takes an arbitrary blob, creates a `peer_storage` message, and sends it to the peer.
- Upon initial send we need to track it and await an initial `your_peer_storage` message. When we receive this message we will mark it as fully committed.
- If we receive a `your_peer_storage` message (without an outstanding verification step), then we will verify that blob (more on that below) and, if it passes verification, we will (only for now) attach it to the `Brontide` struct. This will change later as we design more advanced ways to make use of this API.
- The API that sends out our `peer_storage` message should verify that the requested message is sufficiently small, should append a fixed-length checksum to the end, and then encrypt it (see the sketch after this list).
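A rough sketch of that last step is below. The size limit, checksum length, key handling, and the choice of AES-GCM are all placeholder assumptions, not anything mandated by the proposal:

```go
package peerstorage

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

const (
	// maxBlobSize is an assumed cap on the plaintext we are willing to
	// ask a peer to store; the real limit would come from the spec and
	// the maximum wire message size.
	maxBlobSize = 60_000

	// checksumLen is the number of checksum bytes appended to the
	// plaintext before encryption.
	checksumLen = 4
)

// preparePeerStorageBlob checks the size of the data, appends a fixed
// length checksum, and encrypts the result so that the storing peer learns
// nothing about its contents. The key (16, 24, or 32 bytes for AES) would
// be derived from our node's seed so the blob is decryptable after
// recovery.
func preparePeerStorageBlob(data, key []byte) ([]byte, error) {
	if len(data) > maxBlobSize {
		return nil, fmt.Errorf("blob of %d bytes exceeds max of %d",
			len(data), maxBlobSize)
	}

	// Append a truncated SHA-256 checksum so we can detect corruption
	// or a wrong decryption key when the blob comes back to us.
	sum := sha256.Sum256(data)
	plaintext := append(append([]byte(nil), data...),
		sum[:checksumLen]...)

	// Encrypt with an AEAD. AES-GCM is used here purely for
	// illustration.
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	aead, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}

	nonce := make([]byte, aead.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}

	// Prepend the nonce so the blob is self-contained.
	return append(nonce, aead.Seal(nil, nonce, plaintext, nil)...), nil
}
```

With an AEAD the appended checksum is partly redundant, since the tag already authenticates the ciphertext; it is kept here only to mirror the steps listed above.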
Commentary
Notice how none of this really mentions backups yet. That's because how we choose to use this peer storage layer is a decision that happens once we have the capability. We will get there in due time, but before we can we have to make sure these parts are right. LND is mission critical software and so it's important that we make all of these changes to it methodically and not take any shortcuts.
Chiming in here to co-sign @ProofOfKeags's comment above. We need to zoom out a bit to do some more upfront planning to make sure each PR that progressively implements and integrates this feature is well scoped.
In the first phase (this PR), we just want to be able to read and write the blobs we store with the remote peers (and that they store with us) that we have/had channels open with. This can be shipped by itself, as it implements the generic protocol feature, but doesn't add any lnd-specific logic yet. He's described this phase above, and it can be broken down into two PRs as noted.
In the second phase, we'll start to actually store our lnd-specific data amongst our supporting peers. We need to deploy the first phase before this, as otherwise there'll be no nodes out there we can use to implement and test this PR in the wild. In this phase, we'll need to do some upfront design to determine how we select a peer for storage (maybe we only want high uptime peers as an example), and also how we select which SCBs to store with which peers (see the RAID striping idea in this issue).
In the third phase, we'd combine the work in the prior phases to implement peer-storage-aware SCB recovery. At this point, we're storing blobs for recovery with our peers. If we assume a user only has a seed, then we can use the channel graph to try to find our old channel peers, crawling each peer until we think we have a complete SCB picture. These blobs can then be used to bootstrap the SCB protocol as we know it today.
In a fourth later phase, we can implement the ideas on #8472 to start to store HTLC state on a best effort basis. This changes more frequently (potentially several times a second, or even minute) so we need something to aggregate the updates, upload them, and then later on sweeper awareness to start sweeping the HTLCs.
Okay thank you @Roasbeef and @ProofOfKeags
In the Start method of the Brontide add logic after loadActiveChannels that if there are any active channels, we yield to them your_peer_storage if we have one on file.
What about if we still have their blob on file but no active channels with the peer? I think we should still send it since it has not been cleared yet.
That means we would only advertise one feature bit for this PR, the one that advertises that we are storing other peers' data, right?
I would also like to understand why we would prefer using kvdb over files to store the data.
Thank you.
What about if we still have their blob on file but no active channels with the peer? I think we should still send it since it has not been cleared yet.
Yeah this is perfectly alright to include in this PR. We can provide it if we have it on file and then it can be up to us to clear that data when we deem appropriate (taking into account the 2016 block delay recommended in the spec).
I would also like to understand why we would prefer using kvdb over files to store the data.
We use kvdb for just about all of our persistence in LND. The SCB file is an exception and that exception is made due to its use context. In the case of SCB we want to partition it away from other node state so that users can easily save it to another storage medium. We do not need that as a property of this system here because users need not back up the data our peers send us here.
That said, if you do your interface design well we should be able to easily interchange between a file based and a kvdb based scheme with minimal consequences to the contact surface between this code and the rest of the codebase. 🙂
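To make that interchangeability concrete, a file-backed alternative to the earlier hypothetical `Storage` interface could be as small as the following (one file per peer, all names again illustrative):

```go
package peerstorage

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"

	"github.com/btcsuite/btcd/btcec/v2"
)

// FileStore is a file-backed alternative to the kvdb-backed store sketched
// earlier: one file per peer, named after the hex-encoded public key. As
// long as callers only depend on the Storage interface, the two can be
// swapped freely.
type FileStore struct {
	dir string
}

// path returns the on-disk location of the blob for the given peer.
func (f *FileStore) path(peer *btcec.PublicKey) string {
	return filepath.Join(f.dir,
		fmt.Sprintf("%x.blob", peer.SerializeCompressed()))
}

// Store overwrites the blob we hold for the given peer.
func (f *FileStore) Store(peer *btcec.PublicKey, blob []byte) error {
	return os.WriteFile(f.path(peer), blob, 0600)
}

// Retrieve returns the blob we hold for the given peer, or nil if we have
// nothing on file.
func (f *FileStore) Retrieve(peer *btcec.PublicKey) ([]byte, error) {
	blob, err := os.ReadFile(f.path(peer))
	if errors.Is(err, fs.ErrNotExist) {
		// Nothing on file for this peer is not an error.
		return nil, nil
	}

	return blob, err
}
```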
I do not know if we should use a garbage collector or let the node operators themselves decide when they would like to release their peer storage by issuing an lncli command. The spec only says "at least" (the 2016-block minimum), so maybe we can leave it for the node operator to decide.
Should this be made optional? Like an lnd config option indicating whether a node wants to store backups for other peers?
I have no idea why there is this error; this is actually defined.
@Chinwendu20 - so the way to tell what is going on here (with the failing CI checks) is: notice that the unit tests pass but the integration tests do not. So it will have something to do with the integration tests. Then, if you go to the thing it says is not defined, i.e. the ProtocolOptions struct, and look at the top of that file, you can see that it only compiles when the integration build flag is not set. That indicates that there should then also be one for when the integration flag is set. So if you search the code base for type ProtocolOptions struct, you will see there are actually 2 of these. The other is in protocol_integration.go -> you need to define PeerStorage there too.
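For illustration, the build-tag pattern being described looks roughly like the two files below, assuming the structs live in the lncfg package; the struct tag and comments are placeholders, and only the PeerStorage field name is taken from the discussion above.

```go
//go:build !integration

package lncfg

// ProtocolOptions holds the protocol-related config options compiled into
// regular (non-itest) builds.
type ProtocolOptions struct {
	// PeerStorage signals that this node is willing to store backup
	// blobs for its peers.
	PeerStorage bool `long:"peer-storage" description:"Store peer backup blobs."`
}
```

```go
//go:build integration

package lncfg

// ProtocolOptions is the integration-build twin (protocol_integration.go).
// If a field is added to one file but not the other, builds with the
// "integration" tag fail with an "undefined" error like the one in CI.
type ProtocolOptions struct {
	PeerStorage bool `long:"peer-storage" description:"Store peer backup blobs."`
}
```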
Thank you all for the review. I will implement your suggestions and also work to adapt the code to the new changes being proposed in the spec, which is currently under discussion. For the interop testing, I think only CLN has implemented this, and they might have to update their implementation as certain things have changed, so I guess we will do that when there is a compatible implementation. I will incorporate your suggestions, the new spec changes, and the merge conflict fixes all at once (when the spec doc has been updated to reflect the recent suggestions and requested changes).
In the meantime I will create a separate PR for the changes I made to the brontide test so that it can be merged, code health+?
In the meantime I will create a separate PR for the changes I made to the brontide test so that it can be merged, code health+?
Yeah if you wanna extract those commits out (I think there are 5 of them?) and make them into a separate PR and then rebase this one off of that, that would work. Do you know how to do that?
In the meantime I will create a separate PR for the changes I made to the brontide test so that it can be merged, code health+?
Yeah if you wanna extract those commits out (I think there are 5 of them?) and make them into a separate PR and then rebase this one off of that, that would work. Do you know how to do that?
I created a different branch and dropped the irrelevant commits. I do not know if there is a better way to do this. I have tagged you on the PR.
I created a different branch and dropped the irrelevant commits. I do not know if there is a better way to do this. I have tagged you on the PR.
You'll need to rebase this branch on top of that one and then drop the duplicate commits.
Steps to Test
Interoperability test with cln: https://github.com/Chinwendu20/Miscellanous/blob/master/peerstoragetest/README.md
I wrote code for the interop testing with cln so it can easily be reproduced. I have updated the PR description with that.
This PR is dependent on this: https://github.com/lightningnetwork/lnd/pull/8631
How would delaying storage prevent a DoS attack? I think rate limiting only makes sense if the peer is requesting a resource; then we can control how often we respond. I don't think there is anything we can do from our end to stop a peer from sending us an update. Does it wait till the previously sent one is stored? How does it know the previously sent one is stored?
How would delaying storage prevent a DoS attack? I think rate limiting only makes sense if the peer is requesting a resource; then we can control how often we respond. I don't think there is anything we can do from our end to stop a peer from sending us an update. Does it wait till the previously sent one is stored? How does it know the previously sent one is stored?
As mentioned in the comment I left, currently the peer has full control over the number of DB write transactions we start. So I just mean: keep it in memory a bit and flush on a ticker.
How does it know the previously sent one is stored?
It doesn't, and it also should not depend on us persisting each update, as mentioned in the spec. If the peer is sending two updates in quick succession, then it also should not care that we never persisted the first one, since the second will replace it anyway.
As mentioned in the comment I left, currently the peer has full control over the number of DB write transactions we start. So I just mean: keep it in memory a bit and flush on a ticker.
I think we might not be talking about the same thing? Is this the comment that you are referring to? It did not mention the above:
I think we may still want to do the rate limiting suggestion here. Since with this, the peer can DoS us pretty easily and cause us to do many DB writes one after the other.
So I suggest adding a simple subsystem with flush ticker that handles this. https://github.com/lightningnetwork/lnd/pull/8490#discussion_r1595251032
My earlier comment was responding to this and what is written in the spec:
I was trying to understand the connection between delaying storage and avoiding DoS attacks.
Yep - this is all talking about the same thing. I think we can delay storage for a bit to rate limit the peer so that they can't single-handedly cause us to keep opening write transactions.
Ahh I now realize you talked about DB writes in your previous comment.
But with this they can keep sending us as many messages as they want; we would just be reducing the frequency of the write transactions. I do not think that helps much? Isn't this what DoS is? https://www.britannica.com/technology/denial-of-service-attack
I think here we also just take DoS to mean anything that makes our node run significantly slower. So if they can cause us to constantly create write transactions, that is one way.
Doesn't have to be in this PR but we'll need to update our fuzzing harness for the new message type
I do not know if we should use terms like "rate limit" and DoS to describe this then. It does not convey the true intent. We can just say: we delay storage (or better still, batch storage?) to optimize writes. So this does not really "rate limit" or ensure that one particular peer cannot stop us from responding to the others. Optimizing writes is not just important when we have one peer sending us multiple messages, but also when we have multiple peers sending us a ton of messages.
I do not know if we should use terms like "rate limit" and DoS to describe this then. It does not convey the true intent. We can just say: we delay storage (or better still, batch storage?) to optimize writes. So this does not really "rate limit" or ensure that one particular peer cannot stop us from responding to the others. Optimizing writes is not just important when we have one peer sending us multiple messages, but also when we have multiple peers sending us a ton of messages.
Yeah the term you're looking for is "debouncing". The spec honestly shouldn't be saying anything about this. It isn't really a protocol requirement. It could say "MUST NOT send more than one update per second" which would give us license to do whatever we want in the case that they do.
So I suggest adding a simple subsystem with flush ticker that handles this.
The trouble with this approach is that it is more likely to lose state on protocol-valid messages. There's probably a way to use a cooldown ticker or something where we commit state immediately if it's been longer than a second since the last one. A naive ticker approach leaves reliability on the table. While the peer isn't supposed to depend on it, we should try to commit results immediately if possible and only resist doing so in cases of abuse.
Hello @ellemouton, please can I get your response on @ProofOfKeags' comment so that I can know how to proceed?
A naive ticker approach leaves reliability on the table
Sure yeah, wasn't being specific or prescriptive about implementation details 😊 The cool-down method falls into the umbrella I was referring to 👍 Seems like a simple enough thing to implement "just because & just in case".
I do not know if we should use terms like "rate limit" and DoS to describe this then. It does not convey the true intent.
Sure. Semantics that can be discussed on the spec PR side if you want. Here we just need to focus on the code that does the job. In my mind, still a DoS vector since a network message is mapped directly to a write transaction.
Main point:
But anyways - at the end of the day, the fact that we only allow peers who have channels with us to send us this message perhaps makes things "expensive" enough, since there is an upfront cost to the peer. If we ever do add a "peer allow-list" for storing data for peers who do not have channels with us, then I think we definitely should implement the persisting rate limit.
Is it like the same thing that you did for batch writing filters? But then we do not batch write, we just take the most recent update from a particular peer after one second? @ellemouton
The approach you're looking for is to require a channel read from a Ticker, but rather than writing every time the ticker fires, you let it buffer and sit there until a new write comes in. Every time a write comes in, you pull from the ticker (which in the vast majority of cases should be available) and then write immediately after that channel read.
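A rough Go sketch of that ticker-gated pattern is below; the struct and field names are hypothetical, not the eventual lnd implementation:

```go
package peerstorage

import "time"

// peerStorageWriter gates persistence of a peer's storage blobs on a
// ticker: if updates arrive slower than the interval, a tick is already
// buffered and the write happens immediately; under a flood of updates,
// writes are limited to one per interval and only the newest blob is kept.
type peerStorageWriter struct {
	updates  chan []byte
	quit     chan struct{}
	interval time.Duration
	persist  func(blob []byte) error
}

// run is the writer's main loop. Shutdown is signalled via quit, so the
// updates channel is never closed.
func (w *peerStorageWriter) run() {
	ticker := time.NewTicker(w.interval)
	defer ticker.Stop()

	for {
		var blob []byte

		// Wait for the next update from the peer.
		select {
		case blob = <-w.updates:
		case <-w.quit:
			return
		}

		// Pull a tick before writing. In the common case the tick is
		// already sitting in the channel, so this returns at once and
		// the blob is persisted immediately. Under abuse we block
		// here, capping persistence at one write per interval.
		select {
		case <-ticker.C:
		case <-w.quit:
			return
		}

		// Collapse any updates that arrived while we were waiting;
		// each update replaces the previous one, so only the latest
		// blob needs to hit disk.
	drain:
		for {
			select {
			case blob = <-w.updates:
			default:
				break drain
			}
		}

		if err := w.persist(blob); err != nil {
			// In real code this would be logged. Dropping the
			// blob is acceptable since the peer must not rely on
			// every individual update being persisted.
			continue
		}
	}
}
```

Because each update fully replaces the previous one, dropping intermediate blobs while gated is safe, which matches the point above that the peer should not depend on us persisting every update.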