peer: only send pings if the connection is idle for one minute
Based on the specs, the ping is used for keeping the liveness of the connection. We perform this check every minute, regardless of whether the connection is live or not.
This PR now changes the behavior such that we only perform this check when the connection is idle - if we are busy receiving gossip messages over the wire, we know for sure the connection is alive and can delay sending the pings.
[!IMPORTANT]
Review skipped
Auto reviews are limited to specific labels.
:label: Labels to auto review (1)
- llm-review
Please check the settings in the CodeRabbit UI or the
.coderabbit.yamlfile in this repository. To trigger a single review, invoke the@coderabbitai reviewcommand.You can disable this status message by setting the
reviews.review_statustofalsein the CodeRabbit configuration file.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
🪧 Tips
Chat
There are 3 ways to chat with CodeRabbit:
- Review comments: Directly reply to a review comment made by CodeRabbit. Example:
I pushed a fix in commit <commit_id>, please review it.Explain this complex logic.Open a follow-up GitHub issue for this discussion.
- Files and specific lines of code (under the "Files changed" tab): Tag
@coderabbitaiin a new review comment at the desired location with your query. Examples:@coderabbitai explain this code block.@coderabbitai modularize this function.
- PR comments: Tag
@coderabbitaiin a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:@coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.@coderabbitai read src/utils.ts and explain its main purpose.@coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.@coderabbitai help me debug CodeRabbit configuration file.
Support
Need help? Create a ticket on our support page for assistance with any issues or questions.
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.
CodeRabbit Commands (Invoked using PR comments)
@coderabbitai pauseto pause the reviews on a PR.@coderabbitai resumeto resume the paused reviews.@coderabbitai reviewto trigger an incremental review. This is useful when automatic reviews are disabled for the repository.@coderabbitai full reviewto do a full review from scratch and review all the files again.@coderabbitai summaryto regenerate the summary of the PR.@coderabbitai generate docstringsto generate docstrings for this PR.@coderabbitai generate sequence diagramto generate a sequence diagram of the changes in this PR.@coderabbitai resolveresolve all the CodeRabbit review comments.@coderabbitai configurationto show the current CodeRabbit configuration for the repository.@coderabbitai helpto get help.
Other keywords and placeholders
- Add
@coderabbitai ignoreanywhere in the PR description to prevent this PR from being reviewed. - Add
@coderabbitai summaryto generate the high-level summary at a specific location in the PR description. - Add
@coderabbitaianywhere in the PR title to generate the title automatically.
CodeRabbit Configuration File (.coderabbit.yaml)
- You can programmatically configure CodeRabbit by adding a
.coderabbit.yamlfile to the root of your repository. - Please see the configuration documentation for more information.
- If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation:
# yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json
Documentation and Community
- Visit our Documentation for detailed information on how to use CodeRabbit.
- Join our Discord Community to get help, request features, and share feedback.
- Follow us on X/Twitter for updates and announcements.
Prior to this change, pings would be sent every 1 minute, thus preventing this disconnect. Now it is possible that the peer sends a gossip message often enough to prevent the pings, and then gets disconnected as a result.
One idea would be to do an explicit ping check at that point before D/C in the writeHandler. If it works (pong received), we continue onwards, otherwise, we disconnect.
Here's a patch that implements that idea above: https://github.com/Roasbeef/lnd/commit/120375e720fa5f349fa87a2a9eae4263e6dbbd23
Before we would: "disconnect if we haven't written a message in 5 minutes".
Now we'll: "force a ping to be sent if we haven't written a message in 5 minutes".
Returning to @morehouse's scenario above, instead of never disconnect, we'll send a ping, then if that doesn't receive a timely response, we'll disconnect.
Really nice idea with the combination of the patch +1 on this approach.
Currently writeHandler will force a disconnect if no message has been successfully sent to the peer in 5 minutes.
Nice catch. I think the same goes for the readHandler, if the writer is busy and the ping to be sent is queued at the very bottom, it could cause the reader to time out too.
Here's a patch that implements that idea above
Thanks for the patch! In addition I think we need to have a similar check in the reader too. Tried a bit locally and realized we could avoid interfering the ping/pong logic. Basically I think we could just remove the idleTimer completely since we always fire a ping if the connection is idle, then we only need to catch read/write timeout error and optionally close the conn there. So it goes like this,
- If there's only read but no write, no ping is sent, and write won't time out.
- If there's only write but no read, no ping is sent, and read won't time out.
- When the connection is idle without any read or write, a ping/pong will be sent/received.
- if sending ping times out, we disconnect in
writeHandler. - if we haven't received a pong under 30 seconds, we disconnect, which could happen if,
- the peer doesn't respond with a pong, we disconnect in
pingHandler, or, - if reading pong times out, we disconnect in
readHandler.
- the peer doesn't respond with a pong, we disconnect in
- if sending ping times out, we disconnect in
Thus we rely solely on the ping/pong mechanism to check for liveness, and can get rid of the idleTimer.
Another approach is to share the idle timer between the reader and writer, which can make sure one busy read won't make the write time out, see the commit here https://github.com/yyforyongyu/lnd/pull/46/commits/1ab94472fb669f48fe6f6ed51074ebc01cd69853
Thus we rely solely on the ping/pong mechanism to check for liveness, and can get rid of the
idleTimer.
I really like this approach, but I'm afraid there's a DoS vulnerability hidden in here due to the way LND queues outgoing messages.
Current situation
Outgoing messages travel through a series of chans and queues before being sent via TCP to the peer:
Brontide.outgoingQueue(unbuffered)priorityMsgsorlazyMsgsinBrontide.queueHandler(unlimited list size)Brontide.sendQueue(unbuffered)
Once messages have been removed from outgoingQueue, they are buffered in either priorityMsgs or lazyMsgs, both of which have unbounded size. One by one, messages are removed from priorityMsgs then lazyMsgs and sent through the sendQueue.
Processing of messages through the sendQueue is synchronous -- writeHandler won't process a new message until the current message is successfully sent via TCP and ACKed by the peer. If the peer doesn't ACK on the TCP layer, writeHandler will retry sending for up to 5 minutes (idleTimeout) before disconnecting the peer.
Accidental DoS protection
A peer can purposely refuse to ACK on the TCP layer, thereby causing messages to back up in the outgoing priorityMsgs and lazyMsgs buffers. If this was the full story, memory use would grow unbounded and it would be trivial for any peer to DoS LND this way.
The idleTimeout provides only a tiny defense against such an attack, requiring the attacker to ACK one message every 5 minutes to avoid being disconnected. As long as messages are getting queued at a faster rate than this, memory use will still continue to increase unbounded.
However the ping-pong mechanism provides a much stronger (accidental) defense against this attack. Pings are sent regularly every ~1 minute with a random NumPongBytes specified in the message. The attacker must actually receive these pings to know the proper pong length to send in response to prevent disconnection after 30 seconds. If the attacker is refusing to ACK at the TCP layer, the ping message will never actually be sent to the attacker, and the attacker will be disconnected.
Situation after this PR
After this PR, pings will only be sent after 1 minute of no messages being sent or received successfully. So if an attacker regularly sends an arbitrary message to the victim, the attacker will never actually have to respond to a ping. That means the attacker can refuse to ACK at the TCP layer (except once every 5 minutes), causing the victim's memory use to grow unbounded.
Path forward
Really we need to have a discussion about proper outgoing message handling, including things like queue limits, dropping messages, forcing disconnection, etc. If we can mitigate potential DoS that way, then this PR would be great.
IIUC, we don't urgently need this PR since the ping timeouts have been largely mitigated by https://github.com/lightningnetwork/lnd/pull/9804. For now I think we should leave the current ping-pong mechanism as is, to protect against the DoS attack described above, so we don't hold up the 0.19 release on a more robust solution.
@morehouse Thanks for the detailed analysis! Yes unfortunately the remote peer can hold the ACK since TCP is full-duplex so they can keep sending us msgs without ACKing ours. Will put this PR on hold and revisit for 0.20, also have some ideas about how to better design this area, will create a doc to kick off the discussions.
@yyforyongyu, remember to re-request review from reviewers when ready
!lightninglabs-deploy mute