
graphdb: add caching for isPublicNode query

Abdulkbk opened this pull request 1 month ago • 2 comments

Fixes #10337; continues #10356.

Change Description

From the issue description:

CPU profiling reveals that IsPublicNode queries consume 62.36% of total CPU time (21.77s out of 34.91s) during gossip message processing. This represents a critical performance bottleneck in LND's gossip subsystem that significantly impacts node synchronization and network message processing throughput.

In this PR, we add caching to the IsPublicNode method in the SQLStore. Since IsPublicNode is called for every node announcement received over the Lightning Network gossip protocol, caching its result significantly reduces database overhead and accelerates gossip message processing.
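The cache-first lookup pattern described here can be sketched as follows. This is a minimal, self-contained illustration, not the actual lnd code: the type and function names (`lruCache`, `isPublicNode`, `queryDB`) are hypothetical, and the real implementation lives inside the SQLStore and uses lnd's own LRU package.

```go
package main

import (
	"container/list"
	"fmt"
)

// pubKey stands in for a node's 33-byte compressed public key.
type pubKey [33]byte

// lruCache is a tiny LRU mapping pubkeys to their "is public" status.
type lruCache struct {
	cap   int
	order *list.List               // front = most recently used
	items map[pubKey]*list.Element // element value is *entry
}

type entry struct {
	key    pubKey
	public bool
}

func newLRU(cap int) *lruCache {
	return &lruCache{
		cap:   cap,
		order: list.New(),
		items: make(map[pubKey]*list.Element),
	}
}

// get returns the cached status and whether it was present,
// refreshing the entry's recency on a hit.
func (c *lruCache) get(k pubKey) (bool, bool) {
	el, ok := c.items[k]
	if !ok {
		return false, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).public, true
}

// put stores a node's status, evicting the least recently used
// entry when the cache is full.
func (c *lruCache) put(k pubKey, public bool) {
	if el, ok := c.items[k]; ok {
		el.Value.(*entry).public = public
		c.order.MoveToFront(el)
		return
	}
	if c.order.Len() >= c.cap {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
	c.items[k] = c.order.PushFront(&entry{key: k, public: public})
}

// invalidate drops a node's cached status, e.g. after one of its
// channel edges is added, deleted, pruned, or marked as zombie.
func (c *lruCache) invalidate(k pubKey) {
	if el, ok := c.items[k]; ok {
		c.order.Remove(el)
		delete(c.items, k)
	}
}

// isPublicNode checks the cache first and only queries the DB on a miss.
func isPublicNode(c *lruCache, k pubKey, queryDB func(pubKey) bool) bool {
	if public, ok := c.get(k); ok {
		return public
	}
	public := queryDB(k)
	c.put(k, public)
	return public
}

func main() {
	c := newLRU(100)
	dbCalls := 0
	query := func(pubKey) bool { dbCalls++; return true }

	k := pubKey{0x02}
	isPublicNode(c, k, query) // miss: hits the DB
	isPublicNode(c, k, query) // hit: served from cache
	fmt.Println(dbCalls)      // 1
}
```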

Steps to Test

Steps for reviewers to follow to test the change.

go test -tags=test_db_sqlite -run TestNodeIsPublic # to ensure we still return the correct status for nodes.

go test -tags=test_db_sqlite -run TestNodeIsPublicCacheInvalidation -v # to test we're correctly invalidating cache

Abdulkbk avatar Nov 12 '25 15:11 Abdulkbk

Summary of Changes

Hello @Abdulkbk, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a caching mechanism for the IsPublicNode query, which was identified as a major CPU bottleneck during Lightning Network gossip message processing. By caching the public status of nodes, the system can avoid repeated database queries, thereby significantly improving performance, reducing database overhead, and accelerating the overall gossip subsystem. The changes include the cache implementation, robust invalidation logic, and new configuration options.

Highlights

  • Performance Improvement: Implemented an LRU cache for the IsPublicNode query within the SQLStore to significantly reduce CPU consumption during gossip message processing, addressing a critical performance bottleneck.
  • Cache Integration: The IsPublicNode method now checks the cache first, retrieving the public status of a node if available, or querying the database and storing the result for future use.
  • Cache Invalidation: Comprehensive cache invalidation logic has been added to ensure data consistency. The publicNodeCache is cleared for affected nodes whenever channel edges are added, deleted, marked as zombie, or when nodes are deleted, or during graph pruning and block disconnections.
  • Configuration Options: Introduced a new DefaultPublicNodeCacheSize constant and a PublicNodeCacheSize field in StoreOptions to allow configuration of the cache's capacity, defaulting to 15,000 entries.
  • New Test Coverage: A new test, TestNodeIsPublicCacheInvalidation, was added to thoroughly verify that the cache is correctly invalidated across various graph modification operations.
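The configuration knobs named above (DefaultPublicNodeCacheSize, StoreOptions.PublicNodeCacheSize, default 15,000) could be wired up roughly like this. The constant and field names come from this PR's description; the functional-option plumbing (`StoreOption`, `WithPublicNodeCacheSize`, `newStoreOptions`) is an illustrative guess at the pattern, not lnd's actual code.

```go
package main

import "fmt"

// DefaultPublicNodeCacheSize mirrors the default mentioned in this PR.
const DefaultPublicNodeCacheSize = 15_000

// StoreOptions holds the SQLStore configuration relevant here.
type StoreOptions struct {
	// PublicNodeCacheSize is the capacity of the IsPublicNode cache.
	PublicNodeCacheSize int
}

// StoreOption follows the common Go functional-option pattern.
type StoreOption func(*StoreOptions)

// WithPublicNodeCacheSize overrides the cache capacity.
func WithPublicNodeCacheSize(n int) StoreOption {
	return func(o *StoreOptions) { o.PublicNodeCacheSize = n }
}

// newStoreOptions applies overrides on top of the defaults.
func newStoreOptions(opts ...StoreOption) *StoreOptions {
	o := &StoreOptions{PublicNodeCacheSize: DefaultPublicNodeCacheSize}
	for _, opt := range opts {
		opt(o)
	}
	return o
}

func main() {
	fmt.Println(newStoreOptions().PublicNodeCacheSize)                             // 15000
	fmt.Println(newStoreOptions(WithPublicNodeCacheSize(500)).PublicNodeCacheSize) // 500
}
```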

gemini-code-assist[bot] avatar Nov 12 '25 15:11 gemini-code-assist[bot]

@abdulkbk, remember to re-request review from reviewers when ready

lightninglabs-deploy avatar Dec 03 '25 17:12 lightninglabs-deploy

I cherry-picked f9078e552a9a1652d432408f96dd472f8ceedef1 from https://github.com/lightningnetwork/lnd/pull/10356, which adds the benchmark for the IsPublicNode call; below is how the two paths compare:

go test -tags=test_db_sqlite -bench=BenchmarkIsPublicNode -v

| Scenario | Iterations (b.N) | Time per op |
| --- | --- | --- |
| Cache | 10422774 | 109.9 ns/op |
| No cache | 390 | 2991256 ns/op (~3 ms per op) |

The difference is dramatic: the cached path is roughly 27,000x faster (109.9 ns vs ~3 ms per op).
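For readers who want to reproduce this kind of comparison outside of go test, Go's testing package exposes testing.Benchmark for running a benchmark function programmatically. The snippet below is a self-contained stand-in (a map lookup in place of the cache); the real benchmark cherry-picked from #10356 exercises the actual store.

```go
package main

import (
	"fmt"
	"testing"
)

// cache is a stand-in for the public-node cache; the real benchmark
// measures the store's IsPublicNode itself.
var cache = map[[33]byte]bool{{0x02}: true}

func isPublicCached(k [33]byte) bool { return cache[k] }

func main() {
	// testing.Benchmark runs the closure with increasing b.N until
	// the timing is stable, just like `go test -bench` would.
	res := testing.Benchmark(func(b *testing.B) {
		k := [33]byte{0x02}
		for i := 0; i < b.N; i++ {
			_ = isPublicCached(k)
		}
	})
	// NsPerOp here reflects a bare map lookup; compare with the
	// ~2991256 ns/op measured for the uncached DB path.
	fmt.Printf("%d ns/op over %d iterations\n", res.NsPerOp(), res.N)
}
```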

Abdulkbk avatar Dec 15 '25 16:12 Abdulkbk

I notice the itests are all failing. Let me check to see why...

Abdulkbk avatar Dec 15 '25 16:12 Abdulkbk

can you add a release note entry for 20.1

ziggie1984 avatar Dec 15 '25 18:12 ziggie1984

I've reconsidered the cache design because I think the main flood of node announcements we see comes from public nodes on the network. Private node announcements should only arrive from direct peers with private channels to us, so the current approach of invalidating the cache on every channel close causes unnecessary churn.

Proposed approach: Public-nodes-only cache

The cache only stores public nodes. Once a node is in the cache, it's considered public (until LRU eviction or restart).

Rationale:

  • Private nodes are rare and limited to direct peers, so caching them adds little value
  • Once a node's pubkey is public via gossip, that information has already propagated through the network
  • Brief DB inconsistency (node closes all channels but stays cached as public) is acceptable given Lightning's gossip semantics

Flow:

  1. Cache hit → return public
  2. Cache miss → query DB:
     • If public → add to cache
     • If private → don't cache (will query the DB next time)
  3. On node deletion → optionally remove from cache (not strictly necessary)
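The flow above can be sketched as follows (illustrative names only; the set-like cache is simplified to a plain map, with LRU eviction omitted for brevity):

```go
package main

import "fmt"

// publicSet is a stand-in for the LRU set of known-public nodes.
type publicSet map[[33]byte]struct{}

// isPublicNode implements the public-only flow: a cache hit means
// public; only public DB results are cached, private ones are not.
func isPublicNode(cache publicSet, k [33]byte, queryDB func([33]byte) bool) bool {
	if _, ok := cache[k]; ok {
		return true // 1. cache hit → public
	}
	public := queryDB(k) // 2. cache miss → query DB
	if public {
		cache[k] = struct{}{} // public → add to cache
	}
	// private → not cached; the next call queries the DB again
	return public
}

func main() {
	cache := make(publicSet)
	dbCalls := 0
	query := func(k [33]byte) bool {
		dbCalls++
		return k[0] == 0x02 // pretend 0x02-prefixed keys are public
	}

	pub, priv := [33]byte{0x02}, [33]byte{0x03}
	isPublicNode(cache, pub, query)  // miss → DB → cached
	isPublicNode(cache, pub, query)  // hit, no DB call
	isPublicNode(cache, priv, query) // miss → DB, never cached
	isPublicNode(cache, priv, query) // miss again → DB again
	fmt.Println(dbCalls)             // 3
}
```

Note there is no invalidation path at all: staleness is bounded only by LRU eviction and restarts, which is exactly the tradeoff argued for above.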

Benefits:

  • Eliminates cache churn for active public nodes (the 99% case)
  • No write-path overhead (no invalidation logic needed)
  • Simpler implementation
  • LRU naturally handles cleanup, and restart clears everything

The minor tradeoff (stale "public" status for nodes that close all channels) is acceptable since the gossip information has already leaked and the network treats these nodes as known.

ziggie1984 avatar Dec 17 '25 07:12 ziggie1984

And also look at this: https://github.com/lightningnetwork/lnd/blob/91423ee5198e116b210f7f1c2e8b506d5ab28266/discovery/gossiper.go#L1726

In LND we don't even send periodic node announcements when we have no public channels, which is another argument for not caching private nodes.

ziggie1984 avatar Dec 17 '25 07:12 ziggie1984

The cache only stores public nodes. Once a node is in the cache, it's considered public (until LRU eviction or restart).

This makes sense. I will drop the cache invalidation and only cache public nodes.

Abdulkbk avatar Dec 17 '25 13:12 Abdulkbk

I noticed that the unit tests for the postgres and sqlite backends are failing. This is expected, because TestNodeIsPublic assumes that a node can transition from public to private after its channels are removed, which is no longer the case because of the two points highlighted by @ziggie1984:

  • Once a node's pubkey is public via gossip, that information has already propagated through the network.
  • Brief DB inconsistency (node closes all channels but stays cached as public) is acceptable given Lightning's gossip semantics.

So I will update the test to reflect this behavior.

Abdulkbk avatar Dec 17 '25 16:12 Abdulkbk