graphdb: add caching for isPublicNode query
Fixes #10337, continues #10356.
Change Description
From the issue description:
> CPU profiling reveals that IsPublicNode queries consume 62.36% of total CPU time (21.77s out of 34.91s) during gossip message processing. This represents a critical performance bottleneck in LND's gossip subsystem that significantly impacts node synchronization and network message processing throughput.
In this PR, we add caching to the IsPublicNode method in the SQLStore. Since IsPublicNode is called for every node announcement received via the Lightning Network gossip protocol, adding a cache significantly reduces database overhead and accelerates gossip message processing.
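A minimal sketch of that cache-first lookup, in Go. The map-plus-mutex stands in for the LRU cache the PR actually uses, and `isPublicNodeFromDB` is a hypothetical helper wrapping the existing SQL query; the real identifiers may differ:

```go
import "sync"

// SQLStore is trimmed to the fields this sketch needs; the real
// store wraps an LRU cache rather than a plain map.
type SQLStore struct {
	cacheMu         sync.RWMutex
	publicNodeCache map[[33]byte]bool
}

// IsPublicNode reports whether the node is public, consulting the
// cache before falling back to the database.
func (s *SQLStore) IsPublicNode(pubKey [33]byte) (bool, error) {
	// Fast path: serve repeated gossip lookups from memory.
	s.cacheMu.RLock()
	isPublic, ok := s.publicNodeCache[pubKey]
	s.cacheMu.RUnlock()
	if ok {
		return isPublic, nil
	}

	// Slow path: run the existing SQL query (hypothetical helper).
	isPublic, err := s.isPublicNodeFromDB(pubKey)
	if err != nil {
		return false, err
	}

	// Remember the answer for subsequent announcements.
	s.cacheMu.Lock()
	s.publicNodeCache[pubKey] = isPublic
	s.cacheMu.Unlock()

	return isPublic, nil
}
```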
Steps to Test
```shell
go test -tags=test_db_sqlite -run TestNodeIsPublic  # ensure we still return the correct status for nodes
go test -tags=test_db_sqlite -run TestNodeIsPublicCacheInvalidation -v  # test that we're correctly invalidating the cache
```
Summary of Changes
Hello @Abdulkbk, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a caching mechanism for the IsPublicNode query, which was identified as a major CPU bottleneck during Lightning Network gossip message processing. By caching the public status of nodes, the system can avoid repeated database queries, thereby significantly improving performance, reducing database overhead, and accelerating the overall gossip subsystem. The changes include the cache implementation, robust invalidation logic, and new configuration options.
Highlights
- Performance Improvement: Implemented an LRU cache for the `IsPublicNode` query within the `SQLStore` to significantly reduce CPU consumption during gossip message processing, addressing a critical performance bottleneck.
- Cache Integration: The `IsPublicNode` method now checks the cache first, retrieving the public status of a node if available, or querying the database and storing the result for future use.
- Cache Invalidation: Comprehensive cache invalidation logic has been added to ensure data consistency. The `publicNodeCache` is cleared for affected nodes whenever channel edges are added, deleted, or marked as zombie, when nodes are deleted, and during graph pruning and block disconnections.
- Configuration Options: Introduced a new `DefaultPublicNodeCacheSize` constant and a `PublicNodeCacheSize` field in `StoreOptions` to allow configuration of the cache's capacity, defaulting to 15,000 entries (a rough sketch follows this list).
- New Test Coverage: A new test, `TestNodeIsPublicCacheInvalidation`, was added to thoroughly verify that the cache is correctly invalidated across various graph modification operations.
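The summary names `DefaultPublicNodeCacheSize` and `PublicNodeCacheSize` but not their exact shape; the sketch below follows LND's usual functional-options style and is an assumption, not the PR's actual code:

```go
// DefaultPublicNodeCacheSize is the default capacity of the
// public-node cache (15,000 entries per the summary above).
const DefaultPublicNodeCacheSize = 15000

// StoreOptions holds configurable options for the SQLStore; only
// the field relevant here is shown.
type StoreOptions struct {
	// PublicNodeCacheSize caps how many node public-status
	// entries are kept in memory before the LRU starts evicting.
	PublicNodeCacheSize int
}

// WithPublicNodeCacheSize is a hypothetical functional option that
// overrides the default capacity.
func WithPublicNodeCacheSize(n int) func(*StoreOptions) {
	return func(o *StoreOptions) {
		o.PublicNodeCacheSize = n
	}
}
```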
@abdulkbk, remember to re-request review from reviewers when ready
I cherry-picked f9078e552a9a1652d432408f96dd472f8ceedef1 from https://github.com/lightningnetwork/lnd/pull/10356, which adds a benchmark for the IsPublicNode call, and below is how the two variants compare:
```shell
go test -tags=test_db_sqlite -bench=BenchmarkIsPublicNode -v
```
| Scenario | Iterations (b.N) | Time per op (ns/op) |
|---|---|---|
| Cache | 10422774 | 109.9 |
| No cache | 390 | 2991256 (~3 ms per op) |
The difference is very significant with the cache (~27,000x faster per op).
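For reference, a Go benchmark for this call has roughly the shape below; `newTestSQLStore` and `testNodePubKey` are hypothetical helpers, not the ones from the cherry-picked commit:

```go
// BenchmarkIsPublicNode times repeated lookups of a single node's
// public status, the hot path during gossip processing.
func BenchmarkIsPublicNode(b *testing.B) {
	store := newTestSQLStore(b)  // hypothetical test fixture
	pubKey := testNodePubKey(b)  // hypothetical helper

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := store.IsPublicNode(pubKey); err != nil {
			b.Fatal(err)
		}
	}
}
```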
I notice the itests are all failing. Let me check to see why...
Can you add a release note entry for 20.1?
I've reconsidered the cache design. I think the main flood of node announcements we see comes from public nodes on the network; private node announcements should only come from direct peers with private channels to us. So the current approach of invalidating the cache on every channel close causes unnecessary churn.
Proposed approach: Public-nodes-only cache
The cache only stores public nodes. Once a node is in the cache, it's considered public (until LRU eviction or restart).
Rationale:
- Private nodes are rare and limited to direct peers, so caching them adds little value
- Once a node's pubkey is public via gossip, that information has already propagated through the network
- Brief DB inconsistency (node closes all channels but stays cached as public) is acceptable given Lightning's gossip semantics
Flow (a minimal Go sketch follows this list):
- Cache hit → return public
- Cache miss → query DB
  - If public → add to cache
  - If private → don't cache (will query the DB next time)
- On node deletion → optionally remove from cache (not strictly necessary)
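A minimal sketch of this flow, assuming a set-like cache in which membership alone means "public" (the real implementation would use an LRU to bound memory; eviction is elided here):

```go
// isPublicNode sketches the public-nodes-only flow: a cache hit
// means "public" by construction, only positive results are stored,
// and there is no invalidation path at all.
func isPublicNode(cache map[[33]byte]struct{}, pubKey [33]byte,
	queryDB func([33]byte) (bool, error)) (bool, error) {

	// Cache hit → public, no DB round trip.
	if _, ok := cache[pubKey]; ok {
		return true, nil
	}

	// Cache miss → ask the database.
	isPublic, err := queryDB(pubKey)
	if err != nil {
		return false, err
	}

	// Only cache public nodes; a private node simply queries the
	// DB again on its next (rare) lookup.
	if isPublic {
		cache[pubKey] = struct{}{}
	}

	return isPublic, nil
}
```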
Benefits:
- Eliminates cache churn for active public nodes (the 99% case)
- No write-path overhead (no invalidation logic needed)
- Simpler implementation
- LRU naturally handles cleanup, and restart clears everything
The minor tradeoff (stale "public" status for nodes that close all channels) is acceptable since the gossip information has already leaked and the network treats these nodes as known.
And also look at this: https://github.com/lightningnetwork/lnd/blob/91423ee5198e116b210f7f1c2e8b506d5ab28266/discovery/gossiper.go#L1726
In LND we don't even send periodic node announcements if we do not have public channels, which is another argument for not caching private nodes.
> The cache only stores public nodes. Once a node is in the cache, it's considered public (until LRU eviction or restart).
This makes sense. I will drop the cache invalidation and only cache public nodes.
I noticed that the unit tests for the postgres and sqlite backends are failing. This is expected, because TestNodeIsPublic assumes that a node can transition from public back to private after its channels are removed, which is no longer the case due to the two points highlighted by @ziggie1984:
> Once a node's pubkey is public via gossip, that information has already propagated through the network.
> Brief DB inconsistency (node closes all channels but stays cached as public) is acceptable given Lightning's gossip semantics.
So I will update the test to reflect this behavior.
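Roughly, the test's expectation flips from "private again after channel removal" to "still public"; a sketch with illustrative names (`store`, `alicePubKey`), not the actual test code:

```go
// The node is public while it has public channels.
isPublic, err := store.IsPublicNode(alicePubKey)
require.NoError(t, err)
require.True(t, isPublic)

// ... remove all of Alice's channels via the store's delete APIs ...

// The test previously required false here. Under the public-nodes-
// only cache the node stays "public" until LRU eviction or restart,
// so the expectation becomes true.
isPublic, err = store.IsPublicNode(alicePubKey)
require.NoError(t, err)
require.True(t, isPublic)
```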