lndmon postgres backend - LND v0.14.1-beta "lnd compatibility check failed"


Need help understanding what's going on with my setup or if this is a bug.

Note that I currently run lndmon for many nodes using the standard bbolt/boltdb backend. For some reason I'm getting errors when using LND with postgres.

logs:

2021-12-22 02:39:55.978 [INF] LNDMON: Starting Prometheus exporter...
2021-12-22 02:39:55.978 [INF] HTLC: Starting Htlc Monitor
2021-12-22 02:39:55.979 [INF] LNDMON: Prometheus active!
Lndmon exiting with error: GraphCollector DescribeGraph failed with: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2021-12-22 02:40:35.757 [INF] HTLC: Stopping Htlc Monitor
2021/12/22 02:40:35 Stopping Prometheus Exporter
GraphCollector DescribeGraph failed with: rpc error: code = DeadlineExceeded desc = context deadline exceeded

Sometimes the logs show only this error:

lnd compatibility check failed: unable to get info for lnd node: rpc error: code = DeadlineExceeded desc = context deadline exceeded

Dec 22 '21 02:12 miketwenty1

Sounds like the request is just timing out. lndmon uses the default RPC timeout of 30 seconds. Does it take longer than 30 seconds to call lncli getinfo on the postgres lnd?

Dec 22 '21 08:12 guggero

@guggero the response is nearly instant when I do a lncli getinfo. Let me know what else I should test.

Dec 22 '21 15:12 miketwenty1

Ah, I looked at the wrong error message. It seems DescribeGraph fails, not GetInfo. Can you check whether the error goes away if you add --caches.rpc-graph-cache-duration=5m? You might need to fill the cache initially with lncli describegraph; after that, the lndmon calls should be answered almost immediately.

Dec 22 '21 15:12 guggero

So you're recommending that I start LND with a graph cache duration of 5m instead of the default of 1m, and then run lncli describegraph after bootup to fill the cache?

I ran LND with this config and then ran lncli describegraph. If I start lndmon right afterwards, it shows up as a healthy Prometheus target, but after a while it crashes with the same error.

Something to note in terms of latency:

It took 2 minutes and 39 seconds for lncli stop to respond when I brought the node down for the cache update.
It took 1 minute and 50 seconds to run lncli describegraph after I booted with the new cache config.

Not sure if this would warrant a ticket in the lightningnetwork/lnd repo?

Dec 22 '21 16:12 miketwenty1

This is the same issue as https://github.com/lightningnetwork/lnd/issues/6107 then. The in-memory graph is exactly the same data as is served in describegraph. If it takes multiple minutes to load it on startup then it will take multiple minutes to scrape from the RPC, unless the RPC graph cache is turned on. But every time the graph cache expires, the first scrape will take that long again.

I see two ways to fix this (indirectly; the main fix will be to speed up the graph download in postgres):

Set rpc-graph-cache-duration to an effectively infinite time (e.g. 8760h, which is one year), which effectively disables updating the graph data in lndmon.
Or increase the default RPC timeout (it must be added to this struct: https://github.com/lightninglabs/lndmon/blob/master/lndmon.go#L41) and the scrape interval to something larger than the 1 minute 50 seconds it takes to load the graph.

Dec 23 '21 09:12 guggero

Why is this only happening with postgres backend?

Dec 23 '21 23:12 miketwenty1

> Why is this only happening with postgres backend?

Not sure what you mean... context deadline exceeded is Golang's way of saying "something timed out". So the error is because the DescribeGraph call takes too long with postgres.

Jan 03 '22 09:01 guggero

I can reproduce this: it happens on postgres and not on bbolt. getinfo took 2m4s to respond.

Jun 30 '22 06:06 sandipndev
