
`lncli wtclient tower` uses massive amounts of memory

C-Otto opened this issue 2 years ago • 11 comments

Background

I run two lnd instances (0.15.0-beta.rc6), one of them serving as the watchtower for the other. Aside from an issue where lnd uses lots of memory probably related to the watchtower client code, I also noticed that memory usage increases even more when I issue lncli wtclient tower.

[heap profile screenshot: heap9]

Currently, lnd consumes 10.9 GByte (RES) and 47.6 GByte (VIRT). The command is still running, and memory usage keeps increasing.

This might be related to #5983. My watchtower.db is around 23 GByte (after compaction).

Your environment

  • version of lnd: 0.15.0-beta.rc6
  • which operating system (uname -a on *Nix): Linux server 5.10.0-13-amd64 #1 SMP Debian 5.10.106-1 (2022-03-17) x86_64 GNU/Linux
  • version of btcd, bitcoind, or other backend: bitcoind v23

Steps to reproduce

Run lncli wtclient tower on a node with a configured and active remote watchtower. I think I started lnd while my watchtower was down, so that data queued up in memory. I also ran lncli wtclient tower before.

Expected behaviour

The command completes within seconds, memory usage doesn't change a lot.

Actual behaviour

The command takes ages to complete, and memory usage increases by several GByte.

C-Otto avatar Jun 22 '22 11:06 C-Otto

My SSH connection died, but I think the command completed. Afterwards, memory usage dropped a lot: 3.2 GByte RES, 47.6 GByte VIRT.

[heap profile screenshot: heap10]

C-Otto avatar Jun 22 '22 11:06 C-Otto

I think this is a duplicate of https://github.com/lightningnetwork/lnd/issues/5983? Do you want to keep this one open, or the other?

Roasbeef avatar Jun 22 '22 17:06 Roasbeef

Also related to https://github.com/lightningnetwork/lnd/issues/6259

Roasbeef avatar Jun 22 '22 17:06 Roasbeef

The other issue is about a watchtower being offline. This issue happens with an online tower. I think these are related, but different enough to keep separate.

C-Otto avatar Jun 22 '22 18:06 C-Otto

I bumped in here to say: can confirm. lncli wtclient towers is taking ages right now. RAM usage went up to 37% from the usual 29%, system load is over 5 on this tiny Pi4. Something's wrong, and I'm not even rebalancing right now.

GordianLN avatar Jul 11 '22 19:07 GordianLN

Just timed lncli wtclient towers for the lolz:

real    15m3.573s
user    0m0.102s
sys     0m0.045s

GordianLN avatar Jul 13 '22 12:07 GordianLN

Even if the RPC request does not ask for session details (--include_sessions), the details are still loaded as part of ListClientSessions. Gathering this data takes a lot of memory, as shown in the heap profile.

Inside listClientSessions:

// We'll load the full client session since the client will need
// the CommittedUpdates and AckedUpdates on startup to resume
// committed updates and compute the highest known commit height
// for each channel.

For the RPC request, it might suffice to count the entries instead of loading the details into memory, i.e. return a slim version of ClientSession.

Currently, the details (Channel ID, Commit Height) are collected, spanning all sessions known for the given tower (client_db.go, getClientSessionAcks):

var backupID BackupID
err := backupID.Decode(bytes.NewReader(v))
if err != nil {
    return err
}

// Every decoded BackupID is kept in memory for the lifetime of the call.
ackedUpdates[seqNum] = backupID
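To illustrate the difference, here is a minimal, self-contained sketch of the two approaches. The map stands in for the bolt bucket, and the BackupID type is a simplified stand-in for lnd's wtdb.BackupID (the real one carries a channel ID as well); none of this is lnd's actual API.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// BackupID is a simplified stand-in for lnd's wtdb.BackupID;
// the real type carries a channel ID and a commit height.
type BackupID struct {
	CommitHeight uint64
}

// Decode reads a BackupID from r (little-endian, purely for this sketch).
func (b *BackupID) Decode(r *bytes.Reader) error {
	return binary.Read(r, binary.LittleEndian, &b.CommitHeight)
}

// loadAcks mirrors the current code path: every serialized update is
// decoded and held in memory, so usage scales with the number of acks.
func loadAcks(raw map[uint16][]byte) (map[uint16]BackupID, error) {
	acked := make(map[uint16]BackupID, len(raw))
	for seqNum, v := range raw {
		var id BackupID
		if err := id.Decode(bytes.NewReader(v)); err != nil {
			return nil, err
		}
		acked[seqNum] = id
	}
	return acked, nil
}

// countAcks sketches the slim path suggested above: when the caller did
// not ask for session details, only the number of entries is needed, so
// no value is ever decoded or retained.
func countAcks(raw map[uint16][]byte) uint64 {
	return uint64(len(raw))
}

func main() {
	// Build a few fake serialized updates.
	raw := make(map[uint16][]byte)
	for i := uint16(0); i < 3; i++ {
		buf := new(bytes.Buffer)
		binary.Write(buf, binary.LittleEndian, uint64(i+100))
		raw[i] = buf.Bytes()
	}

	acked, _ := loadAcks(raw)
	fmt.Println(len(acked), countAcks(raw)) // both report 3 updates
}
```

With a real bolt bucket the count could likewise be taken from the keys alone (or bucket statistics), skipping the per-value decode entirely.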

C-Otto avatar Sep 04 '22 16:09 C-Otto

With the fixes in #6885 I don't see a noticeable increase in RAM consumption.

Invoking time lncli wtclient towers gives:

...
            "num_sessions": 57275,
            "sessions": [
            ]
...

real	6m57.202s
user	0m0.058s
sys	0m0.015s

C-Otto avatar Sep 04 '22 19:09 C-Otto

~Can we keep just one of the three issues that track the same problem?~

  • https://github.com/lightningnetwork/lnd/issues/6660
  • https://github.com/lightningnetwork/lnd/issues/5983
  • https://github.com/lightningnetwork/lnd/issues/6886

EDIT: Issues are very likely related but not the same problem.

guggero avatar Sep 09 '22 11:09 guggero

Those are three different problems requiring three different solutions.

C-Otto avatar Sep 09 '22 11:09 C-Otto

Right, my mistake. It looked to me like at least #6886 and #6660 were caused by the acked updates being kept in memory, but according to your comment above that doesn't seem to be the case.

guggero avatar Sep 09 '22 11:09 guggero