
connectd pegging 100% CPU

Open • whitslack opened this issue 2 years ago • 8 comments

Issue and Steps to Reproduce

Running v0.11.1, lightning_connectd is utilizing 100% of a CPU core. Attaching strace to the process reveals that it is hammering pread64 calls on the gossip_store file. The reads are all very small, most exactly 12 bytes in size, and none in my sample even approached the page size. Why is connectd not memory-mapping this file for performance? Syscalls are to be minimized wherever possible.
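
For illustration, here is a minimal sketch of what a memory-mapped scan could look like. The 12-byte record header assumed here (big-endian flags, length, CRC, timestamp, behind a 1-byte file version prefix) is inferred from the strace sample above, not taken from the documented gossip_store format:

```c
#include <arpa/inet.h>   /* ntohs, ntohl */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static int walk_store(const char *path)
{
	int fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	struct stat st;
	if (fstat(fd, &st) != 0 || st.st_size < 1) {
		close(fd);
		return -1;
	}

	/* One mapping replaces thousands of tiny pread64 calls. */
	uint8_t *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	close(fd);
	if (map == MAP_FAILED)
		return -1;

	size_t size = st.st_size;
	size_t off = 1;              /* assumed 1-byte version prefix */
	while (off + 12 <= size) {
		uint16_t flags, len;
		uint32_t timestamp;
		memcpy(&flags, map + off, 2);          /* assumed: be16 flags */
		memcpy(&len, map + off + 2, 2);        /* assumed: be16 length */
		memcpy(&timestamp, map + off + 8, 4);  /* assumed: be32 timestamp */
		flags = ntohs(flags);
		len = ntohs(len);
		timestamp = ntohl(timestamp);

		if (off + 12 + len > size)
			break;               /* truncated record: stop */

		/* ...inspect/forward the payload at map + off + 12... */
		printf("record: flags=%u len=%u ts=%u\n",
		       (unsigned)flags, (unsigned)len, (unsigned)timestamp);

		off += 12 + (size_t)len;
	}

	munmap(map, size);
	return 0;
}
```

A single mmap lets the kernel fault pages in on demand, so the per-record syscall cost disappears; the trade-off is that the store is append-only and growing, so a real implementation would have to remap (or fall back to reads) past the mapped length.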

whitslack avatar Jun 01 '22 18:06 whitslack

First step is not to do dumb things, second step is to optimize dumb things :)

We scan for two reasons. First, when a peer connects, we send them any gossip we made ourselves. We do this in a naive way, by scanning the entire store. Fix: put our own gossip in a separate store file, which is something @cdecker wants anyway for sharing gossip files. This adds some gossmap complexity, however, since gossmap now needs to handle two files.

Second, when they send a gossip_timestamp_filter message, we scan the entire store to see if any match the filter they've given. But it's not unusual to send a dummy filter to say "I don't want anything from you": LND and CLN both use 0xFFFFFFFF for this, so I've optimized that.
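
As a rough sketch of that shortcut (struct and helper names invented for illustration; the fields follow the BOLT 7 gossip_timestamp_filter message, but this is not CLN's actual code): check for the dummy filter before touching the store at all.

```c
#include <stdbool.h>
#include <stdint.h>

/* gossip_timestamp_filter fields per BOLT 7; this struct and the
 * helper names are illustrative, not CLN's actual types. */
struct gossip_timestamp_filter {
	uint32_t first_timestamp;
	uint32_t timestamp_range;
};

/* LND and CLN use first_timestamp == 0xFFFFFFFF to mean "send nothing";
 * treating a zero range the same way is an extra assumption here. */
static bool filter_matches_nothing(const struct gossip_timestamp_filter *f)
{
	return f->first_timestamp == 0xFFFFFFFF || f->timestamp_range == 0;
}

static bool record_in_filter(uint32_t ts,
			     const struct gossip_timestamp_filter *f)
{
	/* Subtraction avoids overflow when first_timestamp + range wraps. */
	return ts >= f->first_timestamp &&
	       ts - f->first_timestamp < f->timestamp_range;
}
```

With a check like filter_matches_nothing() in front, a 0xFFFFFFFF filter costs one comparison instead of a full pass over the gossip_store.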

rustyrussell avatar Jun 26 '22 04:06 rustyrussell

v0.11.2 might be ever so slightly better on this issue, but lightning_connectd is still hovering around 90% CPU utilization.

Is it expected to have a constant parade of

```
INFO    ...-channeld-chan#233342: Peer connection lost
INFO    ...-chan#233342: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
```

and

```
INFO    ...-chan#153417: Peer transient failure in CHANNELD_NORMAL: Reconnected
```

in the log? I know I have a lot of peers, but the rate of link flapping still seems excessive. And how is "Reconnected" ever the explanation of a "Peer transient failure"? That seems fishy to me.

Also, I have noticed that after some time my node stops accepting incoming connections entirely. I thought it was happening due to a file descriptor mixup while running under Heaptrack, but it happened again even with no Heaptrack in the mix.

whitslack avatar Jun 27 '22 06:06 whitslack

These log messages are overzealous. I'm seeing the "Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died" mainly when peers don't respond to pings (the ping failure itself is only logged at DEBUG level, but should probably be INFO).

With 21 connected peers, I am seeing 36 of these in 24 hours (yeah, one peer is responsible for over half).

The "Reconnected" message is when they reconnect to us and we have an already-live connection, so we abandon the old one in favor of the new; I've seen 3 of these in the same period.

I haven't seen the failure to accept incoming connections! That's weird...

rustyrussell avatar Jun 29 '22 02:06 rustyrussell

> I haven't seen the failure to accept incoming connections!

I think it was due to connectd exceeding the file descriptor limit, even though my limit is set at 4096 and I have fewer than 1000 open channels. (I've lost so many channels since upgrading to 0.11.x.)
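
For reference, here is a generic snippet (not CLN code) to confirm what RLIMIT_NOFILE a process actually sees; a daemon started by systemd or a wrapper script can easily inherit a different limit than the shell reports:

```c
#include <stdio.h>
#include <sys/resource.h>

/* Print the soft and hard RLIMIT_NOFILE for the calling process.
 * (For an already-running daemon, `prlimit --pid <pid>` reports
 * the same without any code.) */
int main(void)
{
	struct rlimit rl;
	if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
		perror("getrlimit");
		return 1;
	}
	printf("soft fd limit: %llu\n", (unsigned long long)rl.rlim_cur);
	printf("hard fd limit: %llu\n", (unsigned long long)rl.rlim_max);
	return 0;
}
```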

whitslack avatar Jun 29 '22 06:06 whitslack

fd leak? That should show up in ls -l /proc/$(pidof lightning_connectd)/fd. Mine shows 59 after 5 days... I'll check with the debugger tomorrow to see if any are untracked.
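
For polling that count programmatically, a small standalone helper (generic, not part of CLN) is equivalent to the one-liner above:

```c
#include <dirent.h>
#include <stdio.h>

/* Count a process's open descriptors by listing /proc/<pid>/fd,
 * equivalent to `ls /proc/<pid>/fd | wc -l`. */
int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}

	char path[64];
	snprintf(path, sizeof(path), "/proc/%s/fd", argv[1]);

	DIR *d = opendir(path);
	if (!d) {
		perror("opendir");
		return 1;
	}

	int n = 0;
	struct dirent *e;
	while ((e = readdir(d)) != NULL)
		if (e->d_name[0] != '.')   /* skip "." and ".." */
			n++;
	closedir(d);

	printf("%s: %d open fds\n", path, n);
	return 0;
}
```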

rustyrussell avatar Jun 29 '22 06:06 rustyrussell

@rustyrussell: My lightning_connectd currently has 2341 open FDs.

whitslack avatar Jun 29 '22 20:06 whitslack

We should take the discussion of the potential FD leak over to #5353 and leave this issue for the CPU usage.

whitslack avatar Jun 29 '22 21:06 whitslack

I'm going to leave this issue open. Though it's mitigated for 0.12 in #5342, I know that CPU usage can be further significantly reduced. @whitslack is most likely to see this (the CPU usage will be on first connect), so I'm leaving it open, earmarked for the next release.

rustyrussell avatar Jul 09 '22 02:07 rustyrussell