CPU usage increases after folder scans (and hence, over time)
- Syncthing: v1.27.3, Linux (64-bit Intel/AMD)
- OS: Debian 12, Kernel 6.1.0-18-amd64
- Size: 23,075 files, 1,738 dirs, ~45.6 GiB total size, 6 folders
- Staggered versioning enabled (not sure if relevant)
- inotify watches enabled
Observation
CPU usage is low in general (< 1 % on a small Intel SoC), but keeps growing slowly over time: after more than 2 days it reaches 2 % and above, i.e. more than a factor of two higher (at idle). After a prolonged runtime of some weeks, I have also observed > 4 %.
Reproduction steps
Either wait ;-), or run many folder rescans. After slowly hitting "Rescan all folders" about 80 times and then waiting until everything had settled down, CPU usage had increased as well. So the regular hourly scan is likely what causes it.
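(For reproducing this without clicking, the rescans can also be triggered via the REST API. Below is a rough sketch in Go; the GUI address 127.0.0.1:8384 is the default, and the API key placeholder, the 80 iterations and the 2 s pause are just assumptions mirroring my manual test.)

```go
// Sketch: trigger "rescan all folders" repeatedly via Syncthing's REST API
// (POST /rest/db/scan with no folder parameter rescans every folder).
// The address, API key and loop parameters below are placeholders.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	const (
		scanURL = "http://127.0.0.1:8384/rest/db/scan" // default GUI/API address
		apiKey  = "YOUR-API-KEY"                       // from Settings -> GUI
	)
	client := &http.Client{Timeout: 30 * time.Second}
	for i := 0; i < 80; i++ {
		req, err := http.NewRequest(http.MethodPost, scanURL, nil)
		if err != nil {
			panic(err)
		}
		req.Header.Set("X-API-Key", apiKey)
		resp, err := client.Do(req)
		if err != nil {
			panic(err)
		}
		resp.Body.Close()
		fmt.Printf("rescan %d -> %s\n", i+1, resp.Status)
		time.Sleep(2 * time.Second) // "slowly", as in the manual reproduction
	}
}
```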
Further information
Initially reported as: https://forum.syncthing.net/t/slowly-increasing-cpu-usage-over-time/21639/2 (until I found a reproducer).
Note that I also observe a higher rate of syscalls such as:
```
[pid 1605] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 1605] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 1605] epoll_pwait(4, <unfinished ...>
[pid 962] <... nanosleep resumed>NULL) = 0
[pid 962] futex(0x1a2afa0, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=2996919}) = -1 ETIMEDOUT (Connection timed out)
[pid 1605] <... epoll_pwait resumed>[], 128, 13, NULL, 0) = 0
[pid 962] nanosleep({tv_sec=0, tv_nsec=10000000}, <unfinished ...>
[pid 1605] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 1605] epoll_pwait(4, <unfinished ...>
[pid 962] <... nanosleep resumed>NULL) = 0
[pid 962] futex(0x1a2afa0, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=1697811} <unfinished ...>
[pid 1605] <... epoll_pwait resumed>[], 128, 12, NULL, 0) = 0
[pid 962] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 1605] epoll_pwait(4, <unfinished ...>
[pid 962] nanosleep({tv_sec=0, tv_nsec=10000000}, <unfinished ...>
[pid 1605] <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 1605] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 1605] epoll_pwait(4, <unfinished ...>
[pid 962] <... nanosleep resumed>NULL) = 0
[pid 962] futex(0x1a2afa0, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=3958898} <unfinished ...>
[pid 1605] <... epoll_pwait resumed>[], 128, 14, NULL, 0) = 0
[pid 962] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
```
in this state, but I am not sure whether this is really related.
Take a CPU profile?
https://docs.syncthing.net/users/profiling.html
Wow, nice, kudos for this great documentation on how to grab a profile!
I gzipped them so I could upload them here on GitHub:
- Profile before, i.e. after Syncthing had been running for an hour or so at almost zero CPU load: syncthing-cpu-linux-amd64-v1.27.3-215155.pprof.gz
- Profile after slowly clicking "Rescan all folders" for 2 minutes in a row, then waiting 2 minutes for things to settle down: syncthing-cpu-linux-amd64-v1.27.3-215657.pprof.gz
Hope this helps.
Looks like it was running a database GC in the second profile. May have been an unlucky coincidence. Take another one?
Sure, no problem — to decrease chances of anything else interfering, I grabbed three in a row:
The increase seems to be in findRunnable, which could indicate a lot of goroutines (a goroutine leak somewhere). I ran Syncthing for a couple of hours with a 5 s rescan interval on one of my folders, and saw no increase in goroutines at least. (Possibly a small increase in background CPU usage, but within the margin of error, small fractions of a percent.) It could also be something as simple as memory getting fragmented and increasing GC work.
Interesting, sure, memory fragmentation due to many rescans (after long runtimes) seems quite likely.
The reason I stumbled upon this is that I am doing power consumption monitoring for the home server running Syncthing (and other things). The x86_64 machine (Intel Pentium Silver N6005) usually draws 3 W after a reboot, but was drawing >4 W after a week, approaching that value with an almost linear increase over time. Of course that is still not a lot, but a 30 % increase (which also translates into UPS runtime) was reason enough for the nerd in me to start investigating, and the power consumption data basically looked as if there was a "leak" (not in memory, but in CPU usage). Since restarting Syncthing brought it back down to 3 W right away, I came here :wink: .
Then I found the increased rate of epoll_pwait syscalls which Syncthing appears to make after prolonged runtime, but maybe that's just the way goroutines interact with the scheduler.
Do you think it is likely that my configuration triggers this? I don't have anything (obviously) special configured apart from staggered versioning, which is probably not related.
Is there a way to dump the number of goroutines at runtime? Then I could confirm that this number stays constant for me as well.
If we don't find the cause, or you think it is not fruitful to delve deeper, that is of course also perfectly fine with me — then I will likely work around this with a daily restart or decrease the interval for full scans (or both) for the system. Thanks in any case for checking and trying to reproduce on your end! (and of course for Syncthing itself, too :wink: )
You can enable the built-in profiler by running with --debug-profiler-listen 127.0.0.1:9090. That gives you a web page at http://localhost:9090/debug/pprof/ where you can see counts of goroutines and other things, and also take profiles directly with the Go profiler if so inclined, like go tool pprof http://localhost:9090/debug/pprof/goroutine, or go tool pprof http://localhost:9090/debug/pprof/profile for a CPU profile.
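If you just want to track the goroutine count over time, polling that endpoint from a small program is enough. A rough sketch (assuming the profiler listener from above; the one-minute interval is arbitrary):

```go
// Sketch: periodically print the goroutine count reported by Syncthing's
// debug profiler. The goroutine profile with ?debug=1 starts with a line
// like "goroutine profile: total 123", which is all we need here.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"time"
)

func main() {
	for {
		resp, err := http.Get("http://127.0.0.1:9090/debug/pprof/goroutine?debug=1")
		if err != nil {
			fmt.Println("fetch failed:", err)
		} else {
			if line, err := bufio.NewReader(resp.Body).ReadString('\n'); err == nil {
				fmt.Print(time.Now().Format(time.RFC3339), " ", line)
			}
			resp.Body.Close()
		}
		time.Sleep(time.Minute)
	}
}
```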
There's also the new metrics endpoint that exposes a few things, which I collect on my home server. There I can clearly corroborate the idle CPU usage increasing over time:
However it's not accompanied by an increase in goroutines or memory, so it's not a simple leak like that:
It's also not seen as an increase in threads, GC time, file system operations, or open file descriptors (I'll spare you graphs not showing an increase).
Many thanks, that's very enlightening! The built-in profiler, which I had missed until now, is really great. :heart:
I will also take a look, but I will probably not see anything new, since my Go knowledge is very basic at the moment. Still, I feel much better knowing that you are also seeing the effect in your monitored metrics, so at least it's not caused by a very exotic configuration on my end.
I can confirm that I also see CPU usage increase after a few days of uptime.
Edit: @calmh: Could you let me know which metric you used for the CPU usage? I can only see process_cpu_seconds_total in the /metrics endpoint, but no current CPU usage.
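(I assume a current usage figure has to be derived from that counter's rate, i.e. CPU seconds consumed per wall-clock second. A sketch of what I mean; the endpoint URL and the API key header are assumptions about my setup:)

```go
// Sketch: estimate current CPU usage from two samples of
// process_cpu_seconds_total on Syncthing's /metrics endpoint.
// URL and API key are placeholders.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strconv"
	"strings"
	"time"
)

func cpuSeconds() (float64, error) {
	req, err := http.NewRequest(http.MethodGet, "http://127.0.0.1:8384/metrics", nil)
	if err != nil {
		return 0, err
	}
	req.Header.Set("X-API-Key", "YOUR-API-KEY")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "process_cpu_seconds_total") {
			fields := strings.Fields(line)
			return strconv.ParseFloat(fields[len(fields)-1], 64)
		}
	}
	return 0, fmt.Errorf("process_cpu_seconds_total not found")
}

func main() {
	const window = 60 * time.Second
	before, err := cpuSeconds()
	if err != nil {
		panic(err)
	}
	time.Sleep(window)
	after, err := cpuSeconds()
	if err != nil {
		panic(err)
	}
	// Counter delta divided by elapsed wall time = fraction of one core used.
	fmt.Printf("average CPU usage over the window: %.2f %%\n", (after-before)/window.Seconds()*100)
}
```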
For completeness, Alex has also reported this in the forums: https://forum.syncthing.net/t/slowly-increasing-cpu-usage-over-time/21639/4?u=olifre In his case, he reports that this only started to show up with 1.27.3, and not with 1.27.2.
Sadly, I don't have enough monitoring history to confirm (I never saw a noticeable increase before, but my monitoring was just worse in the past).
I've run into this problem as well. A walk through the bowels of Go's timer internals reveals that there are hundreds of timers with a two-second period being processed. With the helpful hints about the scan routines, I found this path missing a timer stop:
```diff
--- a/lib/scanner/walk.go
+++ b/lib/scanner/walk.go
@@ -161,6 +161,7 @@ func (w *walker) walk(ctx context.Context) chan ScanResult {
 		}
 
 		if len(filesToHash) == 0 {
+			ticker.Stop()
 			close(finishedChan)
 			return
 		}
```
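For anyone wondering why a missing ticker.Stop() shows up as creeping CPU usage rather than as a goroutine or memory leak: before Go 1.23, a ticker that is never stopped is never garbage collected, so its runtime timer keeps firing forever. A generic sketch of the pattern (illustrative only, not Syncthing code):

```go
// Each call to leakyScan abandons a running 2 s ticker on its return path.
// The leaked timers keep firing inside the runtime, adding background work
// on every tick, while the goroutine count stays flat, matching the
// observations earlier in this issue.
package main

import (
	"fmt"
	"runtime"
	"time"
)

func leakyScan() {
	ticker := time.NewTicker(2 * time.Second)
	_ = ticker
	// BUG: returning without ticker.Stop() leaves the timer active.
}

func fixedScan() {
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop() // covers every return path, including early returns
}

func main() {
	for i := 0; i < 500; i++ {
		leakyScan()
	}
	fmt.Println("goroutines after 500 leaky scans:", runtime.NumGoroutine())
	// The 500 leaked timers now fire every 2 seconds; a CPU profile taken
	// here shows the extra runtime timer handling even though the program
	// is otherwise idle.
	time.Sleep(10 * time.Second)
	fixedScan() // the fixed variant releases its timer as soon as it returns
}
```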
@colingibbs1 That's a great find! It also matches Alex's observation in the forums, i.e. that the effect is new with 1.27.3, as that return path without a stop of the ticker was only added in 1.27.3 here: https://github.com/syncthing/syncthing/commit/2f3eacdb6c1c33650ccdd91f42e842c116200d92
Argh, let me do some coding for Syncthing for once and I add a resource leak. I'll fix it asap (while I should review stuff instead of course, but I wanna fix :) ). Thanks a lot for finding this!