CPU usage increases after folder scans (and hence, over time)
- Syncthing: v1.27.3, Linux (64-bit Intel/AMD)
- OS: Debian 12, Kernel 6.1.0-18-amd64
- Size: 23,075 files, 1,738 dirs, ~45.6 GiB total size, 6 folders
- Staggered versioning enabled (not sure if relevant)
- inotify watches enabled
Observation
CPU usage is low in general (< 1 % on a small Intel SoC), but keeps growing slowly over time: after more than 2 days it reaches 2 % and above, i.e. more than a factor of two higher (at idle). After a prolonged runtime of some weeks, I have also observed > 4 %.
Reproduction steps
Either wait ;-), or run many folder rescans. After slowly hitting "Rescan all folders" about 80 times and then waiting until everything had settled down, CPU usage had increased as well. So the regular hourly scan is likely what causes it.
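(For reproducing this without clicking, the rescans can also be triggered via the REST API. Below is a rough sketch in Go; the GUI address 127.0.0.1:8384 is the default, and the API key placeholder, the 80 iterations and the 2 s pause are just assumptions mirroring my manual test.)

```go
// Sketch: trigger "rescan all folders" repeatedly via Syncthing's REST API
// (POST /rest/db/scan with no folder parameter rescans every folder).
// The address, API key and loop parameters below are placeholders.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	const (
		scanURL = "http://127.0.0.1:8384/rest/db/scan" // default GUI/API address
		apiKey  = "YOUR-API-KEY"                       // from Settings -> GUI
	)
	client := &http.Client{Timeout: 30 * time.Second}
	for i := 0; i < 80; i++ {
		req, err := http.NewRequest(http.MethodPost, scanURL, nil)
		if err != nil {
			panic(err)
		}
		req.Header.Set("X-API-Key", apiKey)
		resp, err := client.Do(req)
		if err != nil {
			panic(err)
		}
		resp.Body.Close()
		fmt.Printf("rescan %d -> %s\n", i+1, resp.Status)
		time.Sleep(2 * time.Second) // "slowly", as in the manual reproduction
	}
}
```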
Further information
Initially reported as: https://forum.syncthing.net/t/slowly-increasing-cpu-usage-over-time/21639/2 (until I found a reproducer).
Note that I also observe a higher rate of syscalls such as:
```
[pid 1605] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 1605] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 1605] epoll_pwait(4, <unfinished ...>
[pid 962] <... nanosleep resumed>NULL) = 0
[pid 962] futex(0x1a2afa0, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=2996919}) = -1 ETIMEDOUT (Connection timed out)
[pid 1605] <... epoll_pwait resumed>[], 128, 13, NULL, 0) = 0
[pid 962] nanosleep({tv_sec=0, tv_nsec=10000000}, <unfinished ...>
[pid 1605] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 1605] epoll_pwait(4, <unfinished ...>
[pid 962] <... nanosleep resumed>NULL) = 0
[pid 962] futex(0x1a2afa0, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=1697811} <unfinished ...>
[pid 1605] <... epoll_pwait resumed>[], 128, 12, NULL, 0) = 0
[pid 962] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 1605] epoll_pwait(4, <unfinished ...>
[pid 962] nanosleep({tv_sec=0, tv_nsec=10000000}, <unfinished ...>
[pid 1605] <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 1605] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 1605] epoll_pwait(4, <unfinished ...>
[pid 962] <... nanosleep resumed>NULL) = 0
[pid 962] futex(0x1a2afa0, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=3958898} <unfinished ...>
[pid 1605] <... epoll_pwait resumed>[], 128, 14, NULL, 0) = 0
[pid 962] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
```
in this state, but I am not sure whether this is really related.
Take a CPU profile?
https://docs.syncthing.net/users/profiling.html
Wow, nice, kudos for this great documentation on how to grab a profile!
I gzipped them so I could upload them here on GitHub:
- Profile before, i.e. after Syncthing had been running for an hour or so at almost zero CPU load: syncthing-cpu-linux-amd64-v1.27.3-215155.pprof.gz
- Profile after slowly clicking "Rescan all folders" for 2 minutes in a row, then waiting 2 minutes for things to settle down: syncthing-cpu-linux-amd64-v1.27.3-215657.pprof.gz
Hope this helps.
Looks like it was running a database GC in the second profile. May have been an unlucky coincidence. Take another one?
Sure, no problem — to decrease chances of anything else interfering, I grabbed three in a row:
The increase seems to be in findRunnable, which could indicate a lot of goroutines (a goroutine leak somewhere). I ran Syncthing for a couple of hours with a 5 s rescan interval on one of my folders, and saw no increase in goroutines at least. (Possibly a small increase in background CPU usage, but within the margin of error, small fractions of a percent.) It could also be something as simple as memory getting fragmented and increasing GC work.
Interesting, sure, memory fragmentation due to many rescans (after long runtimes) seems quite likely.
The reason I stumbled upon this is that I am doing power consumption monitoring for the home server running Syncthing (and other things). The x86_64 machine (Intel Pentium Silver N6005) usually draws 3 W after a reboot, but was drawing >4 W after a week, approaching that value with an almost linear increase over time. Of course that is still not a lot, but a 30 % increase (which also translates into UPS runtime) was reason enough for the nerd in me to start investigating, and the power consumption data basically looked as if there was a "leak" (not in memory, but in CPU usage). Since restarting Syncthing brought it back down to 3 W right away, I came here :wink: .
Then I found the increased rate of epoll_pwait syscalls which Syncthing appears to make after prolonged runtime, but maybe that's just the way goroutines interact with the scheduler.
Do you think it is likely that my configuration triggers this? I don't have anything (obviously) special configured apart from staggered versioning, which is probably not related.
Is there a way to dump the number of goroutines at runtime? Then I could confirm that this number stays constant for me as well.
If we don't find the cause, or you think it is not fruitful to delve deeper, that is of course also perfectly fine with me — then I will likely work around this with a daily restart or decrease the interval for full scans (or both) for the system. Thanks in any case for checking and trying to reproduce on your end! (and of course for Syncthing itself, too :wink: )
You can enable the built-in profiler by running with --debug-profiler-listen 127.0.0.1:9090. That gives you a web page at http://localhost:9090/debug/pprof/ where you can see counts of goroutines and other things, and also take profiles directly with the Go profiler if so inclined, like go tool pprof http://localhost:9090/debug/pprof/goroutine, or go tool pprof http://localhost:9090/debug/pprof/profile for a CPU profile.
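If you just want to track the goroutine count over time, polling that endpoint from a small program is enough. A rough sketch (assuming the profiler listener from above; the one-minute interval is arbitrary):

```go
// Sketch: periodically print the goroutine count reported by Syncthing's
// debug profiler. The goroutine profile with ?debug=1 starts with a line
// like "goroutine profile: total 123", which is all we need here.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"time"
)

func main() {
	for {
		resp, err := http.Get("http://127.0.0.1:9090/debug/pprof/goroutine?debug=1")
		if err != nil {
			fmt.Println("fetch failed:", err)
		} else {
			if line, err := bufio.NewReader(resp.Body).ReadString('\n'); err == nil {
				fmt.Print(time.Now().Format(time.RFC3339), " ", line)
			}
			resp.Body.Close()
		}
		time.Sleep(time.Minute)
	}
}
```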
There's also the new metrics endpoint that exposes a few things, which I collect on my home server. There I can clearly corroborate the idle CPU usage increasing over time:
However it's not accompanied by an increase in goroutines or memory, so it's not a simple leak like that:
It's also not seen as an increase in threads, GC time, file system operations, or open file descriptors (I'll spare you graphs not showing an increase).
Many thanks, that's very enlightening! The built-in profiler, which I had missed until now, is really great. :heart:
I will also take a look, but I will probably not see anything new, since my Go knowledge is very basic at the moment. Still, I feel much better knowing that you are also seeing the effect in your monitored metrics, so at least it's not caused by a very exotic configuration on my end.
I can confirm that I also see CPU usage increase after a few days of uptime.
Edit: @calmh: Could you let me know which metric you used for the CPU usage? I can only see process_cpu_seconds_total in the /metrics endpoint, but no current CPU usage.
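(I assume a current usage figure has to be derived from that counter's rate, i.e. CPU seconds consumed per wall-clock second. A sketch of what I mean; the endpoint URL and the API key header are assumptions about my setup:)

```go
// Sketch: estimate current CPU usage from two samples of
// process_cpu_seconds_total on Syncthing's /metrics endpoint.
// URL and API key are placeholders.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strconv"
	"strings"
	"time"
)

func cpuSeconds() (float64, error) {
	req, err := http.NewRequest(http.MethodGet, "http://127.0.0.1:8384/metrics", nil)
	if err != nil {
		return 0, err
	}
	req.Header.Set("X-API-Key", "YOUR-API-KEY")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "process_cpu_seconds_total") {
			fields := strings.Fields(line)
			return strconv.ParseFloat(fields[len(fields)-1], 64)
		}
	}
	return 0, fmt.Errorf("process_cpu_seconds_total not found")
}

func main() {
	const window = 60 * time.Second
	before, err := cpuSeconds()
	if err != nil {
		panic(err)
	}
	time.Sleep(window)
	after, err := cpuSeconds()
	if err != nil {
		panic(err)
	}
	// Counter delta divided by elapsed wall time = fraction of one core used.
	fmt.Printf("average CPU usage over the window: %.2f %%\n", (after-before)/window.Seconds()*100)
}
```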
For completeness, Alex has also reported this in the forums: https://forum.syncthing.net/t/slowly-increasing-cpu-usage-over-time/21639/4?u=olifre In his case, he reports that this only started to show up with 1.27.3, and not with 1.27.2.
Sadly, I don't have enough monitoring history to confirm (I never saw a noticeable increase before, but my monitoring was just worse in the past).
I've run into this problem as well. A walk through the bowels of Go's timer internals reveals that there are hundreds of timers with a two-second period being processed. With the helpful hints about the scan routines, I found this path missing a timer stop:
```diff
--- a/lib/scanner/walk.go
+++ b/lib/scanner/walk.go
@@ -161,6 +161,7 @@ func (w *walker) walk(ctx context.Context) chan ScanResult {
 		}
 
 		if len(filesToHash) == 0 {
+			ticker.Stop()
 			close(finishedChan)
 			return
 		}
```
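For anyone wondering why a missing ticker.Stop() shows up as creeping CPU usage rather than as a goroutine or memory leak: before Go 1.23, a ticker that is never stopped is never garbage collected, so its runtime timer keeps firing forever. A generic sketch of the pattern (illustrative only, not Syncthing code):

```go
// Each call to leakyScan abandons a running 2 s ticker on its return path.
// The leaked timers keep firing inside the runtime, adding background work
// on every tick, while the goroutine count stays flat, matching the
// observations earlier in this issue.
package main

import (
	"fmt"
	"runtime"
	"time"
)

func leakyScan() {
	ticker := time.NewTicker(2 * time.Second)
	_ = ticker
	// BUG: returning without ticker.Stop() leaves the timer active.
}

func fixedScan() {
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop() // covers every return path, including early returns
}

func main() {
	for i := 0; i < 500; i++ {
		leakyScan()
	}
	fmt.Println("goroutines after 500 leaky scans:", runtime.NumGoroutine())
	// The 500 leaked timers now fire every 2 seconds; a CPU profile taken
	// here shows the extra runtime timer handling even though the program
	// is otherwise idle.
	time.Sleep(10 * time.Second)
	fixedScan() // the fixed variant releases its timer as soon as it returns
}
```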
@colingibbs1 That's a great find! It also matches Alex's observation in the forums, i.e. that the effect is new with 1.27.3, as that return path without a stop of the ticker was only added in 1.27.3 here: https://github.com/syncthing/syncthing/commit/2f3eacdb6c1c33650ccdd91f42e842c116200d92
Argh, let me do some coding for Syncthing for once and I add a resource leak. I'll fix it asap (while I should review stuff instead of course, but I wanna fix :) ). Thanks a lot for finding this!