High CPU usage on system time change

Open jetomit opened this issue 1 year ago • 7 comments

Since Guix upgraded to guile-fibers 1.3.1, shepherd hangs shortly after boot on systems without an RTC. I believe the problem comes from the use of get-internal-real-time in the guile-fibers timer wheel implementation. After NTP corrects the system time, this function returns a much larger value, and the CPU load (for one core) goes to 100%.

Profiling suggests the process spends the CPU time in timer-wheel-advance!, so I imagine it is trying to tick through a five-year time diff. I tried increasing the system time manually by N days, which causes shepherd to be unresponsive (e.g. to herd status) for about N×5 seconds. I observed similar behavior with the example from guile-fibers readme.
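To illustrate why a per-tick catch-up would scale with the size of the jump, here is a toy model in Python. It is not the guile-fibers code: the function name and the 1 ms tick length are assumptions made up for the sketch.

```python
def ticks_to_advance(wheel_time, now, tick_length):
    """Count the per-tick steps a naive wheel takes to catch up to `now`.

    Hypothetical model: the wheel visits one slot per tick, so the work
    is proportional to the gap between its own time and the clock.
    """
    steps = 0
    while wheel_time + tick_length <= now:
        wheel_time += tick_length  # one slot visited per iteration
        steps += 1
    return steps


# Times are in milliseconds, with an assumed 1 ms tick.
one_second = ticks_to_advance(0, 1_000, 1)
one_hour = ticks_to_advance(0, 3_600_000, 1)
print(one_second, one_hour)  # 1000 3600000
```

With a loop like this, a clock that jumps forward by years means hundreds of millions of iterations before any timer is served, which would match the busy core described above.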

Replacing all instances of (get-internal-real-time) with (clock-gettime 1) in guile-fibers, and reconfiguring the system with the patched package, fixes this problem. I think using a monotonic clock makes sense, but there is probably a cleaner / more portable way to do it.
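As a rough illustration of the difference (in Python rather than Guile, and with a made-up SteppedClock class standing in for a wall clock that NTP steps forward), a deadline computed from a steppable clock ends up years in the past after the correction, while one computed from a monotonic clock is unaffected:

```python
import time


class SteppedClock:
    """Hypothetical stand-in for a wall clock that can be stepped forward."""

    def __init__(self, start):
        self.now = start

    def __call__(self):
        return self.now

    def step(self, seconds):
        self.now += seconds


def remaining(deadline, clock):
    """Seconds left until `deadline` according to `clock`."""
    return deadline - clock()


# Wall-clock style: a 10 s timer, then a five-year correction.
wall = SteppedClock(0.0)
wall_deadline = wall() + 10.0
wall.step(5 * 365 * 86400.0)
print(remaining(wall_deadline, wall))  # hugely negative

# Monotonic clock: the same timer is not affected by clock steps.
mono_deadline = time.monotonic() + 10.0
print(remaining(mono_deadline, time.monotonic))  # just under 10
```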

Thanks!

jetomit Jun 23 '23 22:06

Hi @jetomit!

Using CLOCK_MONOTONIC as you suggest seemed like the right choice to me so I started working on it. However, the API of (fibers timers) as well as schedule-task-at-time expect "internal time units"; changing timer-wheel to use CLOCK_MONOTONIC would affect those interfaces similarly, which is not acceptable.

Instead we should probably change timer-wheel-advance! to cope with large gaps.
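One possible shape for that, as a Python sketch only: the TimerWheel class below is a hypothetical single-level wheel with integer ticks, not the guile-fibers data structure. When the gap covers a full rotation, every stored timer is due, so the wheel can sweep each slot once instead of stepping through every missed tick.

```python
class TimerWheel:
    """Toy single-level wheel; times are integer ticks (hypothetical design)."""

    def __init__(self, slot_count):
        self.slots = [[] for _ in range(slot_count)]
        self.tick = 0           # last tick already processed
        self.slots_visited = 0  # work counter, for the demonstration only

    def add(self, deadline, callback):
        # Only deadlines within one rotation fit in this toy wheel.
        if not 0 < deadline - self.tick <= len(self.slots):
            raise ValueError("deadline outside the wheel's range")
        self.slots[deadline % len(self.slots)].append((deadline, callback))

    def advance(self, now):
        if now <= self.tick:
            return  # nothing to do; never move the wheel backwards
        n = len(self.slots)
        due = []
        if now - self.tick >= n:
            # Large gap: every stored timer is due, sweep each slot once.
            for slot in self.slots:
                due.extend(slot)
                slot.clear()
            self.slots_visited += n
        else:
            # Small gap: step tick by tick as usual.
            for t in range(self.tick + 1, now + 1):
                slot = self.slots[t % n]
                due.extend(slot)
                slot.clear()
                self.slots_visited += 1
        self.tick = now
        # Fire in deadline order so callers see timers in sequence.
        for _, callback in sorted(due, key=lambda timer: timer[0]):
            callback()


fired = []
wheel = TimerWheel(slot_count=8)
wheel.add(3, lambda: fired.append("a"))
wheel.add(5, lambda: fired.append("b"))
wheel.advance(4)            # small gap: four slots visited, fires "a"
wheel.advance(157_680_000)  # huge jump: one sweep of eight slots, fires "b"
print(fired, wheel.slots_visited)
```

The cost of the second call is bounded by the number of slots rather than by the size of the jump. A real implementation would also have to handle timers beyond one rotation, which this sketch simply rejects.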

@wingo, WDYT?

Thanks!

civodul Jul 16 '23 10:07

@jetomit Here's a proposed workaround on the Guix side: https://issues.guix.gnu.org/64966

civodul Aug 21 '23 15:08

> Here's a proposed workaround on the Guix side: https://issues.guix.gnu.org/64966

This would work for aarch64, but I also encounter this issue on armhf and x86_64 systems. This happens whenever system time is pushed forward by a significant amount (a day or more), either by ntpd or manually.

As I understand it, Guile’s internal time units depend only on the platform and are the same for all clock types. The bigger problem with using CLOCK_MONOTONIC might be that it doesn’t count the time the system spends suspended, which would probably break stuff.
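For what it's worth, Linux also has CLOCK_BOOTTIME, which behaves like CLOCK_MONOTONIC but keeps counting while the system is suspended. It has not come up in this thread and I have not checked how it could be reached from Guile, so take the Python sketch below only as an illustration of the idea; the helper name is made up.

```python
import time


def suspend_aware_seconds():
    """Read a clock that keeps counting across suspend, when one exists.

    CLOCK_BOOTTIME is Linux-specific; on other platforms this
    hypothetical helper falls back to time.monotonic(), which may
    not include suspended time.
    """
    clock_id = getattr(time, "CLOCK_BOOTTIME", None)
    if clock_id is None:
        return time.monotonic()
    return time.clock_gettime(clock_id)


first = suspend_aware_seconds()
second = suspend_aware_seconds()
print(second >= first)  # the clock never goes backwards
```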

jetomit Aug 25 '23 11:08

Another report of shepherd spinning once system time has changed: https://issues.guix.gnu.org/66684

civodul Oct 23 '23 19:10