systemd icon indicating copy to clipboard operation
systemd copied to clipboard

journald got stuck in fsync()

Open jendrikw opened this issue 7 months ago • 2 comments

systemd version the issue has been seen with

257.5-2

Used distribution

Arch

Linux kernel version used

6.14.4-zen1-1-zen

CPU architectures issue was seen on

x86_64

Component

journalctl, systemd-journald

Expected behaviour you didn't see

no crash :)

(I can provide the coredump if needed)

Unexpected behaviour you saw

systemd-journald dumped core

Steps to reproduce the problem

suspend, then wake

Additional program output to the terminal or log subsystem illustrating the issue

#0  __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
#1  0x00007c43992a1fda in __internal_syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0, a6=a6@entry=4294967295, nr=202) at cancellation.c:49
#2  0x00007c43992a264c in __futex_abstimed_wait_common64 (private=128, futex_word=0x7c4398b07990, expected=<optimized out>, op=<optimized out>, abstime=0x0, cancel=true) at futex-internal.c:57
#3  __futex_abstimed_wait_common (futex_word=futex_word@entry=0x7c4398b07990, expected=<optimized out>, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=128, cancel=cancel@entry=true) at futex-internal.c:87
#4  0x00007c43992a26af in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7c4398b07990, expected=<optimized out>, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=128)
    at futex-internal.c:139
#5  0x00007c43992a754c in __pthread_clockjoin_ex (threadid=136629766354624, thread_return=thread_return@entry=0x0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, block=block@entry=true) at pthread_join_common.c:108
#6  0x00007c43992a73d3 in ___pthread_join (threadid=<optimized out>, thread_return=thread_return@entry=0x0) at pthread_join.c:24
#7  0x00007c439966f01f in journal_file_set_offline_thread_join (f=f@entry=0x5fb824a10d10) at ../systemd/src/libsystemd/sd-journal/journal-file.c:191
#8  0x00007c439966f11d in journal_file_set_online (f=f@entry=0x5fb824a10d10) at ../systemd/src/libsystemd/sd-journal/journal-file.c:252
#9  0x00007c43996713f2 in journal_file_append_object (f=f@entry=0x5fb824a10d10, type=type@entry=OBJECT_DATA, size=111, ret_object=ret_object@entry=0x7fffe9a059a0, ret_offset=ret_offset@entry=0x7fffe9a059a8)
    at ../systemd/src/libsystemd/sd-journal/journal-file.c:1238
#10 0x00007c4399672fc6 in journal_file_append_data (f=f@entry=0x5fb824a10d10, data=0x7fffe9a05dd0, size=39, ret_object=ret_object@entry=0x7fffe9a05b40, ret_offset=ret_offset@entry=0x7fffe9a05b48)
    at ../systemd/src/libsystemd/sd-journal/journal-file.c:1856
#11 0x00007c4399673cd5 in journal_file_append_entry (f=f@entry=0x5fb824a10d10, ts=ts@entry=0x7fffe9a05c40, boot_id=<optimized out>, boot_id@entry=0x0, iovec=iovec@entry=0x7fffe9a05e10, n_iovec=n_iovec@entry=11, seqnum=<optimized out>, 
    seqnum_id=<optimized out>, ret_object=<optimized out>, ret_offset=<optimized out>) at ../systemd/src/libsystemd/sd-journal/journal-file.c:2581
#12 0x00005fb7e9ac9252 in server_write_to_journal (s=0x5fb824a08ee0, uid=<optimized out>, iovec=0x7fffe9a05e10, n=<optimized out>, ts=0x7fffe9a05c40, priority=<optimized out>) at ../systemd/src/journal/journald-server.c:998
#13 server_dispatch_message_real (s=0x5fb824a08ee0, iovec=0x7fffe9a05e10, n=<optimized out>, m=<optimized out>, c=<optimized out>, tv=<optimized out>, priority=<optimized out>, object_pid=<optimized out>)
    at ../systemd/src/journal/journald-server.c:1209
#14 0x00005fb7e9acbff2 in dev_kmsg_record (s=0x5fb824a08ee0, p=<optimized out>, l=<optimized out>) at ../systemd/src/journal/journald-kmsg.c:308
#15 server_read_dev_kmsg (s=0x5fb824a08ee0) at ../systemd/src/journal/journald-kmsg.c:343
#16 0x00007c439969f9e2 in source_dispatch (s=s@entry=0x5fb824a0cb10) at ../systemd/src/libsystemd/sd-event/sd-event.c:4228
#17 0x00007c439969fd1e in sd_event_dispatch (e=<optimized out>, e@entry=0x5fb824a09700) at ../systemd/src/libsystemd/sd-event/sd-event.c:4849
#18 0x00007c43996a2f91 in sd_event_run (e=<optimized out>, timeout=timeout@entry=18446744073709551615) at ../systemd/src/libsystemd/sd-event/sd-event.c:4910
#19 0x00005fb7e9abf42d in run (argc=<optimized out>, argv=<optimized out>) at ../systemd/src/journal/journald.c:113
#20 main (argc=<optimized out>, argv=<optimized out>) at ../systemd/src/journal/journald.c:134

jendrikw avatar May 03 '25 12:05 jendrikw

the stacktrace only lists one thread. there must be another that hangs in some form of fsync()...

so my educated guess is that you have some IO scheduling issue (i.e. broken disk that hangs, or a VM that got suspended or so), and we just hang forever in an fsync(), and your distribution leaves the watchdog logic for journald enabled (i.e. WatchdogSec= in the unit file). In that case, journald will eventually be auto-killed because it is not processing messages anymore.

Hence, maybe look into your dmesg, and check if there's something hinting some IO/scheduling/suspending issue at that time. you can tape over the issue by disabling the watchdog logic for journald on your system...

poettering avatar May 05 '25 13:05 poettering

Sounds to me like this issue:

  • https://github.com/systemd/systemd/issues/20284
  • https://access.redhat.com/solutions/5118401
  • https://www.suse.com/support/kb/doc/?id=000019921

 

RedHat describes this as follows:

Root Cause With the new Suspend to Idle sleep method, it's possible for many interrupts to wake the kernel, but not require the system to fully wake from suspend. If this happens the kernel's MONOTONIC clock will restart. Upon wake, when systemd sees the clock has advanced beyond the 3 minute watchdog timeout, it will kill and restart services that have the watchdog enabled.

 

I'm observing this on Debian-13. It often affects systemd-journald. Sometimes systemd-udevd and systemd-timesyncd are also affected.
And most noticeably systemd-logind is affected. And restarting systemd-logind starts a new Xorg server for the login screen (didn't test Wayland), which overrides /tmp/.X11-unix/X0 and prevents opening new programs on the already running Xorg server.

I just ran systemctl edit systemd-logind.service and added the following. And I hope this will prevent restarting the service.

[Service]
WatchdogSec=0

 

In case this is considered a kernel bug as theorized in https://github.com/systemd/systemd/issues/20284#issuecomment-887608147
In short: The thought is, that the kernel shouldn't increase CLOCK_MONOTONIC while in s2idle / s0ix.

This kernel issue seems to match that.
watchdog fires after s2idle - Lenovo X1 Yoga 3rd generation
https://bugzilla.kernel.org/show_bug.cgi?id=200595#c15

@jendrikw I'd recommend posting your problem there.

kolAflash avatar Jun 14 '25 11:06 kolAflash

Closing due to lack of response. (And this appears to be some local IO issue)

poettering avatar Oct 30 '25 00:10 poettering