pstack on the lxcfs process shows 1000+ threads
We have found that commands like top/ps/free run very slowly inside our containers. Attaching pstack to the lxcfs process shows 1000+ threads, and attaching strace to top/ps/free shows that a futex stays locked forever.
pstack info

futex locked info:
futex(0x556b5a3a0508, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x556b5a3a0508, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x556b5a3a0508, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x556b5a3a0508, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x556b5a3a0508, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x556b5a3a0508, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x556b5a3a0508, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x556b5a3a0508, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x556b5a3a0508, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x556b5a3a0508, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x556b5a3a0508, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x556b5a3a0508, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x556b5a3a0508, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
Could someone help, please?
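(For anyone else debugging this: the following is just an illustration I wrote, not lxcfs code. A glibc mutex that one thread holds and never releases makes every waiter show up under strace -f with exactly this futex(..., FUTEX_WAIT_PRIVATE, ...) pattern, which is consistent with a lock inside lxcfs never being released.)

/* Illustration only, not lxcfs code: one thread holds a mutex forever,
 * so the waiter blocks in the kernel and `strace -f` shows it sitting in
 * futex(<addr>, FUTEX_WAIT_PRIVATE, ...), the same pattern as above.
 * Build with: gcc -pthread demo.c -o demo */
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *waiter(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&lock);   /* never acquired: blocks in FUTEX_WAIT_PRIVATE */
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_mutex_lock(&lock);   /* hold the lock and never release it */
    pthread_create(&t, NULL, waiter, NULL);
    pthread_join(&t, NULL);      /* never returns */
    return 0;
}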
Hm... is this a deadlock in libfuse2? Can you possibly compile against libfuse3 and see whether you can reproduce?
Hello @brauner
I'm seeing a situation where an OpenStack management node has panicked due to a hung task, with all FUSE tasks waiting in fuse_simple_request()->request_wait_answer().
There are more than 24 thousand lxcfs threads in the vmcore, and they all seem to be waiting in futex_wait.
crash> ps | grep lxcfs | wc -l
24387
crash> ps | grep lxd | wc -l
78
crash>
crash> bt
PID: 1177107 TASK: ffff9aee2f1da000 CPU: 7 COMMAND: "df"
#0 [ffffb5b65e5e7a70] __schedule at ffffffffb536d0ed
#1 [ffffb5b65e5e7af8] schedule at ffffffffb536d719
#2 [ffffb5b65e5e7b18] request_wait_answer at ffffffffb4afd906
#3 [ffffb5b65e5e7b78] fuse_simple_request at ffffffffb4b0005c
#4 [ffffb5b65e5e7bb0] fuse_do_getattr at ffffffffb4b01087
#5 [ffffb5b65e5e7cd8] fuse_getattr at ffffffffb4b02be9
#6 [ffffb5b65e5e7d10] vfs_getattr_nosec at ffffffffb498920a
#7 [ffffb5b65e5e7d20] vfs_getattr at ffffffffb4989267
#8 [ffffb5b65e5e7d50] vfs_statx at ffffffffb4989309
#9 [ffffb5b65e5e7da8] __do_sys_newstat at ffffffffb498989e
#10 [ffffb5b65e5e7e58] __x64_sys_newstat at ffffffffb49898f6
#11 [ffffb5b65e5e7e68] do_syscall_64 at ffffffffb535f259
#12 [ffffb5b65e5e7e88] exit_to_user_mode_prepare at ffffffffb476a1bc
#13 [ffffb5b65e5e7ec8] syscall_exit_to_user_mode at ffffffffb5363957
#14 [ffffb5b65e5e7ee0] do_syscall_64 at ffffffffb535f269
#15 [ffffb5b65e5e7f10] irqentry_exit_to_user_mode at ffffffffb53639e9
#16 [ffffb5b65e5e7f20] irqentry_exit at ffffffffb5363a1d
#17 [ffffb5b65e5e7f30] sysvec_apic_timer_interrupt at ffffffffb5362b6e
#18 [ffffb5b65e5e7f50] entry_SYSCALL_64_after_hwframe at ffffffffb5400099
RIP: 00007f7004c274ba RSP: 00007ffc14cdbcf8 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 000055be101148b0 RCX: 00007f7004c274ba
RDX: 00007ffc14cdbd90 RSI: 00007ffc14cdbd90 RDI: 000055be10111f40
RBP: 000055be10111ee0 R8: 0000000000000001 R9: 000000000000007c
R10: ffffffffffffffff R11: 0000000000000246 R12: 000055be10111e20
R13: 000055be10111fa0 R14: 0000000010111e00 R15: 00007ffc14cdbd70
ORIG_RAX: 0000000000000004 CS: 0033 SS: 002b
crash>
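(To spell out the mechanism as I read the backtrace: df issues stat() on the FUSE mount, the kernel turns that into a GETATTR request, and request_wait_answer() blocks until the daemon replies. A trivial illustration, with a hypothetical mount path:)

/* Illustration with a hypothetical path: any stat() under a FUSE mount whose
 * daemon never answers blocks in fuse_simple_request()->request_wait_answer(),
 * giving a backtrace like the df one above. */
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    struct stat st;
    if (stat("/path/under/fuse/mount", &st) != 0)   /* hypothetical path */
        perror("stat");
    return 0;   /* with an unresponsive daemon we never get here */
}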
Reproducer: hello.c from the libfuse examples, but with a getattr that simply does a huge sleep(), effectively making the FUSE daemon "unresponsive":
static int hello_getattr(const char *path, struct stat *stbuf)
{
    int res = 0;
    sleep(1000000);
    return res;
}
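(For reference, a self-contained sketch of such a reproducer against libfuse 2.x; the wiring around the handler is my reconstruction from the hello example, not the exact file I ran.)

/* Sketch of a complete libfuse 2.x reproducer (reconstructed from the hello
 * example; build with: gcc hello.c -o hello `pkg-config fuse --cflags --libs`). */
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static int hello_getattr(const char *path, struct stat *stbuf)
{
    (void) path;
    memset(stbuf, 0, sizeof(struct stat));
    sleep(1000000);   /* never answer: callers block in request_wait_answer() */
    return 0;
}

static struct fuse_operations hello_oper = {
    .getattr = hello_getattr,
};

int main(int argc, char *argv[])
{
    /* fuse_main() uses the multithreaded loop by default (unless -s), spawning
     * a worker thread per concurrent request, so a pids limit is reached fast. */
    return fuse_main(argc, argv, &hello_oper, NULL);
}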
We mount the FUSE fs with the shell in which we start the daemon added to a cgroup with a low pids limit (reproducer is a new folder I created under /sys/fs/cgroup), and then ran these in it:
echo "+pids" > cgroup.subtree_control
echo pid_of_the_shell > cgroup.procs (echo $$)
echo 10000 | tee pids.max
Start the daemon:
root@z-rotomvm03:/sys/fs/cgroup/reproducer# /home/ubuntu/fuse_fs/hello -d -f /home/ubuntu/for_mount/
FUSE library version: 2.9.9
nullpath_ok: 0
nopath: 0
utime_omit_ok: 0
unique: 2, opcode: INIT (26), nodeid: 0, insize: 56, pid: 0
INIT: 7.34
flags=0x33fffffb
max_readahead=0x00020000
INIT: 7.19
flags=0x00000011
max_readahead=0x00020000
max_write=0x00020000
max_background=0
congestion_threshold=0
unique: 2, success, outsize: 40
and then, from another shell, send more than pids.max "stat" calls, like:
for i in $(seq 15000); do echo $i; stat ../for_mount & done
We see "fuse: error creating thread: Resource temporarily unavailable" errors from the fuse daemon, especially if i mount 2 different fuse filesystems at the same time in different locations (presumably thats what the openstack mgmt server does, mounting lxcfs filesystems for each container?)
A df issued after it had been hung for a long time with tasks in S+ showed these traces (https://paste.ubuntu.com/p/r6kYhXXdxJ/); somehow all the blocked tasks went to the UN state after we issued the df. I am not sure if it is related, but it happened a few minutes after the df was issued, whereas it had been hung for 24 hours without the tasks going into the UN (unkillable) state.
Vmcores from the hung server (not my reproducer) show a lot of mtail tasks in the kernel. One vmcore had 25k+ lxcfs tasks; lots of mtail (and apache) tasks are also seen.
crash> ps | grep mtail | wc -l
3361
Another vmcore had 4202 lxcfs threads in the kernel at the time of the panic.
crash> ps | grep lxcfs | wc -l
4202
Thanks @nkshirsagar, the kernel crashdump that you've provided was really helpful: https://github.com/lxc/lxcfs/pull/579
Of course, it's just a fix for a recently added bug and doesn't fix the original issue. So I think we need to keep this issue open, because it's most likely caused by something inside libfuse.
I think we can close this as fixed.
cc @stgraber