pstack on the lxcfs process shows 1000+ threads
We have found that commands like top/ps/free run very slowly inside our containers. Attaching pstack to the lxcfs process shows 1000+ threads, and attaching strace to top/ps/free shows that a futex stays locked forever.
pstack info

futex locked info:
futex(0x556b5a3a0508, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x556b5a3a0508, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x556b5a3a0508, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x556b5a3a0508, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x556b5a3a0508, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x556b5a3a0508, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x556b5a3a0508, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x556b5a3a0508, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x556b5a3a0508, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x556b5a3a0508, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x556b5a3a0508, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x556b5a3a0508, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x556b5a3a0508, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
Could someone help, please?
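(For anyone else debugging this: the following is just an illustration I wrote, not lxcfs code. A glibc mutex that one thread holds and never releases makes every waiter show up under strace -f with exactly this futex(..., FUTEX_WAIT_PRIVATE, ...) pattern, which is consistent with a lock inside lxcfs never being released.)

/* Illustration only, not lxcfs code: one thread holds a mutex forever,
 * so the waiter blocks in the kernel and `strace -f` shows it sitting in
 * futex(<addr>, FUTEX_WAIT_PRIVATE, ...), the same pattern as above.
 * Build with: gcc -pthread demo.c -o demo */
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *waiter(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&lock);   /* never acquired: blocks in FUTEX_WAIT_PRIVATE */
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_mutex_lock(&lock);   /* hold the lock and never release it */
    pthread_create(&t, NULL, waiter, NULL);
    pthread_join(&t, NULL);      /* never returns */
    return 0;
}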
Hm... is this a deadlock in libfuse2? Can you possibly compile against libfuse3 and see whether you can reproduce?
Hello @brauner
I'm seeing a situation where an OpenStack management node has panicked due to a hung task, with all FUSE tasks waiting in fuse_simple_request()->request_wait_answer().
There are more than 24 thousand lxcfs threads in the vmcore, and they all seem to be waiting in futex_wait.
crash> ps | grep lxcfs | wc -l
24387
crash> ps | grep lxd | wc -l
78
crash>
crash> bt
PID: 1177107 TASK: ffff9aee2f1da000 CPU: 7 COMMAND: "df"
#0 [ffffb5b65e5e7a70] __schedule at ffffffffb536d0ed
#1 [ffffb5b65e5e7af8] schedule at ffffffffb536d719
#2 [ffffb5b65e5e7b18] request_wait_answer at ffffffffb4afd906
#3 [ffffb5b65e5e7b78] fuse_simple_request at ffffffffb4b0005c
#4 [ffffb5b65e5e7bb0] fuse_do_getattr at ffffffffb4b01087
#5 [ffffb5b65e5e7cd8] fuse_getattr at ffffffffb4b02be9
#6 [ffffb5b65e5e7d10] vfs_getattr_nosec at ffffffffb498920a
#7 [ffffb5b65e5e7d20] vfs_getattr at ffffffffb4989267
#8 [ffffb5b65e5e7d50] vfs_statx at ffffffffb4989309
#9 [ffffb5b65e5e7da8] __do_sys_newstat at ffffffffb498989e
#10 [ffffb5b65e5e7e58] __x64_sys_newstat at ffffffffb49898f6
#11 [ffffb5b65e5e7e68] do_syscall_64 at ffffffffb535f259
#12 [ffffb5b65e5e7e88] exit_to_user_mode_prepare at ffffffffb476a1bc
#13 [ffffb5b65e5e7ec8] syscall_exit_to_user_mode at ffffffffb5363957
#14 [ffffb5b65e5e7ee0] do_syscall_64 at ffffffffb535f269
#15 [ffffb5b65e5e7f10] irqentry_exit_to_user_mode at ffffffffb53639e9
#16 [ffffb5b65e5e7f20] irqentry_exit at ffffffffb5363a1d
#17 [ffffb5b65e5e7f30] sysvec_apic_timer_interrupt at ffffffffb5362b6e
#18 [ffffb5b65e5e7f50] entry_SYSCALL_64_after_hwframe at ffffffffb5400099
RIP: 00007f7004c274ba RSP: 00007ffc14cdbcf8 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 000055be101148b0 RCX: 00007f7004c274ba
RDX: 00007ffc14cdbd90 RSI: 00007ffc14cdbd90 RDI: 000055be10111f40
RBP: 000055be10111ee0 R8: 0000000000000001 R9: 000000000000007c
R10: ffffffffffffffff R11: 0000000000000246 R12: 000055be10111e20
R13: 000055be10111fa0 R14: 0000000010111e00 R15: 00007ffc14cdbd70
ORIG_RAX: 0000000000000004 CS: 0033 SS: 002b
crash>
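(To spell out the mechanism as I read the backtrace: df issues stat() on the FUSE mount, the kernel turns that into a GETATTR request, and request_wait_answer() blocks until the daemon replies. A trivial illustration, with a hypothetical mount path:)

/* Illustration with a hypothetical path: any stat() under a FUSE mount whose
 * daemon never answers blocks in fuse_simple_request()->request_wait_answer(),
 * giving a backtrace like the df one above. */
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    struct stat st;
    if (stat("/path/under/fuse/mount", &st) != 0)   /* hypothetical path */
        perror("stat");
    return 0;   /* with an unresponsive daemon we never get here */
}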
Reproducer: hello.c from the libfuse examples, but with a getattr that simply does a huge sleep(), effectively making the FUSE daemon "unresponsive":
static int hello_getattr(const char *path, struct stat *stbuf)
{
    int res = 0;
    sleep(1000000);
    return res;
}
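(For reference, a self-contained sketch of such a reproducer against libfuse 2.x; the wiring around the handler is my reconstruction from the hello example, not the exact file I ran.)

/* Sketch of a complete libfuse 2.x reproducer (reconstructed from the hello
 * example; build with: gcc hello.c -o hello `pkg-config fuse --cflags --libs`). */
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static int hello_getattr(const char *path, struct stat *stbuf)
{
    (void) path;
    memset(stbuf, 0, sizeof(struct stat));
    sleep(1000000);   /* never answer: callers block in request_wait_answer() */
    return 0;
}

static struct fuse_operations hello_oper = {
    .getattr = hello_getattr,
};

int main(int argc, char *argv[])
{
    /* fuse_main() uses the multithreaded loop by default (unless -s), spawning
     * a worker thread per concurrent request, so a pids limit is reached fast. */
    return fuse_main(argc, argv, &hello_oper, NULL);
}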
We mount the FUSE fs with the shell in which we start the daemon added to a cgroup with a low pids limit (reproducer is a new folder I created under /sys/fs/cgroup), and then ran these in it:
echo "+pids" > cgroup.subtree_control
echo pid_of_the_shell > cgroup.procs (echo $$)
echo 10000 | tee pids.max
Start the daemon:
root@z-rotomvm03:/sys/fs/cgroup/reproducer# /home/ubuntu/fuse_fs/hello -d -f /home/ubuntu/for_mount/
FUSE library version: 2.9.9
nullpath_ok: 0
nopath: 0
utime_omit_ok: 0
unique: 2, opcode: INIT (26), nodeid: 0, insize: 56, pid: 0
INIT: 7.34
flags=0x33fffffb
max_readahead=0x00020000
INIT: 7.19
flags=0x00000011
max_readahead=0x00020000
max_write=0x00020000
max_background=0
congestion_threshold=0
unique: 2, success, outsize: 40
and then, from another shell, send more than pids.max "stat" calls, like:
for i in $(seq 15000); do echo $i; stat ../for_mount & done
We see "fuse: error creating thread: Resource temporarily unavailable" errors from the fuse daemon, especially if i mount 2 different fuse filesystems at the same time in different locations (presumably thats what the openstack mgmt server does, mounting lxcfs filesystems for each container?)
A df issued after it had been hung for a long time with tasks in S+ showed these traces (https://paste.ubuntu.com/p/r6kYhXXdxJ/); somehow all the blocked tasks went to the UN state after we issued the df. I am not sure if it is related, but it happened a few minutes after the df was issued, whereas it had been hung for 24 hours without the tasks going into the UN (unkillable) state.
Vmcores from the hung server (not my reproducer) show a lot of mtail tasks in the kernel. One vmcore had 25k+ lxcfs tasks; lots of mtail (and apache) tasks are also seen.
crash> ps | grep mtail | wc -l
3361
Another vmcore had 4202 lxcfs threads in the kernel at the time of the panic.
crash> ps | grep lxcfs | wc -l
4202
Thanks @nkshirsagar, the kernel crashdump that you've provided was really helpful: https://github.com/lxc/lxcfs/pull/579
Of course, it's just a fix for a recently added bug and doesn't fix the original issue. So I think we need to keep this issue open, because it's most likely caused by something inside libfuse.
I think we can close this as fixed.
cc @stgraber