
fuse_interrupt_finish_fop blocks all glfs_epoll threads and glfs_timer thread


Description of problem: We use glusterfs in production. Recently we received frequent reports from a specific customer that his application got stuck in "D" state. We were able to capture the live scene and attach gdb to the glusterfs fuse daemon. Here is what we observed:

  • The 2 (default number) glfs_epoll threads got stuck in the fuse_interrupt_finish_fop() function.

  • The glfs_timer thread was also stuck in this function.

To be precise, they were all blocked in the pthread_cond_wait() call at this line (fuse-bridge.c:843 in the backtraces below).
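For readers without the source at hand, here is a minimal sketch of the wait pattern those threads are parked in. The struct and function names are hypothetical stand-ins, not the real fuse-bridge.c symbols; the point, as far as the backtraces suggest, is that the side that would signal the condition can itself only run after a glfs_epoll thread processes another reply, so once every epoll thread is parked in the wait, nothing can ever signal it.

```c
/*
 * Illustrative sketch only: interrupt_record, finish_fop_sync and
 * finish_interrupt are hypothetical names, not the actual fuse-bridge.c
 * symbols.  It shows the wait pattern at fuse-bridge.c:843.
 */
#include <pthread.h>
#include <stdbool.h>

struct interrupt_record {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool            handled;   /* set once the interrupt handler has finished */
};

/*
 * Runs on a glfs_epoll thread (or, after bail-out, on the glfs_timer
 * thread) from fuse_setlk_cbk when the fop comes back interrupted.
 */
static void
finish_fop_sync(struct interrupt_record *ir)
{
    pthread_mutex_lock(&ir->mutex);
    while (!ir->handled)
        /*
         * Parks the calling thread until finish_interrupt() signals.
         * But finish_interrupt() is reached from a reply callback, which
         * needs an epoll thread to dispatch it -- and with every epoll
         * thread parked here, that never happens: deadlock.
         */
        pthread_cond_wait(&ir->cond, &ir->mutex);
    pthread_mutex_unlock(&ir->mutex);
}

/* The side that would wake the waiter, reached from a reply callback. */
static void
finish_interrupt(struct interrupt_record *ir)
{
    pthread_mutex_lock(&ir->mutex);
    ir->handled = true;
    pthread_cond_signal(&ir->cond);
    pthread_mutex_unlock(&ir->mutex);
}
```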

At this point, all fresh fops failed with "Transport endpoint is not connected", and the fuse daemon was unable to recover on its own. The only way out of this state is to kill the fuse daemon and remount.

The exact command to reproduce the issue: I wrote a testing program that reliably reproduces this issue. No special setup is needed: just build the project and run the binary following the instructions in the README file.
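For readers who cannot fetch the project, here is a rough sketch of the kind of scenario such a reproducer has to create; it is not the author's program, and the mount path, process count, and choice of SIGUSR1 are assumptions. The idea is to get more interrupted setlk fops in flight than there are glfs_epoll threads: several processes block in fcntl(F_SETLKW) on the fuse mount and are then interrupted by a signal.

```c
/*
 * Rough sketch of the reproduction scenario -- NOT the author's test
 * program.  MOUNT_FILE, NPROC and SIGUSR1 are assumptions; point
 * MOUNT_FILE at a file on the glusterfs fuse mount.
 */
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MOUNT_FILE "/mnt/glustervol/lockfile"  /* assumption: file on the fuse mount */
#define NPROC 3                                /* default epoll threads (2) + 1 */

static void noop(int sig) { (void)sig; }

int main(void)
{
    int fd = open(MOUNT_FILE, O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Parent takes an exclusive lock so the children's F_SETLKW will block. */
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
    if (fcntl(fd, F_SETLK, &fl) < 0) { perror("F_SETLK"); return 1; }

    pid_t pids[NPROC];
    for (int i = 0; i < NPROC; i++) {
        if ((pids[i] = fork()) == 0) {
            /* No SA_RESTART, so the blocked fcntl() returns EINTR. */
            struct sigaction sa = { .sa_handler = noop };
            sigaction(SIGUSR1, &sa, NULL);

            int cfd = open(MOUNT_FILE, O_RDWR);
            struct flock cfl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
            fcntl(cfd, F_SETLKW, &cfl);  /* blocks in the fuse setlk fop */
            pause();                     /* park the child after the EINTR */
            _exit(0);
        }
    }

    sleep(2);                            /* give the children time to block */
    for (int i = 0; i < NPROC; i++)
        kill(pids[i], SIGUSR1);          /* each signal makes the kernel send a
                                            FUSE INTERRUPT for the pending setlk */

    /* If the bug triggers, fresh I/O on the mount (e.g. `ls` from another
     * shell) now hangs, and gdb shows the backtrace below. */
    pause();
    return 0;
}
```

With the default of 2 epoll threads, NPROC = 3 matches the "one more interrupted setlk fop than the number of epoll threads" condition described later in this report: two interrupted fops park both epoll threads, and the remaining one is left for the bail-out path to trip over.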

The full output of the command that failed: After a few "Enter" keystrokes, the testing program freezes; this is the sign that the issue has been reproduced. Attach gdb to the fuse process and switch to the glfs_epoll000 and glfs_epoll001 threads; the "bt" command shows the same stack trace in both (mine was generated from a glusterfs v10.4 debug build):

#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5580a54419f0) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5580a54419f8, cond=0x5580a54419c8) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x5580a54419c8, mutex=mutex@entry=0x5580a54419f8) at pthread_cond_wait.c:647
#3  0x00007fcf006db64e in fuse_interrupt_finish_fop (frame=frame@entry=0x5580a5441de8, this=this@entry=0x5580a5170028, sync=sync@entry=true, datap=datap@entry=0x7fcefdaea740) at fuse-bridge.c:843
#4  0x00007fcf006e93ea in fuse_setlk_cbk (frame=frame@entry=0x5580a5441de8, cookie=<optimized out>, this=0x5580a5170028, op_ret=op_ret@entry=-1, op_errno=op_errno@entry=4, lock=lock@entry=0x7fcefdaea910, xdata=0x0) at fuse-bridge.c:4753
#5  0x00007fcefc935d1e in io_stats_lk_cbk (frame=frame@entry=0x5580a5440fe8, cookie=<optimized out>, this=<optimized out>, op_ret=op_ret@entry=-1, op_errno=op_errno@entry=4, lock=lock@entry=0x7fcefdaea910, xdata=0x0) at io-stats.c:2476
#6  0x00007fcf0264e45b in default_lk_cbk (frame=0x5580a5265288, cookie=<optimized out>, this=<optimized out>, op_ret=-1, op_errno=4, lock=0x7fcefdaea910, xdata=0x0) at defaults.c:1425
#7  0x00007fcefcab5f8f in client4_0_lk_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x5580a5267528) at client-rpc-fops_v2.c:2218
#8  0x00007fcf0256dce6 in rpc_clnt_handle_reply (clnt=clnt@entry=0x5580a515c568, pollin=pollin@entry=0x5580a519a1a8) at rpc-clnt.c:765
#9  0x00007fcf0256e03d in rpc_clnt_notify (trans=0x5580a51570a8, mydata=0x5580a515c598, event=<optimized out>, data=0x5580a519a1a8) at rpc-clnt.c:932
#10 0x00007fcf0256aa4a in rpc_transport_notify (this=this@entry=0x5580a51570a8, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x5580a519a1a8) at rpc-transport.c:520
#11 0x00007fcefdbae1fc in socket_event_poll_in_async (xl=<optimized out>, async=async@entry=0x5580a519a2c0) at socket.c:2507
#12 0x00007fcefdbb3e71 in gf_async (cbk=0x7fcefdbae1d0 <socket_event_poll_in_async>, xl=<optimized out>, async=0x5580a519a2c0) at ../../../../libglusterfs/src/glusterfs/async.h:191
#13 socket_event_poll_in (notify_handled=true, this=0x5580a51570a8) at socket.c:2548
#14 socket_event_handler (event_thread_died=0 '\000', poll_err=0, poll_out=<optimized out>, poll_in=<optimized out>, data=0x5580a51570a8, gen=7, idx=2, fd=<optimized out>) at socket.c:2939
#15 socket_event_handler (fd=fd@entry=9, idx=idx@entry=2, gen=gen@entry=7, data=data@entry=0x5580a51570a8, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=0, event_thread_died=0 '\000') at socket.c:2859
#16 0x00007fcf0262386f in event_dispatch_epoll_handler (event=0x7fcefdaeb054, event_pool=0x5580a5142028) at event-epoll.c:638
#17 event_dispatch_epoll_worker (data=0x5580a5151128) at event-epoll.c:749
#18 0x00007fcf02527609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#19 0x00007fcf02278133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Once the epoll threads are blocked, the client's receive path is completely clogged and all fresh I/O requests on this volume hang. 42 seconds into the reproduction, the client's ping timer expires (ping responses are also clogged), and we see a log message like this:

[2024-05-31 16:16:25.289333 +0000] C [rpc-clnt-ping.c:152:rpc_clnt_ping_timer_expired] 0-vol1-client-0: server 10.84.2.118:56542 has not responded in the last 42 seconds, disconnecting.

The expiration of the ping timer sets the client transport state to "disconnected", and the attempt to reconnect fails because there is no thread left to pick up the response. From this point on, all fresh I/O requests on this volume fail immediately with "Transport endpoint is not connected".

After 30 minutes, the client's frame bail-out timer expires and it starts to bail out pending frames. The testing program issues one more interrupted "setlk" fop than there are epoll threads, and that "setlk" frame gets unwound before its interruption frame; when it is unwound, it blocks the timer thread in the same way it blocked the epoll threads. Below is the stack trace of the timer thread:

#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5580a52646b8, cond=0x5580a5264688) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x5580a5264688, mutex=mutex@entry=0x5580a52646b8) at pthread_cond_wait.c:647
#3  0x00007fcf006db64e in fuse_interrupt_finish_fop (frame=frame@entry=0x5580a5264488, this=this@entry=0x5580a5170028, sync=sync@entry=true, datap=datap@entry=0x7fcf006cb7d0) at fuse-bridge.c:843
#4  0x00007fcf006e93ea in fuse_setlk_cbk (frame=frame@entry=0x5580a5264488, cookie=<optimized out>, this=0x5580a5170028, op_ret=op_ret@entry=-1, op_errno=op_errno@entry=107, lock=lock@entry=0x7fcf006cb9a0, xdata=0x0) at fuse-bridge.c:4753
#5  0x00007fcefc935d1e in io_stats_lk_cbk (frame=frame@entry=0x5580a52657c8, cookie=<optimized out>, this=<optimized out>, op_ret=op_ret@entry=-1, op_errno=op_errno@entry=107, lock=lock@entry=0x7fcf006cb9a0, xdata=0x0) at io-stats.c:2476
#6  0x00007fcf0264e45b in default_lk_cbk (frame=0x5580a5265988, cookie=<optimized out>, this=<optimized out>, op_ret=-1, op_errno=107, lock=0x7fcf006cb9a0, xdata=0x0) at defaults.c:1425
#7  0x00007fcefcab5f8f in client4_0_lk_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x5580a52663a8) at client-rpc-fops_v2.c:2218
#8  0x00007fcf0256ed36 in call_bail (data=0x5580a515c568) at rpc-clnt.c:187
#9  0x00007fcf025d6b5d in gf_timer_proc (data=0x5580a515e0d8) at timer.c:153
#10 0x00007fcf02527609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#11 0x00007fcf02278133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

At this point, the remaining pending frames (which include the interruption frames for the fops that block the epoll and timer threads) have no chance of being unwound, and the client is completely stuck: the receive path is clogged, the timer-driven state machine is down, and the transport is disconnected. All fresh I/O fails, and the "setlk" requests issued by the testing program can never complete, which leaves the testing program in "D" state.

We can also set the frame timeout to a shorter value to observe the timer-thread hang sooner (see the example below).
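For instance, assuming the volume is named vol1 (taken from the log line above) and that the frame bail-out interval is governed by the network.frame-timeout volume option (seconds, default 1800 = 30 minutes), something like the following shortens the wait:

```
# Assumptions: volume name vol1; network.frame-timeout is the client
# frame bail-out interval in seconds (default 1800).
gluster volume set vol1 network.frame-timeout 300
```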

Expected results: The interrupted setlk should be handled without permanently blocking the glfs_epoll and glfs_timer threads, and the mount should continue to serve I/O.

Mandatory info: This is a code problem in the fuse xlator; I would say it has nothing to do with the volume layout. Any glusterfs volume with a client mount point is enough to reproduce the issue.

- The output of the gluster volume status command: Ditto.

- The output of the gluster volume heal command: Ditto.

- Provide logs present on following locations of client and server nodes:

- Is there any crash? Provide the backtrace and coredump: No, this is a deadlock problem, not a crash. It can easily be reproduced, and a backtrace and coredump can be collected easily.

Additional info: The "devel" branch has a commit that disables the "setlk interruption" feature; however, that commit has not been merged into any release branch yet. Maybe we should do so until there is a fundamental fix for the problem.

- The operating system / glusterfs version: Linux (Ubuntu 22.04 LTS), glusterfs v10.4

