
deadlock (resulting in big quit) from vcl_BackendEvent() racing vcldir_retire()

nigoroll opened this issue 1 year ago

Expected Behavior

.

Current Behavior

Seen while trying to reproduce https://github.com/nigoroll/libvmod-dynamic/issues/117:

This is similar to the problem fixed by #4048, but here the race happens while removing a director rather than adding one:

A director thread (updating backends) races a CLI cold event, which calls pthread_join() on the director thread. The join never completes, because the director thread is waiting for vcl_mtx, which the CLI event holds while returning the last director reference.

Thread 218 (Thread 0x7f84bb7ff6c0 (LWP 462388)):
#0  futex_wait (private=0, expected=2, futex_word=0x7f84cf422b88) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x7f84cf422b88, private=0) at ./nptl/lowlevellock.c:49
#2  0x00007f84cfcab2d2 in lll_mutex_lock_optimized (mutex=0x7f84cf422b88) at ./nptl/pthread_mutex_lock.c:48
#3  ___pthread_mutex_lock (mutex=0x7f84cf422b88) at ./nptl/pthread_mutex_lock.c:93
#4  0x0000563f69cb8f64 in Lck__Lock (lck=0x563f69e2f908 <vcl_mtx>, p=0x563f69dbf89a "vcldir_retire", l=279) at cache/cache_lck.c:124
#5  0x0000563f69cfa71a in vcldir_retire (vdir=0x7f84bdc53f80) at cache/cache_vrt_vcl.c:279
#6  0x0000563f69cf5df3 in vcldir_deref (vdir=0x7f84bdc53f80) at cache/cache_vrt_vcl.c:305
#7  0x0000563f69cf5fbe in VRT_Assign_Backend (dst=0x7f84bdc14af0, src=0x0) at cache/cache_vrt_vcl.c:343
#8  0x00007f84c2fa81f0 in ref_del (ctx=<optimized out>, r=0x7f84bdc14ad0) at vmod_dynamic.c:428
#9  0x00007f84c2faa176 in dom_update (now=1721895825.5200291, priv=<optimized out>, res=0x7f84c2fbaba0 <res_getdns>, dom=<optimized out>) at vmod_dynamic.c:742
#10 dom_lookup_thread (priv=0x7f84ba616400) at vmod_dynamic.c:820
#11 0x00007f84cfca8044 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#12 0x00007f84cfd2861c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 1 (Thread 0x7f84cfc1b1c0 (LWP 457084)):
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1  0x00007f84cfca9d9f in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x00007f84cfc5af32 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007f84cfc45472 in __GI_abort () at ./stdlib/abort.c:79
#4  0x0000563f69cc0eac in pan_ic (func=0x563f69db7986 "cli_quit", file=0x563f69db7545 "cache/cache_main.c", line=376, cond=0x563f69db79b4 "It's time for the big quit", kind=VAS_WRONG) at cache/cache_panic.c:860
#5  0x0000563f69d87135 in VAS_Fail (func=0x563f69db7986 "cli_quit", file=0x563f69db7545 "cache/cache_main.c", line=376, cond=0x563f69db79b4 "It's time for the big quit", kind=VAS_WRONG) at vas.c:67
#6  0x0000563f69cbb3d2 in cli_quit (sig=3) at cache/cache_main.c:376
#7  <signal handler called>
#8  0x00007f84cfca4da4 in __futex_abstimed_wait_common64 (private=128, cancel=true, abstime=0x0, op=265, expected=462388, futex_word=0x7f84bb7ff990) at ./nptl/futex-internal.c:57
#9  __futex_abstimed_wait_common (futex_word=futex_word@entry=0x7f84bb7ff990, expected=462388, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=128, cancel=cancel@entry=true) at ./nptl/futex-internal.c:87
#10 0x00007f84cfca4e0b in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7f84bb7ff990, expected=<optimized out>, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=128) at ./nptl/futex-internal.c:139
#11 0x00007f84cfca9b33 in __pthread_clockjoin_ex (threadid=140208058136256, thread_return=thread_return@entry=0x0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, block=block@entry=true) at ./nptl/pthread_join_common.c:102
#12 0x00007f84cfca99df in ___pthread_join (threadid=<optimized out>, thread_return=thread_return@entry=0x0) at ./nptl/pthread_join.c:24
#13 0x00007f84c2fa8623 in dom_event (dir=<optimized out>, ev=<optimized out>) at vmod_dynamic.c:1051
#14 0x0000563f69c9683d in VDI_Event (d=0x7f84ba6059f0, ev=VCL_EVENT_COLD) at cache/cache_director.c:277
#15 0x0000563f69cd9fbc in vcl_BackendEvent (vcl=0x7f84b9f16240, e=VCL_EVENT_COLD) at cache/cache_vcl.c:449
#16 0x0000563f69cd9428 in vcl_set_state (vcl=0x7f84b9f16240, state=0x563f69df22ee "0", msg=0x7ffc888a5528) at cache/cache_vcl.c:597
#17 0x0000563f69cd8eb4 in VCL_Poll () at cache/cache_vcl.c:729
#18 0x0000563f69c90ea9 in cli_cb_before (cli=0x7f84cf4d2810) at cache/cache_cli.c:80
#19 0x0000563f69d8b359 in cls_exec (cfd=0x7f84cf4d27e0, av=0x7f84cf42c400, ac=2) at vcli_serve.c:303
#20 0x0000563f69d8a2c6 in cls_feed (cfd=0x7f84cf4d27e0, p=0x7ffc888a5672 "\nold\n727.1721895820.531743/vgc.so 1auto\nLIBC_2.2.5", e=0x7ffc888a5673 "old\n727.1721895820.531743/vgc.so 1auto\nLIBC_2.2.5") at vcli_serve.c:412
#21 0x0000563f69d89d8a in VCLS_Poll (cs=0x7f84cf422b30, cli=0x7f84cf4d2810, timeout=-1) at vcli_serve.c:617
#22 0x0000563f69c90cf2 in CLI_Run () at cache/cache_cli.c:110
#23 0x0000563f69cbb0e5 in child_main (sigmagic=1, altstksz=81920) at cache/cache_main.c:467
#24 0x0000563f69d285fb in mgt_launch_child (cli=0x0) at mgt/mgt_child.c:402
#25 0x0000563f69d28091 in MCH_Start_Child () at mgt/mgt_child.c:692
#26 0x0000563f69d2f502 in main (argc=9, argv=0x7ffc888a7af8) at mgt/mgt_main.c:959


(gdb) thr 218
[Switching to thread 218 (Thread 0x7f84bb7ff6c0 (LWP 462388))]
#0  futex_wait (private=0, expected=2, futex_word=0x7f84cf422b88) at ../sysdeps/nptl/futex-internal.h:146
146	../sysdeps/nptl/futex-internal.h: No such file or directory.
(gdb) fr 3
#3  ___pthread_mutex_lock (mutex=0x7f84cf422b88) at ./nptl/pthread_mutex_lock.c:93
93	./nptl/pthread_mutex_lock.c: No such file or directory.
(gdb) p *mutex
$1 = {__data = {__lock = 2, __count = 0, __owner = 457084, __nusers = 1, __kind = 2, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
  __size = "\002\000\000\000\000\000\000\000|\371\006\000\001\000\000\000\002", '\000' <repeats 22 times>, __align = 2}

Note that __owner = 457084 is the LWP of Thread 1: the CLI thread holds vcl_mtx while blocked in pthread_join().

Possible Solution

I see no obvious solution to this problem at this point.

Steps to Reproduce (for bugs)

No response

Context

.

Varnish Cache version

No response

Operating system

No response

Source of binary packages used (if any)

No response

nigoroll, Jul 25 '24 08:07