
infinite loop when shutting down

Open freedge opened this issue 10 months ago • 13 comments

Detailed Description of the Problem

This was originally observed on a HAProxy 2.6 deployment in OpenShift. When shutting down, we observe a race condition where the thread with tid=0 waits for the other threads to complete, while some of the remaining threads loop at 100% CPU doing epoll_wait and clock_gettime syscalls.
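
For context, here is my rough understanding of what the spinning threads are doing, as a simplified sketch assembled from the run_poll_loop() fragments quoted further down in this issue; it is not the literal HAProxy code, and the "no tid-0 thread ever dequeues the signal" and zero-timeout parts are my assumptions:

/* simplified sketch of one run_poll_loop() iteration on a non-tid0 thread
 * during shutdown (assumption: only tid 0 dequeues signals, so
 * signal_queue_len stays > 0 for the other threads) */
while (1) {
        _HA_ATOMIC_OR(&th_ctx->flags, TH_FL_SLEEPING);
        _HA_ATOMIC_AND(&th_ctx->flags, ~TH_FL_NOTIFIED);
        __ha_barrier_atomic_store();

        if (thread_has_tasks()) {
                activity[tid].wake_tasks++;
                _HA_ATOMIC_AND(&th_ctx->flags, ~TH_FL_SLEEPING);
        } else if (signal_queue_len) {
                /* meant to avoid a race with wake_threads(), but on a
                 * non-tid0 thread this clears TH_FL_SLEEPING on every
                 * single iteration while stopping */
                _HA_ATOMIC_AND(&th_ctx->flags, ~TH_FL_SLEEPING);
        }

        /* not sleeping => the poller does not wait (my assumption), so
         * _do_poll() -> clock_entering_poll() -> clock_gettime() plus
         * epoll_wait() with a 0 timeout return immediately: 100% CPU */
        _do_poll(p, exp, wake);
}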

Expected Behavior

haproxy should shut down cleanly

Steps to Reproduce the Behavior

  1. As this race condition seems rare, we apply this patch to trigger it reliably:
diff --git a/src/haproxy.c b/src/haproxy.c
index be8b587cc..61ac22d85 100644
--- a/src/haproxy.c
+++ b/src/haproxy.c
@@ -3069,6 +3069,7 @@ void run_poll_loop()
                        _HA_ATOMIC_OR(&th_ctx->flags, TH_FL_SLEEPING);
                        _HA_ATOMIC_AND(&th_ctx->flags, ~TH_FL_NOTIFIED);
                        __ha_barrier_atomic_store();
+                       if (!signal_queue_len) sleep(tid);
                        if (thread_has_tasks())
  2. make TARGET=linux-glibc USE_OPENSSL=0 USE_ZLIB=0 USE_PCRE=0 USE_LUA=0 USE_EPOLL=1
  3. ./haproxy -f examples/quick-test.cfg -p ${XDG_RUNTIME_DIR}/haproxy.pid -d
  4. while sleep 1 ; do ps -p `pgrep haproxy` -T -o tid,wchan; done
  5. wait for the threads to say "ep_poll"
  6. pkill -USR1 haproxy
  7. wait until there are 4 threads remaining
  8. pkill -USR1 haproxy
  9. feel the laptop heating and burning my legs (see also the syscall-count check sketched after this list)
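
A quick way to confirm the spin from the outside (plain strace, nothing HAProxy-specific; just a sketch, assuming a single haproxy process):

# attach to all haproxy threads and count syscalls
sudo strace -f -c -p "$(pgrep haproxy)"
# stop with Ctrl-C after a few seconds; epoll_wait and clock_gettime dominate the summary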

Do you have any idea what may have caused this?

I believe some sort of race condition

Do you have an idea how to solve the issue?

No response

What is your configuration?

haproxy from main branch

Output of haproxy -vv

HAProxy version 3.0-dev7-ab7f05-82 2024/04/17 - https://haproxy.org/
Status: development branch - not safe for use in production.
Known bugs: https://github.com/haproxy/haproxy/issues?q=is:issue+is:open
Running on: Linux 5.14.0-440.frigo.el9.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Apr 16 16:02:50 UTC 2024 x86_64
Build options :
  TARGET  = linux-glibc
  CC      = cc
  CFLAGS  = -O2 -g -fwrapv
  OPTIONS = USE_EPOLL=1 USE_OPENSSL=0 USE_LUA=0 USE_ZLIB=0 USE_PCRE=0
  DEBUG   =

Feature list : -51DEGREES +ACCEPT4 +BACKTRACE -CLOSEFROM +CPU_AFFINITY +CRYPT_H -DEVICEATLAS +DL -ENGINE +EPOLL -EVPORTS +GETADDRINFO -KQUEUE -LIBATOMIC +LIBCRYPT +LINUX_CAP +LINUX_SPLICE +LINUX_TPROXY -LUA -MATH -MEMORY_PROFILING +NETFILTER +NS -OBSOLETE_LINKER -OPENSSL -OPENSSL_AWSLC -OPENSSL_WOLFSSL -OT -PCRE -PCRE2 -PCRE2_JIT -PCRE_JIT +POLL +PRCTL -PROCCTL -PROMEX -PTHREAD_EMULATION -QUIC -QUIC_OPENSSL_COMPAT +RT +SHM_OPEN +SLZ -SSL -STATIC_PCRE -STATIC_PCRE2 +SYSTEMD +TFO +THREAD +THREAD_DUMP +TPROXY -WURFL -ZLIB

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_TGROUPS=16, MAX_THREADS=256, default=9).
Built with network namespace support.
Built with libslz for stateless compression.
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built without PCRE or PCRE2 support (using libc's regex instead)
Encrypted password support via crypt(3): yes
Built with gcc compiler version 11.4.1 20231218 (Red Hat 11.4.1-3)

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
         h2 : mode=HTTP  side=FE|BE  mux=H2    flags=HTX|HOL_RISK|NO_UPG
       fcgi : mode=HTTP  side=BE     mux=FCGI  flags=HTX|HOL_RISK|NO_UPG
  <default> : mode=HTTP  side=FE|BE  mux=H1    flags=HTX
         h1 : mode=HTTP  side=FE|BE  mux=H1    flags=HTX|NO_UPG
  <default> : mode=TCP   side=FE|BE  mux=PASS  flags=
       none : mode=TCP   side=FE|BE  mux=PASS  flags=NO_UPG

Available services : none

Available filters :
        [BWLIM] bwlim-in
        [BWLIM] bwlim-out
        [CACHE] cache
        [COMP] compression
        [FCGI] fcgi-app
        [SPOE] spoe
        [TRACE] trace

Last Outputs and Backtraces

(gdb) thread apply all bt
Thread 4 (Thread 0x7ff0ee7fb640 (LWP 18420) "haproxy"):
#0  0x00007ffdd55e2b95 in ?? ()
#1  0x00007ff0facd3f4d in __GI___clock_gettime (clock_id=clock_id@entry=3, tp=tp@entry=0x7ff0ee7f0b50) at ../sysdeps/unix/sysv/linux/clock_gettime.c:42
#2  0x00000000005ca064 in now_cpu_time () at src/clock.c:91
#3  clock_entering_poll () at src/clock.c:402
#4  0x0000000000423f58 in _do_poll (p=<optimized out>, exp=<optimized out>, wake=<optimized out>) at src/ev_epoll.c:227
#5  0x000000000052b506 in run_poll_loop () at src/haproxy.c:3131
#6  0x000000000052bbb7 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3273
#7  0x00007ff0fac89c02 in start_thread (arg=<optimized out>) at pthread_create.c:443
#8  0x00007ff0fad0ec40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Thread 3 (Thread 0x7ff0eeffc640 (LWP 18419) "haproxy"):
#0  0x00007ffdd55e2b95 in ?? ()
#1  0x00007ff0facd3f4d in __GI___clock_gettime (clock_id=clock_id@entry=3, tp=tp@entry=0x7ff0eeff1b50) at ../sysdeps/unix/sysv/linux/clock_gettime.c:42
#2  0x00000000005ca064 in now_cpu_time () at src/clock.c:91
#3  clock_entering_poll () at src/clock.c:402
#4  0x0000000000423f58 in _do_poll (p=<optimized out>, exp=<optimized out>, wake=<optimized out>) at src/ev_epoll.c:227
#5  0x000000000052b506 in run_poll_loop () at src/haproxy.c:3131
#6  0x000000000052bbb7 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3273
#7  0x00007ff0fac89c02 in start_thread (arg=<optimized out>) at pthread_create.c:443
#8  0x00007ff0fad0ec40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Thread 2 (Thread 0x7ff0f89d9640 (LWP 18416) "haproxy"):
#0  0x00007ffdd55e2b95 in ?? ()
#1  0x00007ff0facd3f4d in __GI___clock_gettime (clock_id=clock_id@entry=3, tp=tp@entry=0x7ff0f89ceb50) at ../sysdeps/unix/sysv/linux/clock_gettime.c:42
#2  0x00000000005ca064 in now_cpu_time () at src/clock.c:91
#3  clock_entering_poll () at src/clock.c:402
#4  0x0000000000423f58 in _do_poll (p=<optimized out>, exp=<optimized out>, wake=<optimized out>) at src/ev_epoll.c:227
#5  0x000000000052b506 in run_poll_loop () at src/haproxy.c:3131
#6  0x000000000052bbb7 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3273
#7  0x00007ff0fac89c02 in start_thread (arg=<optimized out>) at pthread_create.c:443
#8  0x00007ff0fad0ec40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Thread 1 (Thread 0x7ff0faf8d940 (LWP 18412) "haproxy"):
#0  __futex_abstimed_wait_common64 (private=128, cancel=true, abstime=0x0, op=265, expected=18416, futex_word=0x7ff0f89d9910) at futex-internal.c:57
#1  __futex_abstimed_wait_common (futex_word=futex_word@entry=0x7ff0f89d9910, expected=18416, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=128, cancel=cancel@entry=true) at futex-internal.c:87
#2  0x00007ff0fac867ff in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7ff0f89d9910, expected=<optimized out>, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=128) at futex-internal.c:139
#3  0x00007ff0fac8b6d3 in __pthread_clockjoin_ex (threadid=140672939955776, thread_return=thread_return@entry=0x0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, block=block@entry=true) at pthread_join_common.c:102
#4  0x00007ff0fac8b5a3 in ___pthread_join (threadid=<optimized out>, thread_return=thread_return@entry=0x0) at pthread_join.c:25
#5  0x00000000005a97c3 in wait_for_threads_completion () at src/thread.c:264
#6  0x0000000000420b07 in main (argc=<optimized out>, argv=<optimized out>) at src/haproxy.c:3991


(gdb) p signal_queue_len
$1 = 1

Additional Information

$ sudo perf stat -e 'syscalls:sys_enter_epoll_wait' -e 'syscalls:sys_enter_clock_gettime' -p `pgrep haproxy` -I 1000
#           time             counts unit events
     1.001077725          2,960,256      syscalls:sys_enter_epoll_wait
     1.001077725          5,920,362      syscalls:sys_enter_clock_gettime
     2.002425052          3,001,638      syscalls:sys_enter_epoll_wait
     2.002425052          6,003,275      syscalls:sys_enter_clock_gettime
     3.003617281          2,951,947      syscalls:sys_enter_epoll_wait
     3.003617281          5,903,900      syscalls:sys_enter_clock_gettime
     4.004911210          3,030,400      syscalls:sys_enter_epoll_wait
     4.004911210          6,060,803      syscalls:sys_enter_clock_gettime
     5.006179439          3,028,230      syscalls:sys_enter_epoll_wait
     5.006179439          6,056,459      syscalls:sys_enter_clock_gettime
^C     5.289030721            854,586      syscalls:sys_enter_epoll_wait
     5.289030721          1,709,160      syscalls:sys_enter_clock_gettime
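
That is roughly 3 million epoll_wait() and 6 million clock_gettime() calls per second across the spinning threads, i.e. about two clock_gettime() per poll iteration (presumably one when entering and one when leaving the poller, matching the clock_entering_poll() frames in the backtraces above), so each iteration does essentially nothing but poll with a zero timeout.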

freedge avatar Apr 17 '24 10:04 freedge

Hi Freedge !

Thanks a lot for reporting this !

I will try to reproduce it. Just to confirm: do you observe the same with 3.0-dev7-ab7f05-82 2024/04/17?

Many thanks in advance, Regards,

vkssv avatar Apr 17 '24 14:04 vkssv

So for my reproducer I am using the main branch, which is commit ab7f05daba5d640b558465eb505e9909b7a5af14 at the moment, and it indeed reports "3.0-dev7-ab7f05-82" as the version. But of course I had to add one line with a sleep to make it fail as intended; I can work on a reproducer with just gdb or stap if you believe there's some value in it.

On OpenShift the problem is reproduced with haproxy26-2.6.13-2.rhaos4.14.el8. The perf stat and gdb backtraces are identical there. Overall it was visible because of the high "system" CPU consumption on the node.
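
On the node this kind of spin is easy to spot with any per-thread CPU view; for example (sysstat's pidstat, just one possible check, assuming a single haproxy process):

pidstat -t -p "$(pgrep haproxy)" 1
# the spinning TIDs show up near 100% CPU, mostly in %system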

freedge avatar Apr 17 '24 15:04 freedge

Ok, thanks for details !

The steps you've provided and the patch to make the behavior reproducible are completely sufficient for the moment, thanks for this!

vkssv avatar Apr 17 '24 15:04 vkssv

Hi Freedge !

Could you please share your glibc version with us (ldd --version)?

Regards,

vkssv avatar Apr 24 '24 16:04 vkssv

on my centos VM: ldd (GNU libc) 2.34 from glibc-common-2.34-105.el9.x86_64

freedge avatar Apr 24 '24 17:04 freedge

Hi Freedge !

Could you please try to compile haproxy with the flag DEBUG=-DDEBUG_THREAD and reproduce the infinite loop:

make TARGET=linux-glibc USE_OPENSSL=0 USE_ZLIB=0 USE_PCRE=0 USE_LUA=0 USE_EPOLL=1 DEBUG=-DDEBUG_THREAD

This will enable the main thread to dump lock statistics to stderr in wait_for_threads_completion() here. This could provide more insight.

Kind regards,

vkssv avatar Apr 29 '24 17:04 vkssv

Hi! This should be trivial to reproduce. As per the code you link, show_lock_stats() will only log once pthread_join() completes, which is not the case here.

But I can trigger it from gdb if you feel it can be of any use:

  3    Thread 0x7f63327fb640 (LWP 15143) "haproxy" 0x00007ffde9b82b95 in ?? ()
(gdb) bt
#0  __futex_abstimed_wait_common64 (private=128, cancel=true, abstime=0x0, op=265, expected=15142, futex_word=0x7f6332ffc910) at futex-internal.c:57
#1  __futex_abstimed_wait_common (futex_word=futex_word@entry=0x7f6332ffc910, expected=15142, clockid=clockid@entry=0, abstime=abstime@entry=0x0,
    private=private@entry=128, cancel=cancel@entry=true) at futex-internal.c:87
#2  0x00007f633ee867ff in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7f6332ffc910, expected=<optimized out>, clockid=clockid@entry=0,
    abstime=abstime@entry=0x0, private=private@entry=128) at futex-internal.c:139
#3  0x00007f633ee8b6d3 in __pthread_clockjoin_ex (threadid=140064034113088, thread_return=thread_return@entry=0x0, clockid=clockid@entry=0, abstime=abstime@entry=0x0,
    block=block@entry=true) at pthread_join_common.c:102
#4  0x00007f633ee8b5a3 in ___pthread_join (threadid=<optimized out>, thread_return=thread_return@entry=0x0) at pthread_join.c:25
#5  0x00000000005a0973 in wait_for_threads_completion () at src/thread.c:264
#6  0x0000000000421b97 in main (argc=<optimized out>, argv=<optimized out>) at src/haproxy.c:3993
(gdb) up
#1  __futex_abstimed_wait_common (futex_word=futex_word@entry=0x7f6332ffc910, expected=15142, clockid=clockid@entry=0, abstime=abstime@entry=0x0,
    private=private@entry=128, cancel=cancel@entry=true) at futex-internal.c:87
87        err = __futex_abstimed_wait_common64 (futex_word, expected, op, abstime,
(gdb)
#2  0x00007f633ee867ff in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7f6332ffc910, expected=<optimized out>, clockid=clockid@entry=0,
    abstime=abstime@entry=0x0, private=private@entry=128) at futex-internal.c:139
139       return __futex_abstimed_wait_common (futex_word, expected, clockid,
(gdb)
#3  0x00007f633ee8b6d3 in __pthread_clockjoin_ex (threadid=140064034113088, thread_return=thread_return@entry=0x0, clockid=clockid@entry=0, abstime=abstime@entry=0x0,
    block=block@entry=true) at pthread_join_common.c:102
Downloading source file /usr/src/debug/glibc-2.34-105.el9.x86_64/nptl/pthread_join_common.c...
102               int ret = __futex_abstimed_wait_cancelable64 (
(gdb)
#4  0x00007f633ee8b5a3 in ___pthread_join (threadid=<optimized out>, thread_return=thread_return@entry=0x0) at pthread_join.c:25
Downloading source file /usr/src/debug/glibc-2.34-105.el9.x86_64/nptl/pthread_join.c...
25        return __pthread_clockjoin_ex (threadid, thread_return, 0 /* Ignored */,
(gdb) up
#5  0x00000000005a0973 in wait_for_threads_completion () at src/thread.c:264
264                     pthread_join(ha_pthread[i], NULL);
(gdb) list
259     {
260             int i;
261
262             /* Wait the end of other threads */
263             for (i = 1; i < global.nbthread; i++)
264                     pthread_join(ha_pthread[i], NULL);
265
266     #if defined(DEBUG_THREAD) || defined(DEBUG_FULL)
267             show_lock_stats();
268     #endif
(gdb) p show_lock_stats()
$1 = void
Available filters :
        [BWLIM] bwlim-in
        [BWLIM] bwlim-out
        [CACHE] cache
        [COMP] compression
        [FCGI] fcgi-app
        [SPOE] spoe
        [TRACE] trace
Using epoll() as the polling mechanism.
[WARNING]  (15135) : Proxy p stopped (cumulated conns: FE: 0, BE: 0).
Stats about Lock TASK_RQ: not used
Stats about Lock TASK_WQ: not used
Stats about Lock LISTENER:
         # write lock  : 2
         # write unlock: 2 (0)
         # wait time for write     : 0.000 msec
         # wait time for write/lock: 100.000 nsec
Stats about Lock PROXY:
         # write lock  : 4
         # write unlock: 4 (0)
         # wait time for write     : 0.000 msec
         # wait time for write/lock: 100.000 nsec
Stats about Lock SERVER: not used
Stats about Lock LBPRM: not used
Stats about Lock SIGNALS: not used
Stats about Lock STK_TABLE: not used
Stats about Lock STK_SESS: not used
Stats about Lock APPLETS: not used
Stats about Lock PEER: not used
Stats about Lock SHCTX: not used
Stats about Lock SSL: not used
Stats about Lock SSL_GEN_CERTS: not used
Stats about Lock PATREF: not used
Stats about Lock PATEXP:
         # write lock  : 2
         # write unlock: 2 (0)
         # wait time for write     : 0.000 msec
         # wait time for write/lock: 50.000 nsec
Stats about Lock VARS: not used
Stats about Lock COMP_POOL: not used
Stats about Lock LUA: not used
Stats about Lock NOTIF: not used
Stats about Lock SPOE_APPLET: not used
Stats about Lock DNS:
         # write lock  : 1
         # write unlock: 1 (0)
         # wait time for write     : 0.000 msec
         # wait time for write/lock: 0.000 nsec
Stats about Lock PID_LIST: not used
Stats about Lock EMAIL_ALERTS: not used
Stats about Lock PIPES: not used
Stats about Lock TLSKEYS_REF: not used
Stats about Lock AUTH: not used
Stats about Lock RING: not used
Stats about Lock DICT: not used
Stats about Lock PROTO:
         # write lock  : 11
         # write unlock: 11 (0)
         # wait time for write     : 0.001 msec
         # wait time for write/lock: 54.545 nsec
Stats about Lock QUEUE: not used
Stats about Lock CKCH: not used
Stats about Lock SNI: not used
Stats about Lock SSL_SERVER: not used
Stats about Lock SFT: not used
Stats about Lock IDLE_CONNS: not used
Stats about Lock OCSP: not used
Stats about Lock QC_CID: not used
Stats about Lock CACHE: not used
Stats about Lock OTHER: not used
Stats about Lock DEBUG1: not used
Stats about Lock DEBUG2: not used
Stats about Lock DEBUG3: not used
Stats about Lock DEBUG4: not used
Stats about Lock DEBUG5: not used

freedge avatar Apr 29 '24 18:04 freedge

Hi Freedge !

Could you please try the patch below and give us your feedback?

$ git diff
diff --git a/src/haproxy.c b/src/haproxy.c
index 8cc897e75..6850d243e 100644
--- a/src/haproxy.c
+++ b/src/haproxy.c
@@ -3087,7 +3087,7 @@ void run_poll_loop()
                        if (thread_has_tasks()) {
                                activity[tid].wake_tasks++;
                                _HA_ATOMIC_AND(&th_ctx->flags, ~TH_FL_SLEEPING);
-                       } else if (signal_queue_len) {
+                       } else if (signal_queue_len && tid == 0) {
                                /* this check is required after setting TH_FL_SLEEPING to avoid
                                 * a race with wakeup on signals using wake_threads() */
                                _HA_ATOMIC_AND(&th_ctx->flags, ~TH_FL_SLEEPING);

Kind regards,

vkssv avatar May 03 '24 20:05 vkssv

hi, I tried and this indeed fixes the infinite loop.

freedge avatar May 03 '24 21:05 freedge

hi, why is this closed as completed?

freedge avatar May 04 '24 12:05 freedge

Can you reopen this? @wlallemand, as maintainer of the master-worker part, maybe?

freedge avatar May 05 '24 16:05 freedge

The fix was merged and will be back-ported soon. So, I suppose we can close this one now.

vkssv avatar May 07 '24 07:05 vkssv

@vkssv please don't close tickets until the fixes are backported to all concerned branches.

wlallemand avatar May 07 '24 10:05 wlallemand