
haproxy 2.9.5 (solaris) external-check command goes into an infinite loop

Open geerttouquet opened this issue 11 months ago • 14 comments

Detailed Description of the Problem

If I start the haproxy server with an external check, the check runs in an infinite loop. This is running on Solaris. Note: haproxy 1.6.5 does not have this problem.

my external script
root@m8hasprod /usr/local/haproxy/conf> more /usr/local/haproxy/bin/checkFileExistPRODOFF_MANAGEMENT_test.bash
#!/usr/bin/bash
VIP=$1
VPT=$2
RIP=$3
# VIP and VPT are not used
file="/usr/local/haproxy/healthchecks/PRODOFF_MANAGEMENT/"$RIP".offline"
if test -e $file; then
    exit 1
fi
echo 1 >> /usr/local/tmp/check
exit 0

haproxy configuration : see below

In the file /usr/local/tmp/check there are many 1s (infinite loop).
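To see how often the check really runs, rather than only that it runs a lot, each invocation can log a timestamp instead of a bare `1`. The following is a minimal sketch, not the script used in this report: the function names `log_check` and `check_offline_marker`, and passing the marker directory and log file as arguments, are assumptions made for illustration. As in the original script, a line is only logged when the check passes.

```shell
# Sketch only: log_check and check_offline_marker are hypothetical names.

# Append one timestamped line per check run, so the check rate can be read
# from the log file afterwards.
log_check() {
    local rip=$1 logfile=$2
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$rip" >> "$logfile"
}

# Same decision as the original script: fail (return 1) when the
# "<server address>.offline" marker file exists, otherwise log and pass.
check_offline_marker() {
    local rip=$1 marker_dir=$2 logfile=$3
    if test -e "$marker_dir/$rip.offline"; then
        return 1
    fi
    log_check "$rip" "$logfile"
    return 0
}
```

In a real check script this would be called with the third positional argument that haproxy passes (the server address), for example `check_offline_marker "$3" /usr/local/haproxy/healthchecks/PRODOFF_MANAGEMENT /usr/local/tmp/check; exit $?`.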

the debug output of the haproxy server

root@m8hasprod /usr/local/haproxy/conf> /usr/local/haproxy/bin/haproxy-2.9.5 -f /usr/local/haproxy/conf/haproxy_PRODOFF_MANAGEMENT_2486.m8hasprod.cfg -d
[NOTICE] (27997) : haproxy version is 2.9.5-260dbb8
[NOTICE] (27997) : path to executable is /usrlocal/haproxy/bin/haproxy-2.9.5
[WARNING] (27997) : config : parsing [/usr/local/haproxy/conf/haproxy_PRODOFF_MANAGEMENT_2486.m8hasprod.cfg:46] : backend 'bk_PRODOFF_MANAGEMENT' : 'option tcplog' directive is ignored in backends.
Available polling systems :
    evports : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use evports.

Available filters :
        [BWLIM] bwlim-in
        [BWLIM] bwlim-out
        [CACHE] cache
        [COMP] compression
        [FCGI] fcgi-app
        [SPOE] spoe
        [TRACE] trace
Using evports() as the polling mechanism.
[WARNING] (27997) : kill 107579
[WARNING] (27997) : Server bk_PRODOFF_MANAGEMENT/prod_hasoff_1 is DOWN, reason: External check timeout, code: 0, check duration: 10000ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] (27997) : kill 107606
[WARNING] (27997) : Server bk_PRODOFF_MANAGEMENT/prod_zaraoff_1 is DOWN, reason: External check timeout, code: 0, check duration: 10001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] (27997) : backend 'bk_PRODOFF_MANAGEMENT' has no server available!
[WARNING] (27997) : Server bk_PRODOFF_MANAGEMENT/prod_hasoff_1 is UP, reason: External check passed, code: 0, check duration: 10001ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] (27997) : Server bk_PRODOFF_MANAGEMENT/prod_zaraoff_1 is UP, reason: External check passed, code: 0, check duration: 10001ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.

Expected Behavior

The check should be executed only once every 10 seconds, not in an infinite loop.
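As a rough yardstick for "once every 10 seconds": with `inter 10s`, each checked server should produce about one check run per 10 seconds, so two servers observed for 60 seconds should add roughly 12 lines to the file the check script appends to. A small sketch of that arithmetic follows; the helper names `expected_checks` and `rate_looks_sane`, and the factor-of-two tolerance, are assumptions chosen for illustration.

```shell
# expected_checks SERVERS WINDOW_SECONDS INTER_SECONDS
# Approximate number of check runs expected in the observation window.
expected_checks() {
    echo $(( $1 * $2 / $3 ))
}

# rate_looks_sane OBSERVED EXPECTED
# Succeeds when the observed count is at most twice the expected count.
# The factor of two is an arbitrary tolerance, not an haproxy rule.
rate_looks_sane() {
    [ "$1" -le $(( $2 * 2 )) ]
}
```

For example, `expected_checks 2 60 10` prints `12`; a file that grew by thousands of lines in the same minute would fail `rate_looks_sane`.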

Steps to Reproduce the Behavior

set an external check command

Do you have any idea what may have caused this?

The cause is the external check. If I comment out the external check, haproxy doesn't use CPU anymore.

Do you have an idea how to solve the issue?

No response

What is your configuration?

part of the haproxy configuration

backend bk_PRODOFF_MANAGEMENT
  mode tcp
  balance first
  log global
  option tcplog
  default-server maxconn 5000
  option external-check
  external-check command /usr/local/haproxy/bin/checkFileExistPRODOFF_MANAGEMENT_test.bash
 # tcp-check connect port 2050
 # up must succeed 3 times
# for inter: 10s while testing, otherwise 30s
# offload servers still need to be placed in front !!!  for now default to PRODDEF
  default-server inter 10s rise 2 fall 1 on-marked-down shutdown-sessions
  server prod_hasoff_1 prodhasoff1:2402 weight 10 check inter 10s
  server prod_zaraoff_1 prodzaraoff1:2402 weight 8  check inter 10s

Output of haproxy -vv

root@m8hasprod /usr/local/haproxy/conf>  /usr/local/haproxy/bin/haproxy-2.9.5 -vv
HAProxy version 2.9.5-260dbb8 2024/02/15 - https://haproxy.org/
Status: stable branch - will stop receiving fixes around Q1 2025.
Known bugs: http://www.haproxy.org/bugs/bugs-2.9.5.html
Build options :
  TARGET  = solaris
  CPU     = ultrasparc
  CC      = /usr/bin/gcc
  CFLAGS  = -O6 -mcpu=v9 -mtune=ultrasparc -g -Wall -Wextra -Wundef -Wdeclaration-after-statement -Wfatal-errors -Wtype-limits -Wshift-negative-value -Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference -fwrapv -Wno-address-of-packed-member -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-clobbered -Wno-missing-field-initializers -Wno-cast-function-type -Wno-string-plus-int -Wno-atomic-alignment -DFD_SETSIZE=65536 -D_REENTRANT -D_XOPEN_SOURCE=600 -D__EXTENSIONS__
  OPTIONS = USE_OPENSSL=1
  DEBUG   = -DDEBUG_STRICT -DDEBUG_MEMORY_POOLS

Feature list : -51DEGREES -ACCEPT4 -BACKTRACE +CLOSEFROM -CPU_AFFINITY +CRYPT_H -DEVICEATLAS -DL -ENGINE -EPOLL +EVPORTS +GETADDRINFO -KQUEUE -LIBATOMIC +LIBCRYPT -LINUX_CAP -LINUX_SPLICE -LINUX_TPROXY -LUA -MATH -MEMORY_PROFILING -NETFILTER -NS +OBSOLETE_LINKER +OPENSSL -OPENSSL_AWSLC -OPENSSL_WOLFSSL -OT -PCRE -PCRE2 -PCRE2_JIT -PCRE_JIT +POLL -PRCTL -PROCCTL -PROMEX -PTHREAD_EMULATION -QUIC -QUIC_OPENSSL_COMPAT +RT -SHM_OPEN +SLZ +SSL -STATIC_PCRE -STATIC_PCRE2 -SYSTEMD -TFO +THREAD -THREAD_DUMP +TPROXY -WURFL -ZLIB

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_TGROUPS=16, MAX_THREADS=256, default=1).
Built with gcc compiler version 13.2.0
Encrypted password support via crypt(3): yes
Built without PCRE or PCRE2 support (using libc's regex instead)
Built with transparent proxy support using:
Built with libslz for stateless compression.
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with OpenSSL version : OpenSSL 1.0.2zi  1 Aug 2023
Running on OpenSSL version : OpenSSL 1.0.2zi  1 Aug 2023
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2

Available polling systems :
    evports : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use evports.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
  <default> : mode=TCP   side=FE|BE  mux=PASS  flags=
       none : mode=TCP   side=FE|BE  mux=PASS  flags=NO_UPG
  <default> : mode=HTTP  side=FE|BE  mux=H1    flags=HTX
         h1 : mode=HTTP  side=FE|BE  mux=H1    flags=HTX|NO_UPG
       fcgi : mode=HTTP  side=BE     mux=FCGI  flags=HTX|HOL_RISK|NO_UPG
         h2 : mode=HTTP  side=FE|BE  mux=H2    flags=HTX|HOL_RISK|NO_UPG

Available services : none

Available filters :
        [BWLIM] bwlim-in
        [BWLIM] bwlim-out
        [CACHE] cache
        [COMP] compression
        [FCGI] fcgi-app
        [SPOE] spoe
        [TRACE] trace

root@m8hasprod /usr/local/haproxy/conf>

Last Outputs and Backtraces

No response

Additional Information

No response

geerttouquet avatar Mar 11 '24 13:03 geerttouquet

I cannot reproduce on my Linux system. Maybe it is linked to evports polling system ? Can you try to rerun your binary with the extra argument -dv to disable it ? Also it may be interesting to have an overview of the rest of your config, or at least the global section.

a-denoyelle avatar Mar 11 '24 15:03 a-denoyelle

the complete configuration
global
        log 127.0.0.1   local6 notice
        user sa_haproxy
        group inf
        daemon
        external-check

        # Default SSL material locations
        # Default ciphers to use on SSL-enabled listening sockets.
        # For more information, see ciphers(1SSL). This list is from:
        #  https://hynek.me/articles/hardening-your-web-servers-ssl-ciphers/
#       ssl-default-bind-ciphers ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:ECDH+3DES:DH+3DES:RSA+AESGCM:RSA+AES:RSA+3DES:!aNULL:!MD5:!DSS
        # no ECDH because it is not supported on Solaris, though it will be in the future
        ssl-default-bind-ciphers kEDH+aRSA+AESGCM:kEDH+aRSA+AES
        ssl-default-bind-options no-sslv3
        tune.ssl.default-dh-param 2048
        ca-base /usr/local/haproxy/certs
        crt-base /usr/local/haproxy/private
#       nbproc 31
        maxconn 30000
defaults
        log     global
        log-tag PRODOFF_MANAGEMENT
        mode    tcp
        option  tcplog
        option  dontlognull
        timeout connect 120s
        timeout client  24h
        timeout server  24h

frontend ft_PRODOFF_MANAGEMENT
  mode tcp
  bind prodserverssl:2486 ssl crt PROD.pem
  bind 10.254.10.30:8286 ssl crt PROD.pem
  maxconn 30000
  option clitcpka
  log global
  option tcplog
  tcp-request inspect-delay 2s
  default_backend bk_PRODOFF_MANAGEMENT

backend bk_PRODOFF_MANAGEMENT
  mode tcp
  balance first
  log global
  option tcplog
  default-server maxconn 5000
 option external-check
  external-check command /usr/local/haproxy/bin/checkFileExistPRODOFF_MANAGEMENT.bash
 # tcp-check connect port 2050
 # up must succeed 3 times
# for inter: 10s while testing, otherwise 30s
# offload servers still need to be placed in front !!!  for now default to PRODDEF
  default-server inter 10s rise 2 fall 1 on-marked-down shutdown-sessions
  server prod_hasoff_1 prodhasoff1:2402 weight 10 check inter 10s
  server prod_zaraoff_1 prodzaraoff1:2402 weight 8 check inter 10s
  server prod_imhooff_1 prodimhooff1:2402 weight 6 check inter 10s
  server prod_ro prodro:4095 weight 5  backup


listen stats #Listen on localhost port 9000
   bind prodserverssl:9013
    mode http
    stats enable #Enable statistics
     stats refresh 10s
    # stats hide-version #Hide HAPRoxy version, a necessity for any public-facing site

 

geerttouquet avatar Mar 11 '24 16:03 geerttouquet

/usr/local/haproxy/bin/haproxy-2.9.5 -f /usr/local/haproxy/conf/haproxy_PRODOFF_MANAGEMENT_2486.m8hasprod.cfg -dv

geerttouquet avatar Mar 11 '24 16:03 geerttouquet

Sorry your last comment appears to be incomplete. Did you try to rerun this command-line with the extra -dv ? If so, did you observe the same behavior or not regarding external-check ?

a-denoyelle avatar Mar 11 '24 16:03 a-denoyelle

I ran with option -dv and it works fine now: the check runs as described in the configuration file and doesn't loop anymore.

thanks for the very quick response
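Where changing the command line is inconvenient, the same workaround can presumably be set in the configuration instead: the haproxy configuration manual documents a `noevports` global keyword as the equivalent of `-dv`. A minimal sketch based on the global section posted above; only the `noevports` line is new, and it has not been tested on this setup.

```
global
        log 127.0.0.1   local6 notice
        user sa_haproxy
        group inf
        daemon
        external-check
        # assumption: not in the original config; disables the evports poller like -dv
        noevports
```

With evports disabled, haproxy should fall back to the next available poller, which is poll in the list shown by -vv above.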

geerttouquet avatar Mar 12 '24 13:03 geerttouquet

Thanks, but now we need to find out why. Reopening.

wtarreau avatar Mar 12 '24 14:03 wtarreau

If needed, I can test it

geerttouquet avatar Mar 13 '24 07:03 geerttouquet

We tried to reproduce the issue in a virtualized environment but to no avail for now. I used a solaris 10 environment but could not find anything superior to gcc 4.4. Due to several build incompatibilities, I was only able to build with DEBUG_THREAD deactivated, but as said I did not reproduce the issue. Can you try on your side to run with global option nbthread 1 and tell me if the issue is still present ?

a-denoyelle avatar Mar 14 '24 17:03 a-denoyelle

I have added the following to the configuration

global
        log 127.0.0.1   local6 notice
        user sa_haproxy
        group inf
        daemon
        external-check
        nbthread 1

The CPU time used by haproxy goes from 14 seconds to 48 seconds within 60 seconds. There are no requests being made on this proxy (see below).
root@m8hasprod ~> ps auxww |grep 2484;sleep 60; ps auxww |grep 2484
sa_hapro 89478 0.1 0.0 29936 18776 ? S 11:45:18 0:14 /usr/local/haproxy/bin/haproxy-2.9.5 -f /usr/local/haproxy/conf/haproxy_PRODOFF_MANAGEMENT_2484.m8hasprod_aangepast.cfg
root 92769 0.0 0.0 2800 2208 pts/4 S 11:45:42 0:00 grep 2484
sa_hapro 89478 0.1 0.0 29936 14680 ? O 11:45:18 0:48 /usr/local/haproxy/bin/haproxy-2.9.5 -f /usr/local/haproxy/conf/haproxy_PRODOFF_MANAGEMENT_2484.m8hasprod_aangepast.cfg
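The two TIME values in that ps output (0:14 and 0:48, sampled 60 seconds apart) translate into roughly how much of one CPU the idle proxy is burning. A small sketch of the calculation; the helper names `cpu_seconds` and `cpu_percent` are hypothetical, and only the `M:SS` form of the ps TIME column is handled.

```shell
# Convert a ps TIME value of the form M:SS (e.g. "0:48") to seconds.
# Assumption: longer forms such as H:MM:SS are not handled.
cpu_seconds() {
    local min=${1%%:*} sec=${1##*:}
    # 10# forces base 10 so values like "08" are not parsed as octal.
    echo $(( 10#$min * 60 + 10#$sec ))
}

# cpu_percent BEFORE AFTER INTERVAL_SECONDS
# Percentage of one CPU used between two samples (integer division).
cpu_percent() {
    local before after
    before=$(cpu_seconds "$1")
    after=$(cpu_seconds "$2")
    echo $(( (after - before) * 100 / $3 ))
}
```

With the numbers above, `cpu_percent 0:14 0:48 60` prints `56`, i.e. a little over half of one CPU with no traffic on the proxy.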

geerttouquet avatar Mar 15 '24 10:03 geerttouquet

Thanks. Now we've reinstalled a dual-CPU sparc under solaris 10. Hopefully we'll be closer to your environment to try to figure what could be happening.

wtarreau avatar Apr 12 '24 07:04 wtarreau

I don't understand yet what's happening, but at first glance, when there's a single server, it's checked at the right speed, and when there is more than one, they're checked in loops. I really don't understand :-/ I suspect that something is kept between both checks but it's hard to figure out what. And indeed, starting with -dv makes the problem go away!

wtarreau avatar Apr 12 '24 13:04 wtarreau

What matters is that now we have a local reproducer ;-)

wtarreau avatar Apr 12 '24 13:04 wtarreau

good news that you can reproduce it :-)

geerttouquet avatar Apr 12 '24 14:04 geerttouquet

Hi! I finally found the issue and fixed it. It has been bogus since day one of ev_ports, though in the author's defense, the API is awkward. I once again forgot to mention the issue number in the commit message. You can pick commit 36d92dcd9b right now, but it will be backported. If you're dealing with moderate loads: I also found what looked like a leftover from a debugging session in that code, which limits the number of events handled per polling loop to 1 fd and can consume quite a lot of CPU even at moderate loads. I addressed it in this commit: e6662bf706 (which I don't intend to backport, as it could possibly uncover other issues that were hidden because of this limitation).

wtarreau avatar Apr 17 '24 15:04 wtarreau

2.0 is EOL, I'm closing now.

capflam avatar Jul 05 '24 07:07 capflam