haproxy
haproxy copied to clipboard
haproxy 2.9.5 (solaris) external-check command go in infinite loop
Detailed Description of the Problem
If I start the haproxy server with an external check, the check will run in infinite loop. It is running on solaris Remark haproxy server 1.6.5 don't have this problem
my exteranl script root@m8hasprod /usr/local/haproxy/conf> more /usr/local/haproxy/bin/checkFileExistPRODOFF_MANAGEMENT_test.bash #!/usr/bin/bash VIP=$1 VPT=$2 RIP=$3 #vip en vpt not used file="/usr/local/haproxy/healthchecks/PRODOFF_MANAGEMENT/"$RIP".offline" if test -e $file; then exit 1 fi echo 1 >> /usr/local/tmp/check exit 0 haproxy configuration : see below
in, the file /usr/local/tmp/check , there will be many 1 in the file (inifinte loop-
the debug output of the haproxy server
root@m8hasprod /usr/local/haproxy/conf> /usr/local/haproxy/bin/haproxy-2.9.5 -f /usr/local/haproxy/conf/haproxy_PRODOFF_MANAGEMENT_2486.m8hasprod.cfg -d [NOTICE] (27997) : haproxy version is 2.9.5-260dbb8 [NOTICE] (27997) : path to executable is /usrlocal/haproxy/bin/haproxy-2.9.5 [WARNING] (27997) : config : parsing [/usr/local/haproxy/conf/haproxy_PRODOFF_MANAGEMENT_2486.m8hasprod.cfg:46] : backend 'bk_PRODOFF_MANAGEMENT' : 'option tcplog' directive is ignored in backends. Available polling systems : evports : pref=300, test result OK poll : pref=200, test result OK select : pref=150, test result OK Total: 3 (3 usable), will use evports.
Available filters : [BWLIM] bwlim-in [BWLIM] bwlim-out [CACHE] cache [COMP] compression [FCGI] fcgi-app [SPOE] spoe [TRACE] trace Using evports() as the polling mechanism. [WARNING] (27997) : kill 107579 [WARNING] (27997) : Server bk_PRODOFF_MANAGEMENT/prod_hasoff_1 is DOWN, reason: External check timeout, code: 0, check duration: 10000ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue. [WARNING] (27997) : kill 107606 [WARNING] (27997) : Server bk_PRODOFF_MANAGEMENT/prod_zaraoff_1 is DOWN, reason: External check timeout, code: 0, check duration: 10001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue. [ALERT] (27997) : backend 'bk_PRODOFF_MANAGEMENT' has no server available! [WARNING] (27997) : Server bk_PRODOFF_MANAGEMENT/prod_hasoff_1 is UP, reason: External check passed, code: 0, check duration: 10001ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue. [WARNING] (27997) : Server bk_PRODOFF_MANAGEMENT/prod_zaraoff_1 is UP, reason: External check passed, code: 0, check duration: 10001ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Expected Behavior
The check will be only executed once every 10 seconds and not in an infinite loop
Steps to Reproduce the Behavior
set an external check command
Do you have any idea what may have caused this?
the cause is the external check, If I comment out the external check, it don't use cpu anymory
Do you have an idea how to solve the issue?
No response
What is your configuration?
part of the haproxy configuration
backend bk_PRODOFF_MANAGEMENT
mode tcp
balance first
log global
option tcplog
default-server maxconn 5000
option external-check
external-check command /usr/local/haproxy/bin/checkFileExistPRODOFF_MANAGEMENT_test.bash
# tcp-check connect port 2050
# up moet 3 keer goed zijn
# voor inter test 10s anders 30s
# offload servers moeten er nog voor staan !!! nu default naar PRODDEF
default-server inter 10s rise 2 fall 1 on-marked-down shutdown-sessions
server prod_hasoff_1 prodhasoff1:2402 weight 10 check inter 10s
server prod_zaraoff_1 prodzaraoff1:2402 weight 8 check inter 10s
Output of haproxy -vv
root@m8hasprod /usr/local/haproxy/conf> /usr/local/haproxy/bin/haproxy-2.9.5 -vv
HAProxy version 2.9.5-260dbb8 2024/02/15 - https://haproxy.org/
Status: stable branch - will stop receiving fixes around Q1 2025.
Known bugs: http://www.haproxy.org/bugs/bugs-2.9.5.html
Build options :
TARGET = solaris
CPU = ultrasparc
CC = /usr/bin/gcc
CFLAGS = -O6 -mcpu=v9 -mtune=ultrasparc -g -Wall -Wextra -Wundef -Wdeclaration-after-statement -Wfatal-errors -Wtype-limits -Wshift-negative-value -Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference -fwrapv -Wno-address-of-packed-member -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-clobbered -Wno-missing-field-initializers -Wno-cast-function-type -Wno-string-plus-int -Wno-atomic-alignment -DFD_SETSIZE=65536 -D_REENTRANT -D_XOPEN_SOURCE=600 -D__EXTENSIONS__
OPTIONS = USE_OPENSSL=1
DEBUG = -DDEBUG_STRICT -DDEBUG_MEMORY_POOLS
Feature list : -51DEGREES -ACCEPT4 -BACKTRACE +CLOSEFROM -CPU_AFFINITY +CRYPT_H -DEVICEATLAS -DL -ENGINE -EPOLL +EVPORTS +GETADDRINFO -KQUEUE -LIBATOMIC +LIBCRYPT -LINUX_CAP -LINUX_SPLICE -LINUX_TPROXY -LUA -MATH -MEMORY_PROFILING -NETFILTER -NS +OBSOLETE_LINKER +OPENSSL -OPENSSL_AWSLC -OPENSSL_WOLFSSL -OT -PCRE -PCRE2 -PCRE2_JIT -PCRE_JIT +POLL -PRCTL -PROCCTL -PROMEX -PTHREAD_EMULATION -QUIC -QUIC_OPENSSL_COMPAT +RT -SHM_OPEN +SLZ +SSL -STATIC_PCRE -STATIC_PCRE2 -SYSTEMD -TFO +THREAD -THREAD_DUMP +TPROXY -WURFL -ZLIB
Default settings :
bufsize = 16384, maxrewrite = 1024, maxpollevents = 200
Built with multi-threading support (MAX_TGROUPS=16, MAX_THREADS=256, default=1).
Built with gcc compiler version 13.2.0
Encrypted password support via crypt(3): yes
Built without PCRE or PCRE2 support (using libc's regex instead)
Built with transparent proxy support using:
Built with libslz for stateless compression.
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with OpenSSL version : OpenSSL 1.0.2zi 1 Aug 2023
Running on OpenSSL version : OpenSSL 1.0.2zi 1 Aug 2023
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2
Available polling systems :
evports : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use evports.
Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
<default> : mode=TCP side=FE|BE mux=PASS flags=
none : mode=TCP side=FE|BE mux=PASS flags=NO_UPG
<default> : mode=HTTP side=FE|BE mux=H1 flags=HTX
h1 : mode=HTTP side=FE|BE mux=H1 flags=HTX|NO_UPG
fcgi : mode=HTTP side=BE mux=FCGI flags=HTX|HOL_RISK|NO_UPG
h2 : mode=HTTP side=FE|BE mux=H2 flags=HTX|HOL_RISK|NO_UPG
Available services : none
Available filters :
[BWLIM] bwlim-in
[BWLIM] bwlim-out
[CACHE] cache
[COMP] compression
[FCGI] fcgi-app
[SPOE] spoe
[TRACE] trace
root@m8hasprod /usr/local/haproxy/conf>
Last Outputs and Backtraces
No response
Additional Information
No response
I cannot reproduce on my Linux system. Maybe it is linked to evports polling system ? Can you try to rerun your binary with the extra argument -dv
to disable it ? Also it may be interesting to have an overview of the rest of your config, or at least the global section.
the complete configuration
global
log 127.0.0.1 local6 notice
user sa_haproxy
group inf
daemon
external-check
# Default SSL material locations
# Default ciphers to use on SSL-enabled listening sockets.
# For more information, see ciphers(1SSL). This list is from:
# https://hynek.me/articles/hardening-your-web-servers-ssl-ciphers/
# ssl-default-bind-ciphers ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:ECDH+3DES:DH+3DES:RSA+AESGCM:RSA+AES:RSA+3DES:!aNULL:!MD5:!D
SS
# geen ECDH omdat niet ondersteund is op solaris en wel in de toekomst
ssl-default-bind-ciphers kEDH+aRSA+AESGCM:kEDH+aRSA+AES
ssl-default-bind-options no-sslv3
tune.ssl.default-dh-param 2048
ca-base /usr/local/haproxy/certs
crt-base /usr/local/haproxy/private
# nbproc 31
maxconn 30000
defaults
log global
log-tag PRODOFF_MANAGEMENT
mode tcp
option tcplog
option dontlognull
timeout connect 120s
timeout client 24h
timeout server 24h
frontend ft_PRODOFF_MANAGEMENT
mode tcp
bind prodserverssl:2486 ssl crt PROD.pem
bind 10.254.10.30:8286 ssl crt PROD.pem
maxconn 30000
option clitcpka
log global
option tcplog
tcp-request inspect-delay 2s
default_backend bk_PRODOFF_MANAGEMENT
backend bk_PRODOFF_MANAGEMENT
mode tcp
balance first
log global
option tcplog
default-server maxconn 5000
option external-check
external-check command /usr/local/haproxy/bin/checkFileExistPRODOFF_MANAGEMENT.bash
# tcp-check connect port 2050
# up moet 3 keer goed zijn
# voor inter test 10s anders 30s
# offload servers moeten er nog voor staan !!! nu default naar PRODDEF
default-server inter 10s rise 2 fall 1 on-marked-down shutdown-sessions
server prod_hasoff_1 prodhasoff1:2402 weight 10 check inter 10s
server prod_zaraoff_1 prodzaraoff1:2402 weight 8 check inter 10s
server prod_imhooff_1 prodimhooff1:2402 weight 6 check inter 10s
server prod_ro prodro:4095 weight 5 backup
listen stats #Listen on localhost port 9000
bind prodserverssl:9013
mode http
stats enable #Enable statistics
stats refresh 10s
# stats hide-version #Hide HAPRoxy version, a necessity for any public-facing site
/usr/local/haproxy/bin/haproxy-2.9.5 -f /usr/local/haproxy/conf/haproxy_PRODOFF_MANAGEMENT_2486.m8hasprod.cfg -dv
Sorry your last comment appears to be incomplete. Did you try to rerun this command-line with the extra -dv
? If so, did you observe the same behavior or not regarding external-check ?
I have run with option -dv and it works fine now the check will be done as described in the configuration file it don't loop anymore
thanks for the very quick response
Thanks, but now we need to find out why. Reopening.
If needed, I can test it
We tried to reproduce the issue in a virtualized environment but to no avail for now. I used a solaris 10 environment but could not find anything superior to gcc 4.4. Due to several build incompatibilities, I was only able to build with DEBUG_THREAD
deactivated, but as said I did not reproduce the issue. Can you try on your side to run with global option nbthread 1
and tell me if the issue is still present ?
I have add the following in the configuration
global
log 127.0.0.1 local6 notice
user sa_haproxy
group inf
daemon
external-check
nbthread 1
the cpu goes from 14 seconds to 48 seconds cpu used in 60 seconds . There are no request done on this proxy (see below) root@m8hasprod ~> ps auxww |grep 2484;sleep 60; ps auxww |grep 2484 sa_hapro 89478 0.1 0.0 29936 18776 ? S 11:45:18 0:14 /usr/local/haproxy/bin/haproxy-2.9.5 -f /usr/local/haproxy/conf/haproxy_PRODOFF_MANAGEMENT_2484.m8hasprod_aangepast.cfg root 92769 0.0 0.0 2800 2208 pts/4 S 11:45:42 0:00 grep 2484 sa_hapro 89478 0.1 0.0 29936 14680 ? O 11:45:18 0:48 /usr/local/haproxy/bin/haproxy-2.9.5 -f /usr/local/haproxy/conf/haproxy_PRODOFF_MANAGEMENT_2484.m8hasprod_aangepast.cfg
Thanks. Now we've reinstalled a dual-CPU sparc under solaris 10. Hopefully we'll be closer to your environment to try to figure what could be happening.
I don't understnad yet what's happening, but at first glance, when there's a single server, it's checked at the right speed, and when there is more than one, they're checked in loops. I really don't understand :-/ I suspect that something is kept between both checks but it's hard to figure what. And indeed, starting with -dv makes the problem go away!
What matters is that now we have a local reproducer ;-)
good news that you can reproduce it :-)
Hi! I could finally find the issue and fix it. It has been bogus since day one of ev_ports, but for the author's defense, the API is awkward. I just again forgot to mention the issue number in the commit message. You can pick commit 36d92dcd9b right now, but it will be backported. If you're dealing with moderate loads, I found what looked like a leftover of a debugging session in that code, that limits to 1 fd the number of events handled per polling loop, which can consume quite a bunch of CPU even at moderate loads. I addressed it by this commit: e6662bf706 (which I don't intend to backport as it could possibly uncover other issues that were hidden because of this limitation).
2.0 is EOL, I'm closing now.