All threads hung suddenly
### Detailed Description of the Problem
Dual-socket Intel mainboard, bare metal, with HAProxy running in the official Docker image overlaid with custom volumes. It hangs every few hours. We found no traffic pattern, only some earlier (not immediately preceding) handshake failures that we don't know are related (error:0A0000C1:SSL routines ...), as they don't show up just before the hang.
We set cgroups on the Docker container to pin the CPUs to one of the sockets. Same behaviour. We also upgraded from 2.9-alpine to 3.0.4-alpine. Same behaviour.
### Expected Behavior
It should not crash every few hours.
### Steps to Reproduce the Behavior
Not sure; we run the same setup on maybe 40 production systems and it only hangs on this dual-socket bare-metal one.
### Do you have any idea what may have caused this?
The dual-socket bare metal was our first suspicion, as it's the only production system with dual sockets. But after pinning the CPUs we are not sure about that. Now we suspect the SSL cipher logs like the ones below, but they don't always appear just before the hang either, and we think HAProxy should not hang because of that:
{"time": "[09/Sep/2024:17:55:06.806]", "src":"xxx.yyy.zzz.dddd", "method":"-", "status": "400", "uri":"-", "backend":"<NOSRV>", "blk":"-"}
aaa.bbb.ccc.ddd:37726 [09/Sep/2024:17:55:48.786] fe_secured/1: SSL handshake failure
aaa.bbb.ccc.ddd:37734 [09/Sep/2024:17:55:49.013] fe_secured/1: SSL handshake failure
aaa.bbb.ccc.ddd:37748 [09/Sep/2024:17:55:49.247] fe_secured/1: SSL handshake failure
aaa.bbb.ccc.ddd:37762 [09/Sep/2024:17:55:49.490] fe_secured/1: SSL handshake failure (error:0A0000C1:SSL routines::no shared cipher)
aaa.bbb.ccc.ddd:37768 [09/Sep/2024:17:55:49.723] fe_secured/1: SSL handshake failure
aaa.bbb.ccc.ddd:37782 [09/Sep/2024:17:55:49.957] fe_secured/1: SSL handshake failure (error:0A000102:SSL routines::unsupported protocol)
aaa.bbb.ccc.ddd:37792 [09/Sep/2024:17:55:50.181] fe_secured/1: SSL handshake failure
aaa.bbb.ccc.ddd:37802 [09/Sep/2024:17:55:50.399] fe_secured/1: SSL handshake failure
aaa.bbb.ccc.ddd:37812 [09/Sep/2024:17:55:50.616] fe_secured/1: SSL handshake failure (error:0A0000C1:SSL routines::no shared cipher)
aaa.bbb.ccc.ddd:37828 [09/Sep/2024:17:55:50.842] fe_secured/1: SSL handshake failure (error:0A00006C:SSL routines::bad key share)
{"time": "[09/Sep/2024:17:57:07.236]", "src":"xxx.yyy.zzz.dddd", "method":"-", "status": "400", "uri":"-", "backend":"<NOSRV>", "blk":"-"}
{"time": "[09/Sep/2024:17:57:10.013]", "src":"xxx.yyy.zzz.dddd", "method":"-", "status": "400", "uri":"-", "backend":"<NOSRV>", "blk":"-"}
{"time": "[09/Sep/2024:17:59:10.442]", "src":"xxx.yyy.zzz.dddd", "method":"-", "status": "400", "uri":"-", "backend":"<NOSRV>", "blk":"-"}
{"time": "[09/Sep/2024:17:59:13.239]", "src":"xxx.yyy.zzz.dddd", "method":"-", "status": "400", "uri":"-", "backend":"<NOSRV>", "blk":"-"}
### Do you have an idea how to solve the issue?
We're opening this issue after exhausting our own knowledge, so no.
### What is your configuration?
It's built at runtime from a base config (https://gitlab.com/isard/isardvdi/-/tree/main/docker/haproxy/cfg/_base?ref_type=heads) and a portal config (https://gitlab.com/isard/isardvdi/-/tree/main/docker/haproxy/cfg/portal?ref_type=heads).
### Output of haproxy -vv
/var/lib/haproxy # haproxy -vv
HAProxy version 3.0.4-7a59afa 2024/09/03 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2 2029.
Known bugs: http://www.haproxy.org/bugs/bugs-3.0.4.html
Running on: Linux 5.4.0-96-generic #109-Ubuntu SMP Wed Jan 12 16:49:16 UTC 2022 x86_64
Build options :
TARGET = linux-musl
CC = cc
CFLAGS = -O2 -g -fwrapv
OPTIONS = USE_GETADDRINFO=1 USE_OPENSSL=1 USE_LUA=1 USE_PROMEX=1 USE_PCRE2=1 USE_PCRE2_JIT=1
DEBUG =
Feature list : -51DEGREES +ACCEPT4 -BACKTRACE -CLOSEFROM +CPU_AFFINITY +CRYPT_H -DEVICEATLAS +DL -ENGINE +EPOLL -EVPORTS +GETADDRINFO -KQUEUE -LIBATOMIC +LIBCRYPT +LINUX_CAP +LINUX_SPLICE +LINUX_TPROXY +LUA +MATH -MEMORY_PROFILING +NETFILTER +NS -OBSOLETE_LINKER +OPENSSL -OPENSSL_AWSLC -OPENSSL_WOLFSSL -OT -PCRE +PCRE2 +PCRE2_JIT -PCRE_JIT +POLL +PRCTL -PROCCTL +PROMEX -PTHREAD_EMULATION -QUIC -QUIC_OPENSSL_COMPAT +RT +SHM_OPEN +SLZ +SSL -STATIC_PCRE -STATIC_PCRE2 -SYSTEMD +TFO +THREAD +THREAD_DUMP +TPROXY -WURFL -ZLIB
Default settings :
bufsize = 16384, maxrewrite = 1024, maxpollevents = 200
Built with multi-threading support (MAX_TGROUPS=16, MAX_THREADS=256, default=11).
Built with OpenSSL version : OpenSSL 3.3.2 3 Sep 2024
Running on OpenSSL version : OpenSSL 3.3.2 3 Sep 2024
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
OpenSSL providers loaded : default
Built with Lua version : Lua 5.4.6
Built with the Prometheus exporter as a service
Built with network namespace support.
Built with libslz for stateless compression.
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE2 version : 10.43 2024-02-16
PCRE2 library supports JIT : yes
Encrypted password support via crypt(3): yes
Built with gcc compiler version 13.2.1 20240309
Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.
Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
h2 : mode=HTTP side=FE|BE mux=H2 flags=HTX|HOL_RISK|NO_UPG
<default> : mode=HTTP side=FE|BE mux=H1 flags=HTX
h1 : mode=HTTP side=FE|BE mux=H1 flags=HTX|NO_UPG
fcgi : mode=HTTP side=BE mux=FCGI flags=HTX|HOL_RISK|NO_UPG
<default> : mode=TCP side=FE|BE mux=PASS flags=
none : mode=TCP side=FE|BE mux=PASS flags=NO_UPG
Available services : prometheus-exporter
Available filters :
[BWLIM] bwlim-in
[BWLIM] bwlim-out
[CACHE] cache
[COMP] compression
[FCGI] fcgi-app
[SPOE] spoe
[TRACE] trace
### Last Outputs and Backtraces
```plain
{"time": "[09/Sep/2024:17:55:06.806]", "src":"xxx.yyy.zzz.dddd", "method":"-", "status": "400", "uri":"-", "backend":"<NOSRV>", "blk":"-"}
aaa.bbb.ccc.ddd:37726 [09/Sep/2024:17:55:48.786] fe_secured/1: SSL handshake failure
aaa.bbb.ccc.ddd:37734 [09/Sep/2024:17:55:49.013] fe_secured/1: SSL handshake failure
aaa.bbb.ccc.ddd:37748 [09/Sep/2024:17:55:49.247] fe_secured/1: SSL handshake failure
aaa.bbb.ccc.ddd:37762 [09/Sep/2024:17:55:49.490] fe_secured/1: SSL handshake failure (error:0A0000C1:SSL routines::no shared cipher)
aaa.bbb.ccc.ddd:37768 [09/Sep/2024:17:55:49.723] fe_secured/1: SSL handshake failure
aaa.bbb.ccc.ddd:37782 [09/Sep/2024:17:55:49.957] fe_secured/1: SSL handshake failure (error:0A000102:SSL routines::unsupported protocol)
aaa.bbb.ccc.ddd:37792 [09/Sep/2024:17:55:50.181] fe_secured/1: SSL handshake failure
aaa.bbb.ccc.ddd:37802 [09/Sep/2024:17:55:50.399] fe_secured/1: SSL handshake failure
aaa.bbb.ccc.ddd:37812 [09/Sep/2024:17:55:50.616] fe_secured/1: SSL handshake failure (error:0A0000C1:SSL routines::no shared cipher)
aaa.bbb.ccc.ddd:37828 [09/Sep/2024:17:55:50.842] fe_secured/1: SSL handshake failure (error:0A00006C:SSL routines::bad key share)
{"time": "[09/Sep/2024:17:57:07.236]", "src":"xxx.yyy.zzz.dddd", "method":"-", "status": "400", "uri":"-", "backend":"<NOSRV>", "blk":"-"}
{"time": "[09/Sep/2024:17:57:10.013]", "src":"xxx.yyy.zzz.dddd", "method":"-", "status": "400", "uri":"-", "backend":"<NOSRV>", "blk":"-"}
{"time": "[09/Sep/2024:17:59:10.442]", "src":"xxx.yyy.zzz.dddd", "method":"-", "status": "400", "uri":"-", "backend":"<NOSRV>", "blk":"-"}
{"time": "[09/Sep/2024:17:59:13.239]", "src":"xxx.yyy.zzz.dddd", "method":"-", "status": "400", "uri":"-", "backend":"<NOSRV>", "blk":"-"}
Thread 6 is about to kill the process.
Thread 1 : id=0x7f846acf6b80 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/1 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=17392664141 now=17392672661 diff=8520
curr_task=0
Thread 2 : id=0x7f8462a86b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/2 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=17863712721 now=17863738819 diff=26098
curr_task=0
Thread 3 : id=0x7f8462a60b30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/3 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=19456346062 now=19456357955 diff=11893
curr_task=0
Thread 4 : id=0x7f8462a3ab30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/4 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=19854456768 now=19854481308 diff=24540
curr_task=0
Thread 5 : id=0x7f8462a14b30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/5 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=19268374453 now=19268387842 diff=13389
curr_task=0
*>Thread 6 : id=0x7f84629eeb30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/6 stuck=1 prof=0 harmless=0 isolated=0
cpu_ns: poll=15973715134 now=18057967888 diff=2084252754
curr_task=0
Thread 7 : id=0x7f84629c8b30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/7 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=17073054816 now=17073073231 diff=18415
curr_task=0
>Thread 8 : id=0x7f8462991b30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/8 stuck=1 prof=0 harmless=0 isolated=0
cpu_ns: poll=15686555906 now=17771051465 diff=2084495559
curr_task=0
Thread 9 : id=0x7f846296bb30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/9 stuck=0 prof=0 harmless=1 isolated=0
cpu_ns: poll=20340321487 now=20340342848 diff=21361
curr_task=0
Thread 10: id=0x7f8462544b30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/10 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=23831957954 now=23831978540 diff=20586
curr_task=0
Thread 11: id=0x7f846251eb30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/11 stuck=0 prof=0 harmless=1 isolated=0
cpu_ns: poll=19418388549 now=19418450605 diff=62056
curr_task=0
Thread 12: id=0x7f84624f8b30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/12 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=18852556448 now=18852582158 diff=25710
curr_task=0
Thread 13: id=0x7f84624d2b30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/13 stuck=0 prof=0 harmless=1 isolated=0
cpu_ns: poll=20101980935 now=20102007335 diff=26400
curr_task=0
Thread 14: id=0x7f84624acb30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/14 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=19492749587 now=19492768489 diff=18902
curr_task=0
Thread 15: id=0x7f8462486b30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/15 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=20144955454 now=20144961597 diff=6143
curr_task=0
Thread 16: id=0x7f8462460b30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/16 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=21167868117 now=21167901675 diff=33558
curr_task=0
Thread 17: id=0x7f846243ab30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/17 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=20664332446 now=20664345402 diff=12956
curr_task=0
Thread 18: id=0x7f8462414b30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/18 stuck=0 prof=0 harmless=1 isolated=0
cpu_ns: poll=17759715772 now=17759751273 diff=35501
curr_task=0
Thread 19: id=0x7f84623eeb30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/19 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=16155798108 now=16155807164 diff=9056
curr_task=0
Thread 20: id=0x7f84623c8b30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/20 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=19482348556 now=19482359025 diff=10469
curr_task=0
Thread 21: id=0x7f84623a2b30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/21 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=18404493444 now=18404510247 diff=16803
curr_task=0
Thread 22: id=0x7f846237cb30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/22 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=18810191259 now=18810204098 diff=12839
curr_task=0
Thread 23: id=0x7f8462356b30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/23 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=18009091871 now=18009106545 diff=14674
curr_task=0
Thread 24: id=0x7f8462330b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/24 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=19558687168 now=19558712826 diff=25658
curr_task=0
Thread 25: id=0x7f846230ab30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/25 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=16159602060 now=16159621167 diff=19107
curr_task=0
Thread 26: id=0x7f84622e4b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/26 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=18469737155 now=18469773656 diff=36501
curr_task=0
Thread 27: id=0x7f84622beb30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/27 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=18019207168 now=18019233557 diff=26389
curr_task=0
Thread 28: id=0x7f8462298b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/28 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=19019248126 now=19019259698 diff=11572
curr_task=0
Thread 29: id=0x7f8462272b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/29 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=26638625789 now=26638643994 diff=18205
curr_task=0
Thread 30: id=0x7f846224cb30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/30 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=19261693721 now=19261713019 diff=19298
curr_task=0
Thread 31: id=0x7f8462226b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/31 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=17499073586 now=17499109075 diff=35489
curr_task=0
Thread 32: id=0x7f8462200b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/32 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=18314061510 now=18314088676 diff=27166
curr_task=0
Thread 33: id=0x7f84621dab30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/33 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=17696132502 now=17696144613 diff=12111
curr_task=0
Thread 34: id=0x7f84621b4b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/34 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=17016557217 now=17016580036 diff=22819
curr_task=0
Thread 35: id=0x7f846218eb30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/35 stuck=0 prof=0 harmless=1 isolated=0
cpu_ns: poll=20119763935 now=20119791822 diff=27887
curr_task=0
Thread 36: id=0x7f8462168b30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/36 stuck=0 prof=0 harmless=1 isolated=0
cpu_ns: poll=11626890042 now=11626955439 diff=65397
curr_task=0
>Thread 37: id=0x7f8462142b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/37 stuck=1 prof=0 harmless=0 isolated=0
cpu_ns: poll=17423161966 now=19512142176 diff=2088980210
curr_task=0
Thread 38: id=0x7f846211cb30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/38 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=15240226093 now=15240254815 diff=28722
curr_task=0
Thread 39: id=0x7f84620f6b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/39 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=18720872406 now=18720886731 diff=14325
curr_task=0
Thread 40: id=0x7f84620d0b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/40 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=18789025637 now=18789047975 diff=22338
curr_task=0
Thread 41: id=0x7f84620aab30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/41 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=11290917542 now=11290946700 diff=29158
curr_task=0
Thread 42: id=0x7f8462084b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/42 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=18609473473 now=18609492218 diff=18745
curr_task=0
Thread 43: id=0x7f846205eb30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/43 stuck=0 prof=0 harmless=1 isolated=0
cpu_ns: poll=12052830268 now=12053296699 diff=466431
curr_task=0
Thread 44: id=0x7f8462038b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/44 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=19073303400 now=19073314746 diff=11346
curr_task=0
Thread 45: id=0x7f8462012b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/45 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=19434380526 now=19434393058 diff=12532
curr_task=0
Thread 46: id=0x7f8461fecb30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/46 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=17838093849 now=17838113517 diff=19668
curr_task=0
Thread 47: id=0x7f8461fc6b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/47 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=19530099046 now=19530112927 diff=13881
curr_task=0
Thread 48: id=0x7f8461fa0b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/48 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=19485675666 now=19485710513 diff=34847
curr_task=0
Thread 49: id=0x7f8461f7ab30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/49 stuck=0 prof=0 harmless=1 isolated=0
cpu_ns: poll=10665383895 now=10666806257 diff=1422362
curr_task=0
Thread 50: id=0x7f8461f54b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/50 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=19785081140 now=19785089174 diff=8034
curr_task=0
>Thread 51: id=0x7f8461f2eb30 act=1 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/51 stuck=1 prof=0 harmless=0 isolated=0
cpu_ns: poll=13705601968 now=15820069079 diff=2114467111
curr_task=0x7f846963b370 (task) calls=16245 last=0
fct=0x56159d719040(process_resolvers) ctx=0x7f846aaabf20
Thread 52: id=0x7f8461f08b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/52 stuck=0 prof=0 harmless=1 isolated=0
cpu_ns: poll=18805003927 now=18805039287 diff=35360
curr_task=0
Thread 53: id=0x7f8461ee2b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/53 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=26693809734 now=26693827871 diff=18137
curr_task=0
Thread 54: id=0x7f8461ebcb30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/54 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=19326259217 now=19326275294 diff=16077
curr_task=0
Thread 55: id=0x7f8461e96b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/55 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=19740827281 now=19740836356 diff=9075
curr_task=0
Thread 56: id=0x7f8461e70b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/56 stuck=0 prof=0 harmless=1 isolated=0
cpu_ns: poll=19101388072 now=19101413072 diff=25000
curr_task=0
Thread 57: id=0x7f8461e4ab30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/57 stuck=0 prof=0 harmless=1 isolated=0
cpu_ns: poll=19086979764 now=19087031411 diff=51647
curr_task=0
Thread 58: id=0x7f8461e24b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/58 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=19257566136 now=19257583947 diff=17811
curr_task=0
Thread 59: id=0x7f8461dfeb30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/59 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=25630271225 now=25630305838 diff=34613
curr_task=0
Thread 60: id=0x7f8461dd8b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/60 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=25187814040 now=25187824429 diff=10389
curr_task=0
Thread 61: id=0x7f8461db2b30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/61 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=17412300842 now=17412317232 diff=16390
curr_task=0
Thread 62: id=0x7f8461d8cb30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/62 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=19551231430 now=19551250701 diff=19271
curr_task=0
Thread 63: id=0x7f8461d66b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/63 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=19694079511 now=19694092589 diff=13078
curr_task=0
Thread 64: id=0x7f8461d40b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/64 stuck=0 prof=0 harmless=1 isolated=0
cpu_ns: poll=16717283510 now=16717316940 diff=33430
curr_task=0
HAProxy version 3.0.4-7a59afa 2024/09/03 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2 2029.
Known bugs: http://www.haproxy.org/bugs/bugs-3.0.4.html
Running on: Linux 5.4.0-96-generic #109-Ubuntu SMP Wed Jan 12 16:49:16 UTC 2022 x86_64
[NOTICE] (1) : haproxy version is 3.0.4-7a59afa
[ALERT] (1) : Current worker (38) exited with code 134 (Aborted)
[WARNING] (1) : A worker process unexpectedly died and this can only be explained by a bug in haproxy or its dependencies.
Please check that you are running an up to date and maintained version of haproxy and open a bug report.
[ALERT] (1) : exit-on-failure: killing every processes with SIGTERM
```
### Additional Information
_No response_
Sorry, the last logs were with unpinned CPUs, but it doesn't matter; we're pretty sure we will hit the same bug in a few hours now that it's pinned. I'll post it as soon as it happens.
On dual-socket systems, communication between the two sockets is much slower than local communication, causing massive unfairness that can result in some threads being totally stuck trying to obtain a lock and never managing to. You should absolutely use thread groups. Here you need two thread groups (thread-groups 2), and you'll need to manually bind the threads of the first group to the CPUs of the first socket, and similarly for the second group, using the cpu-map directive. Please check how the CPUs are spread across sockets using lscpu -e. I guess you'll have CPUs 0-15,32-47 on socket 1 (hence group 1) and 16-31,48-63 on socket 2 (hence group 2).
We already started some work to make this automatically configurable for the most common cases, but it's a long and tedious task to automate (many corner cases).
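A minimal sketch of what that could look like, assuming the topology guessed above and 64 threads split over the two groups (the actual CPU lists must be taken from your own lscpu -e output):
```
global
    nbthread 64
    thread-groups 2
    # group 1 -> CPUs of socket 1, group 2 -> CPUs of socket 2
    cpu-map auto:1/1-32 0-15,32-47
    cpu-map auto:2/1-32 16-31,48-63
```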
The dual socket was one of our main guesses. I see there is thread-groups to group the threads so each group runs on one of the sockets. But shouldn't the bug be avoided if we mapped the container CPUs to one of the sockets using cgroups?
This is our lscpu -e:
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ
0 0 0 0 0:0:0:0 yes 3200.0000 800.0000
1 0 0 1 1:1:1:0 yes 3200.0000 800.0000
2 0 0 2 2:2:2:0 yes 3200.0000 800.0000
3 0 0 3 3:3:3:0 yes 3200.0000 800.0000
4 0 0 4 4:4:4:0 yes 3200.0000 800.0000
5 0 0 5 5:5:5:0 yes 3200.0000 800.0000
6 0 0 6 6:6:6:0 yes 3200.0000 800.0000
7 0 0 7 7:7:7:0 yes 3200.0000 800.0000
8 0 0 8 8:8:8:0 yes 3200.0000 800.0000
9 0 0 9 9:9:9:0 yes 3200.0000 800.0000
10 0 0 10 10:10:10:0 yes 3200.0000 800.0000
11 0 0 11 11:11:11:0 yes 3200.0000 800.0000
12 0 0 12 12:12:12:0 yes 3200.0000 800.0000
13 0 0 13 13:13:13:0 yes 3200.0000 800.0000
14 0 0 14 14:14:14:0 yes 3200.0000 800.0000
15 0 0 15 15:15:15:0 yes 3200.0000 800.0000
16 0 0 16 16:16:16:0 yes 3200.0000 800.0000
17 0 0 17 17:17:17:0 yes 3200.0000 800.0000
18 0 0 18 18:18:18:0 yes 3200.0000 800.0000
19 0 0 19 19:19:19:0 yes 3200.0000 800.0000
20 0 0 20 20:20:20:0 yes 3200.0000 800.0000
21 0 0 21 21:21:21:0 yes 3200.0000 800.0000
22 0 0 22 22:22:22:0 yes 3200.0000 800.0000
23 0 0 23 23:23:23:0 yes 3200.0000 800.0000
24 0 0 24 24:24:24:0 yes 3200.0000 800.0000
25 0 0 25 25:25:25:0 yes 3200.0000 800.0000
26 0 0 26 26:26:26:0 yes 3200.0000 800.0000
27 0 0 27 27:27:27:0 yes 3200.0000 800.0000
28 0 0 28 28:28:28:0 yes 3200.0000 800.0000
29 0 0 29 29:29:29:0 yes 3200.0000 800.0000
30 0 0 30 30:30:30:0 yes 3200.0000 800.0000
31 0 0 31 31:31:31:0 yes 3200.0000 800.0000
32 1 1 32 32:32:32:1 yes 3200.0000 800.0000
33 1 1 33 33:33:33:1 yes 3200.0000 800.0000
34 1 1 34 34:34:34:1 yes 3200.0000 800.0000
35 1 1 35 35:35:35:1 yes 3200.0000 800.0000
36 1 1 36 36:36:36:1 yes 3200.0000 800.0000
37 1 1 37 37:37:37:1 yes 3200.0000 800.0000
38 1 1 38 38:38:38:1 yes 3200.0000 800.0000
39 1 1 39 39:39:39:1 yes 3200.0000 800.0000
40 1 1 40 40:40:40:1 yes 3200.0000 800.0000
41 1 1 41 41:41:41:1 yes 3200.0000 800.0000
42 1 1 42 42:42:42:1 yes 3200.0000 800.0000
43 1 1 43 43:43:43:1 yes 3200.0000 800.0000
44 1 1 44 44:44:44:1 yes 3200.0000 800.0000
45 1 1 45 45:45:45:1 yes 3200.0000 800.0000
46 1 1 46 46:46:46:1 yes 3200.0000 800.0000
47 1 1 47 47:47:47:1 yes 3200.0000 800.0000
48 1 1 48 48:48:48:1 yes 3200.0000 800.0000
49 1 1 49 49:49:49:1 yes 3200.0000 800.0000
50 1 1 50 50:50:50:1 yes 3200.0000 800.0000
51 1 1 51 51:51:51:1 yes 3200.0000 800.0000
52 1 1 52 52:52:52:1 yes 3200.0000 800.0000
53 1 1 53 53:53:53:1 yes 3200.0000 800.0000
54 1 1 54 54:54:54:1 yes 3200.0000 800.0000
55 1 1 55 55:55:55:1 yes 3200.0000 800.0000
56 1 1 56 56:56:56:1 yes 3200.0000 800.0000
57 1 1 57 57:57:57:1 yes 3200.0000 800.0000
58 1 1 58 58:58:58:1 yes 3200.0000 800.0000
59 1 1 59 59:59:59:1 yes 3200.0000 800.0000
60 1 1 60 60:60:60:1 yes 3200.0000 800.0000
61 1 1 61 61:61:61:1 yes 3200.0000 800.0000
62 1 1 62 62:62:62:1 yes 3200.0000 800.0000
63 1 1 63 63:63:63:1 yes 3200.0000 800.0000
64 0 0 0 0:0:0:0 yes 3200.0000 800.0000
65 0 0 1 1:1:1:0 yes 3200.0000 800.0000
66 0 0 2 2:2:2:0 yes 3200.0000 800.0000
67 0 0 3 3:3:3:0 yes 3200.0000 800.0000
68 0 0 4 4:4:4:0 yes 3200.0000 800.0000
69 0 0 5 5:5:5:0 yes 3200.0000 800.0000
70 0 0 6 6:6:6:0 yes 3200.0000 800.0000
71 0 0 7 7:7:7:0 yes 3200.0000 800.0000
72 0 0 8 8:8:8:0 yes 3200.0000 800.0000
73 0 0 9 9:9:9:0 yes 3200.0000 800.0000
74 0 0 10 10:10:10:0 yes 3200.0000 800.0000
75 0 0 11 11:11:11:0 yes 3200.0000 800.0000
76 0 0 12 12:12:12:0 yes 3200.0000 800.0000
77 0 0 13 13:13:13:0 yes 3200.0000 800.0000
78 0 0 14 14:14:14:0 yes 3200.0000 800.0000
79 0 0 15 15:15:15:0 yes 3200.0000 800.0000
80 0 0 16 16:16:16:0 yes 3200.0000 800.0000
81 0 0 17 17:17:17:0 yes 3200.0000 800.0000
82 0 0 18 18:18:18:0 yes 3200.0000 800.0000
83 0 0 19 19:19:19:0 yes 3200.0000 800.0000
84 0 0 20 20:20:20:0 yes 3200.0000 800.0000
85 0 0 21 21:21:21:0 yes 3200.0000 800.0000
86 0 0 22 22:22:22:0 yes 3200.0000 800.0000
87 0 0 23 23:23:23:0 yes 3200.0000 800.0000
88 0 0 24 24:24:24:0 yes 3200.0000 800.0000
89 0 0 25 25:25:25:0 yes 3200.0000 800.0000
90 0 0 26 26:26:26:0 yes 3200.0000 800.0000
91 0 0 27 27:27:27:0 yes 3200.0000 800.0000
92 0 0 28 28:28:28:0 yes 3200.0000 800.0000
93 0 0 29 29:29:29:0 yes 3200.0000 800.0000
94 0 0 30 30:30:30:0 yes 3200.0000 800.0000
95 0 0 31 31:31:31:0 yes 3200.0000 800.0000
96 1 1 32 32:32:32:1 yes 3200.0000 800.0000
97 1 1 33 33:33:33:1 yes 3200.0000 800.0000
98 1 1 34 34:34:34:1 yes 3200.0000 800.0000
99 1 1 35 35:35:35:1 yes 3200.0000 800.0000
100 1 1 36 36:36:36:1 yes 3200.0000 800.0000
101 1 1 37 37:37:37:1 yes 3200.0000 800.0000
102 1 1 38 38:38:38:1 yes 3200.0000 800.0000
103 1 1 39 39:39:39:1 yes 3200.0000 800.0000
104 1 1 40 40:40:40:1 yes 3200.0000 800.0000
105 1 1 41 41:41:41:1 yes 3200.0000 800.0000
106 1 1 42 42:42:42:1 yes 3200.0000 800.0000
107 1 1 43 43:43:43:1 yes 3200.0000 800.0000
108 1 1 44 44:44:44:1 yes 3200.0000 800.0000
109 1 1 45 45:45:45:1 yes 3200.0000 800.0000
110 1 1 46 46:46:46:1 yes 3200.0000 800.0000
111 1 1 47 47:47:47:1 yes 3200.0000 800.0000
112 1 1 48 48:48:48:1 yes 3200.0000 800.0000
113 1 1 49 49:49:49:1 yes 3200.0000 800.0000
114 1 1 50 50:50:50:1 yes 3200.0000 800.0000
115 1 1 51 51:51:51:1 yes 3200.0000 800.0000
116 1 1 52 52:52:52:1 yes 3200.0000 800.0000
117 1 1 53 53:53:53:1 yes 3200.0000 800.0000
118 1 1 54 54:54:54:1 yes 3200.0000 800.0000
119 1 1 55 55:55:55:1 yes 3200.0000 800.0000
120 1 1 56 56:56:56:1 yes 3200.0000 800.0000
121 1 1 57 57:57:57:1 yes 3200.0000 800.0000
122 1 1 58 58:58:58:1 yes 3200.0000 800.0000
123 1 1 59 59:59:59:1 yes 3200.0000 800.0000
124 1 1 60 60:60:60:1 yes 3200.0000 800.0000
125 1 1 61 61:61:61:1 yes 3200.0000 800.0000
126 1 1 62 62:62:62:1 yes 3200.0000 800.0000
127 1 1 63 63:63:63:1 yes 3200.0000 800.0000
And in lscpu we can see how the CPUs are mapped to each socket (node):
NUMA node0 CPU(s): 0-31,64-95
NUMA node1 CPU(s): 32-63,96-127
So we did this to create a "portal" cgroup whose processes are pinned to CPUs 32-42 on the NUMA node1 socket:
cgcreate -g cpuset:portal
apt install cgroup-tools
cgcreate -g cpuset:portal
echo 32-42 > /sys/fs/cgroup/cpuset/portal/cpuset.cpus
And then we mapped that cgroup to our isard-portal HAProxy Docker container like this:
isard-portal:
cgroup_parent: portal
container_name: isard-portal
environment:
API_DOMAIN: isard-api
So why, in this setup where all the Docker container threads are kept within the 32-42 cpuset on the same socket, is HAProxy still hitting the bug? Shouldn't that be enough to avoid threads on both sockets, so that the HAProxy thread-groups/cpu-map setup (https://docs.haproxy.org/2.9/configuration.html#3.1-cpu-map) is not necessary?
What am I missing here?
That's interesting. I think your setup is not effective (for whatever reason), because while you seem to bind to 11 CPUs (32-42), your crash trace shows 64 threads. Could you please check, using "taskset -p $pid", the CPU list that the haproxy process is bound to? It could be the start of an explanation (even if we don't know why it wouldn't work).
Sorry, that log was with unpinned CPUs (https://github.com/haproxy/haproxy/issues/2713#issuecomment-2338584153). We got another crash, this time with pinned CPUs. This is the taskset now:
taskset -p 2233589
pid 2233589's current affinity mask: 7ff00000000
So in binary that is 0x7ff (eleven set bits) shifted left by 32 bits, i.e. CPUs 32-42.
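For reference, taskset can print the same affinity as a readable CPU list instead of a hex mask, which avoids the manual conversion (pid taken from the example above):
```
taskset -cp 2233589
# expected output for mask 0x7ff00000000:
# pid 2233589's current affinity list: 32-42
```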
It's also crashing with this setup
And do you have the trace output in this case, please? I'm also very surprised that in the previous trace there was no backtrace of the calling functions. Off the top of my head I don't see what could be responsible for these not being displayed.
Let's wait for the next hang to be sure it's like this now.
We've got the bug again with the previous cgroups setup (https://github.com/haproxy/haproxy/issues/2713#issuecomment-2341566720). It shows 64 threads, even though the container is started with the cgroup_parent setup shown earlier. We'll investigate more thoroughly on the Docker side why the contained HAProxy is spawning 64 threads instead of the 11 it is limited to.
We had this server running this container for many years (with HAProxy 2.4, I think) without cgroups and never had these issues. We upgraded to 2.9 and it seems the issues started then, and now we're on HAProxy 3 to see if anything was fixed. We will also look at the server's intervention history to see whether it is related. It may be worth rebuilding the old container to check whether it reproduces the bug. We'll keep you posted.
Here you've got the last traceback:
Thread 50 is about to kill the process.
Thread 1 : id=0x7efe000ccb80 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/1 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=55840406167 now=55840412213 diff=6046
curr_task=0
Thread 2 : id=0x7efdf7e5cb30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/2 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=53244021208 now=53244031788 diff=10580
curr_task=0
Thread 3 : id=0x7efdf7e36b30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/3 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=83323521534 now=83331525150 diff=8003616
curr_task=0
Thread 4 : id=0x7efdf7e10b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/4 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=54547905254 now=54555920514 diff=8015260
curr_task=0
Thread 5 : id=0x7efdf7deab30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/5 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=79062739971 now=79062747742 diff=7771
curr_task=0
Thread 6 : id=0x7efdf7dc4b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/6 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=52062327310 now=52062333878 diff=6568
curr_task=0
Thread 7 : id=0x7efdf7d9eb30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/7 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=71005810212 now=71005818743 diff=8531
curr_task=0
Thread 8 : id=0x7efdf7d67b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/8 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=52449565561 now=52453574282 diff=4008721
curr_task=0
Thread 9 : id=0x7efdf7940b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/9 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=60291224073 now=60291232958 diff=8885
curr_task=0
Thread 10: id=0x7efdf791ab30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/10 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=54644634241 now=54648647623 diff=4013382
curr_task=0
Thread 11: id=0x7efdf78f4b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/11 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=52558974281 now=52574969718 diff=15995437
curr_task=0
Thread 12: id=0x7efdf78ceb30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/12 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=56164288866 now=56164300266 diff=11400
curr_task=0
Thread 13: id=0x7efdf78a8b30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/13 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=52930415679 now=52934421114 diff=4005435
curr_task=0
Thread 14: id=0x7efdf7882b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/14 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=53056442062 now=53060449294 diff=4007232
curr_task=0
Thread 15: id=0x7efdf785cb30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/15 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=53506021109 now=53506028028 diff=6919
curr_task=0
Thread 16: id=0x7efdf7836b30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/16 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=76871692491 now=76871700486 diff=7995
curr_task=0
Thread 17: id=0x7efdf7810b30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/17 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=53521426304 now=53525435420 diff=4009116
curr_task=0
Thread 18: id=0x7efdf77eab30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/18 stuck=0 prof=0 harmless=1 isolated=0
cpu_ns: poll=53070321465 now=53070331786 diff=10321
curr_task=0
Thread 19: id=0x7efdf77c4b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/19 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=77024552819 now=77024562473 diff=9654
curr_task=0
Thread 20: id=0x7efdf779eb30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/20 stuck=0 prof=0 harmless=1 isolated=0
cpu_ns: poll=54443784484 now=54443793958 diff=9474
curr_task=0
Thread 21: id=0x7efdf7778b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/21 stuck=0 prof=0 harmless=1 isolated=0
cpu_ns: poll=52641130164 now=52641139738 diff=9574
curr_task=0
Thread 22: id=0x7efdf7752b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/22 stuck=0 prof=0 harmless=1 isolated=0
cpu_ns: poll=52693313001 now=52693317797 diff=4796
curr_task=0
Thread 23: id=0x7efdf772cb30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/23 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=54306628178 now=54306641666 diff=13488
curr_task=0
Thread 24: id=0x7efdf7706b30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/24 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=53509271756 now=53509292515 diff=20759
curr_task=0
Thread 25: id=0x7efdf76e0b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/25 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=49460484780 now=49460513606 diff=28826
curr_task=0
Thread 26: id=0x7efdf76bab30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/26 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=53789773642 now=53789783608 diff=9966
curr_task=0
Thread 27: id=0x7efdf7694b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/27 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=52365460640 now=52377465383 diff=12004743
curr_task=0
Thread 28: id=0x7efdf766eb30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/28 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=52690357521 now=52690365900 diff=8379
curr_task=0
Thread 29: id=0x7efdf7648b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/29 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=53757115996 now=53773118031 diff=16002035
curr_task=0
Thread 30: id=0x7efdf7622b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/30 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=52328984850 now=52332991489 diff=4006639
curr_task=0
Thread 31: id=0x7efdf75fcb30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/31 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=54648207624 now=54648214736 diff=7112
curr_task=0
Thread 32: id=0x7efdf75d6b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/32 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=51734663104 now=51742666304 diff=8003200
curr_task=0
Thread 33: id=0x7efdf75b0b30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/33 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=53725383820 now=53725390743 diff=6923
curr_task=0
Thread 34: id=0x7efdf758ab30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/34 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=54246618656 now=54246624879 diff=6223
curr_task=0
Thread 35: id=0x7efdf7564b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/35 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=52600629881 now=52624637946 diff=24008065
curr_task=0
Thread 36: id=0x7efdf753eb30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/36 stuck=0 prof=0 harmless=1 isolated=0
cpu_ns: poll=51608984650 now=51608994221 diff=9571
curr_task=0
>Thread 37: id=0x7efdf7518b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/37 stuck=1 prof=0 harmless=0 isolated=0
cpu_ns: poll=51429454370 now=53640206132 diff=2210751762
curr_task=0
Thread 38: id=0x7efdf74f2b30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/38 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=55075172358 now=55079173640 diff=4001282
curr_task=0
Thread 39: id=0x7efdf74ccb30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/39 stuck=0 prof=0 harmless=1 isolated=0
cpu_ns: poll=60629646580 now=60629663863 diff=17283
curr_task=0
Thread 40: id=0x7efdf74a6b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/40 stuck=0 prof=0 harmless=1 isolated=0
cpu_ns: poll=54468454345 now=54468464533 diff=10188
curr_task=0
Thread 41: id=0x7efdf7480b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/41 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=54083118663 now=54083127716 diff=9053
curr_task=0
Thread 42: id=0x7efdf745ab30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/42 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=57356857485 now=57356864795 diff=7310
curr_task=0
Thread 43: id=0x7efdf7434b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/43 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=53267770056 now=53267777325 diff=7269
curr_task=0
Thread 44: id=0x7efdf740eb30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/44 stuck=0 prof=0 harmless=1 isolated=0
cpu_ns: poll=54110428109 now=54110436248 diff=8139
curr_task=0
Thread 45: id=0x7efdf73e8b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/45 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=52905496028 now=52913501681 diff=8005653
curr_task=0
>Thread 46: id=0x7efdf73c2b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/46 stuck=1 prof=0 harmless=0 isolated=0
cpu_ns: poll=51309579821 now=53634037744 diff=2324457923
curr_task=0
>Thread 47: id=0x7efdf739cb30 act=1 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/47 stuck=1 prof=0 harmless=0 isolated=0
cpu_ns: poll=60259718249 now=62436339596 diff=2176621347
curr_task=0x7efdfea11370 (task) calls=94581 last=0
fct=0x560717855040(process_resolvers) ctx=0x7efdffe81f20
Thread 48: id=0x7efdf7376b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/48 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=54624133672 now=54624140851 diff=7179
curr_task=0
Thread 49: id=0x7efdf7350b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/49 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=52822605566 now=52822612275 diff=6709
curr_task=0
*>Thread 50: id=0x7efdf732ab30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/50 stuck=1 prof=0 harmless=0 isolated=0
cpu_ns: poll=52220518024 now=54478320130 diff=2257802106
curr_task=0
Thread 51: id=0x7efdf7304b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/51 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=53780709610 now=53780717394 diff=7784
curr_task=0
Thread 52: id=0x7efdf72deb30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/52 stuck=0 prof=0 harmless=1 isolated=0
cpu_ns: poll=54062476133 now=54062486047 diff=9914
curr_task=0
Thread 53: id=0x7efdf72b8b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/53 stuck=0 prof=0 harmless=1 isolated=0
cpu_ns: poll=52213279555 now=52213287400 diff=7845
curr_task=0
Thread 54: id=0x7efdf7292b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/54 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=54853516048 now=54861519528 diff=8003480
curr_task=0
Thread 55: id=0x7efdf726cb30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/55 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=53298056475 now=53298066452 diff=9977
curr_task=0
Thread 56: id=0x7efdf7246b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/56 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=54311461149 now=54315462674 diff=4001525
curr_task=0
Thread 57: id=0x7efdf7220b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/57 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=53795544778 now=53807552870 diff=12008092
curr_task=0
Thread 58: id=0x7efdf71fab30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/58 stuck=0 prof=0 harmless=1 isolated=0
cpu_ns: poll=53674245360 now=53674249825 diff=4465
curr_task=0
Thread 59: id=0x7efdf71d4b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/59 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=65124987231 now=65128999251 diff=4012020
curr_task=0
Thread 60: id=0x7efdf71aeb30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/60 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=53302752574 now=53314760001 diff=12007427
curr_task=0
Thread 61: id=0x7efdf7188b30 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/61 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=53682043106 now=53682050083 diff=6977
curr_task=0
Thread 62: id=0x7efdf7162b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/62 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=64019161210 now=64019173907 diff=12697
curr_task=0
Thread 63: id=0x7efdf713cb30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/63 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=51801871369 now=51809875863 diff=8004494
curr_task=0
Thread 64: id=0x7efdf7116b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/64 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=53507949545 now=53507956122 diff=6577
curr_task=0
HAProxy version 3.0.4-7a59afa 2024/09/03 - https://haproxy.org/
[NOTICE] (37) : haproxy version is 3.0.4-7a59afa
[ALERT] (37) : Current worker (39) exited with code 134 (Aborted)
[WARNING] (37) : A worker process unexpectedly died and this can only be explained by a bug in haproxy or its dependencies.
Please check that you are running an up to date and maintained version of haproxy and open a bug report.
[ALERT] (37) : exit-on-failure: killing every processes with SIGTERM
[WARNING] (37) : All workers exited. Exiting... (134)
Status: long-term supported branch - will stop receiving fixes around Q2 2029.
Known bugs: http://www.haproxy.org/bugs/bugs-3.0.4.html
Running on: Linux 5.4.0-96-generic #109-Ubuntu SMP Wed Jan 12 16:49:16 UTC 2022 x86_64
It's possible that your load simply increased. Are you certain you didn't force the thread count to 64 anywhere? Normally, if you start with more threads than available CPUs, a warning is emitted.
Also, if you want, you can force the thread count using "nbthread" and the CPU bindings using the cpu-map directive. But it would for sure be better to know why the configured cgroup mapping isn't respected :-)
Load is even lower these days. There's no way we set 64 threads manually anywhere.
Now trying with numactl --cpubind=0 --membind=0 docker compose -f docker-compose-portal.yml up -d to be sure the bug is dual-socket related.
Some days have passed since we tried with numactl. The bug reproduced as well. We now have another dual-socket server showing the same problem.
In the meantime we added a cron script that monitors the Docker HAProxy logs and restarts it...
Any ideas to try?
I'll try now with:
isard-portal:
cpuset: "0-31"
cgroup_parent: portal
container_name: isard-portal
environment:
API_DOMAIN: isard-api
But I see that the main HAProxy process inside Docker may not be pinned, while the forked one seems to be... in a different cpuset? I don't know why it has these values.
root 160852 1.0 0.0 16152 8640 ? Ss 15:38 0:00 \_ haproxy -W -db -f /usr/local/etc/haproxy/haproxy.cfg
root 160925 0.0 0.0 1604 4 ? S 15:38 0:00 \_ inotifyd haproxy-reload /certs/chain.pem c
root 160927 353 0.0 428868 100468 ? Sl 15:38 1:31 \_ haproxy -W -db -f /usr/local/etc/haproxy/haproxy.cfg
# taskset -cp 160852
pid 160852's current affinity list: 0-127
# taskset -cp 160927
pid 160927's current affinity list: 32-63,96-127
I don't know how the CPU pinning is passed through Docker, but there are obviously problems here. I think you'd rather not pin anything and instead do the pinning from within HAProxy itself, using the nbthread and cpu-map directives. E.g.:
global
nbthread 32
cpu-map 1-32 0-31
or this if you also want to enable the second thread of each core (though the former should be sufficient):
global
nbthread 64
cpu-map 1-64 0-31,64-95
And you're done.
Also since you don't have backtraces enabled due to musl, it would be useful to enable core dumps and check where it died. For this you can add set-dumpable in your global section, and make sure your system collects core dumps, even in docker containers. Then after opening the haproxy binary and the core together under gdb, you can issue "t a a bt full" and you'll get a nice complete backtrace of each thread.
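A rough sketch of that workflow, with assumed paths (the binary location is the one used by the official Docker image; the core_pattern target is just an example):
```
# in the haproxy.cfg global section:
#     set-dumpable
# on the host (core_pattern is global and also applies inside containers):
echo '/tmp/core.%p' > /proc/sys/kernel/core_pattern
# allow core dumps for the container, e.g. docker run --ulimit core=-1 ...
# after a crash, open the binary together with the core:
gdb /usr/local/sbin/haproxy /tmp/core.<pid>
(gdb) thread apply all bt full    # the long form of "t a a bt full"
```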
Two days since this change and no bug has appeared. If it comes back I'll apply the nbthread fix, thanks.
Well, at least the process is correctly pinned to a single node. It's just surprising that it doesn't correspond to the one you've chosen!
@jvinolas, I'm closing the issue but if you have new info, feel free to reopen it. Thanks !
Hello,
I am running into the same issue (I am running the isard-portal service itself).
I am facing this issue on a dual-socket system...
Here is more info from the logs:
WARNING! thread 48 has stopped processing traffic for 1099 milliseconds
with 0 streams currently blocked, prevented from making any progress.
While this may occasionally happen with inefficient configurations
involving excess of regular expressions, map_reg, or heavy Lua processing,
this must remain exceptional because the system's stability is now at risk.
Timers in logs may be reported incorrectly, spurious timeouts may happen,
some incoming connections may silently be dropped, health checks may
randomly fail, and accesses to the CLI may block the whole process. The
blocking delay before emitting this warning may be adjusted via the global
'warn-blocked-traffic-after' directive. Please check the trace below for
any clues about configuration elements that need to be corrected:
*>Thread 48: id=0x7202d98b0b30 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
1/48 stuck=1 prof=0 harmless=0 isolated=0
cpu_ns: poll=2133566308 now=3232754093 diff=1099187785
curr_task=0
=> Trying to gracefully recover now.
Thread 18 is about to kill the process.
I am attaching the logs as a file.
As a test, I will be running the same service on a single-socket server (with 96C/96T). Let's see how it does there. Meanwhile, please let me know if you need more info to solve this issue.
Hello! For others: I'm seeing in your trace that this is version 3.1.1. There are lines with many dates between brackets; I don't understand what these are. It seems there are some parts of HAProxy logs there, though I'm not sure. We'll need your config to sort all this out.
I'm also seeing quite a number of warnings due to very slow processing. Do you have many rules, or heavy regex or anything ? While it's running, could you please issue a "show dev" on the CLI and paste the output ?
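For reference, "show dev" can be issued over the stats socket; this is a minimal sketch assuming such a socket is configured in the global section (one would have to be added if absent):
```
# hypothetical socket path; requires something like this in the global section:
#     stats socket /var/run/haproxy.sock mode 600 level admin
echo "show dev" | socat stdio /var/run/haproxy.sock
```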
I'm not seeing any backtrace, would you happen to be running on libmusl by any chance ? Could you please post the output of "haproxy -vv" so that we're up to date with your setup ?
And if you could isolate the core, open it in gdb and issue "t a a bt full" so that we see what the threads were doing, that would be great, because in the current state, we know nothing :-( Well, to be more precise, we only know that 2 threads were stuck, one of which was in process_resolvers().
By the way, do you have numerous servers in your config that are learned via the DNS ? If so it might be a cause, as there are still some heavy resolution paths there, though I can never enumerate them, as DNS-based discovery is extremely heavy as it tries to spot missing addresses and fill holes without creating duplicates.
Reopened for now.
Hi, here are some more info..
The isard-portal (https://gitlab.com/isard/isardvdi) builds on top of the official HAProxy Docker images (I believe the current one runs on haproxy:3.1.1-alpine3.21).
There are 17 services being learned via DNS.
haproxy -vv
HAProxy version 3.1.1-717960d 2024/12/11 - https://haproxy.org/
Status: stable branch - will stop receiving fixes around Q1 2026.
Known bugs: http://www.haproxy.org/bugs/bugs-3.1.1.html
Running on: Linux 6.8.0-50-generic #51-Ubuntu SMP PREEMPT_DYNAMIC Sat Nov 9 17:58:29 UTC 2024 x86_64
Build options :
TARGET = linux-musl
CC = cc
CFLAGS = -O2 -g -fwrapv
OPTIONS = USE_GETADDRINFO=1 USE_OPENSSL=1 USE_LUA=1 USE_PROMEX=1 USE_PCRE2=1 USE_PCRE2_JIT=1
DEBUG =
Feature list : -51DEGREES +ACCEPT4 -BACKTRACE -CLOSEFROM +CPU_AFFINITY +CRYPT_H -DEVICEATLAS +DL -ENGINE +EPOLL -EVPORTS +GETADDRINFO -KQUEUE -LIBATOMIC +LIBCRYPT +LINUX_CAP +LINUX_SPLICE +LINUX_TPROXY +LUA +MATH -MEMORY_PROFILING +NETFILTER +NS -OBSOLETE_LINKER +OPENSSL -OPENSSL_AWSLC -OPENSSL_WOLFSSL -OT -PCRE +PCRE2 +PCRE2_JIT -PCRE_JIT +POLL +PRCTL -PROCCTL +PROMEX -PTHREAD_EMULATION -QUIC -QUIC_OPENSSL_COMPAT +RT +SHM_OPEN +SLZ +SSL -STATIC_PCRE -STATIC_PCRE2 +TFO +THREAD +THREAD_DUMP +TPROXY -WURFL -ZLIB
Default settings :
bufsize = 16384, maxrewrite = 1024, maxpollevents = 200
Built with multi-threading support (MAX_TGROUPS=16, MAX_THREADS=256, default=256).
Built with OpenSSL version : OpenSSL 3.3.2 3 Sep 2024
Running on OpenSSL version : OpenSSL 3.3.2 3 Sep 2024
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
OpenSSL providers loaded : default
Built with Lua version : Lua 5.4.7
Built with the Prometheus exporter as a service
Built with network namespace support.
Built with libslz for stateless compression.
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE2 version : 10.43 2024-02-16
PCRE2 library supports JIT : yes
Encrypted password support via crypt(3): yes
Built with gcc compiler version 14.2.0
Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.
Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
h2 : mode=HTTP side=FE|BE mux=H2 flags=HTX|HOL_RISK|NO_UPG
<default> : mode=HTTP side=FE|BE mux=H1 flags=HTX
h1 : mode=HTTP side=FE|BE mux=H1 flags=HTX|NO_UPG
fcgi : mode=HTTP side=BE mux=FCGI flags=HTX|HOL_RISK|NO_UPG
<default> : mode=SPOP side=BE mux=SPOP flags=HOL_RISK|NO_UPG
spop : mode=SPOP side=BE mux=SPOP flags=HOL_RISK|NO_UPG
<default> : mode=TCP side=FE|BE mux=PASS flags=
none : mode=TCP side=FE|BE mux=PASS flags=NO_UPG
Available services : prometheus-exporter
Available filters :
[BWLIM] bwlim-in
[BWLIM] bwlim-out
[CACHE] cache
[COMP] compression
[FCGI] fcgi-app
[SPOE] spoe
[TRACE] trace
haproxy config (the isard service creates this config)
### START 00_begin.cfg ###
resolvers mydns
nameserver dns1 127.0.0.11:53
global
daemon
tune.ssl.default-dh-param 2048
log stdout format raw local0
defaults
mode http
timeout connect 25s
timeout client 25s
timeout client-fin 25s
timeout server 25s
timeout tunnel 7200s
option http-server-close
option httpclose
maxconn 2000
option tcpka
option forwardfor
option persist
timeout tarpit 12s
### END 00_begin.cfg ###
### START 01_logs.cfg ###
.if defined(HAPROXY_LOGGING)
log global
option httplog
.endif
# Don't log normal access. Disable to get all requests in log.
.if !defined(HAPROXY_LOGGING_NORMAL)
option dontlog-normal
.endif
# https://www.haproxy.com/blog/haproxy-log-customization/
log-format '{"time": "[%t]", "src":"%[src]", "method":"%[capture.req.method]", "status": "%ST", "uri":"%[capture.req.uri]", "backend":"%s", "blk":"%[var(txn.block)]"}'
### END 01_logs.cfg ###
### START 04_squid.cfg ###
frontend fe_proxy_squid
bind 0.0.0.0:80
mode tcp
option tcplog
tcp-request inspect-delay 10s
# Blacklist & Whitelist
acl blacklisted src -f /usr/local/etc/haproxy/lists/black.lst -f /usr/local/etc/haproxy/lists/external/black.lst
acl whitelisted src -f /usr/local/etc/haproxy/lists/white.lst
tcp-request content set-var(txn.block) str("BLACKLISTED") if blacklisted !whitelisted !{ path_beg -i /.well-known/acme-challenge/ }
tcp-request content reject if blacklisted !whitelisted !{ path_beg -i /.well-known/acme-challenge/ }
tcp-request content accept if { ssl_fc }
tcp-request content accept if !HTTP
tcp-request content capture req.hdr(Host) len 150
acl is_subdomain hdr_sub(Host) -i end ".${DOMAIN}"
use_backend be_subdomain if is_subdomain
use_backend be_letsencrypt if { path_beg -i /.well-known/acme-challenge/ }
use_backend redirecthttps-backend if !{ method CONNECT }
default_backend be_isard-squid
backend be_isard-squid
mode tcp
option redispatch
option abortonclose
server squid isard-squid:8080 check port 8080 inter 5s rise 2 fall 3 resolvers mydns init-addr none
backend be_subdomain
mode tcp
server bastion isard-bastion:1313 check port 1313 inter 5s rise 2 fall 3 resolvers mydns init-addr none
### END 04_squid.cfg ###
### START 12_rdp.cfg ###
frontend RDP
mode tcp
bind *:9999
# Blacklist & Whitelist
acl blacklisted src -f /usr/local/etc/haproxy/lists/black.lst -f /usr/local/etc/haproxy/lists/external/black.lst
acl whitelisted src -f /usr/local/etc/haproxy/lists/white.lst
http-request set-var(txn.block) str("BLACKLISTED") if blacklisted !whitelisted
http-request reject if blacklisted !whitelisted
default_backend be_isard-rdpgw
backend be_isard-rdpgw
mode tcp
# http-request replace-path /rdpgw/(.*) /\1
server vpn isard-vpn:1313 maxconn 1000 check port 1313 inter 5s rise 2 fall 3 resolvers mydns init-addr none
### END 12_rdp.cfg ###
### START 16_00_fe_secured_begin.cfg ###
frontend fe_secured
bind 0.0.0.0:443
mode tcp
tcp-request inspect-delay 5s
tcp-request content accept if { req_ssl_hello_type 1 }
tcp-request content capture req_ssl_sni len 150
acl is_subdomain req_ssl_sni -m sub end ".${DOMAIN}"
use_backend be_subdomain if is_subdomain
default_backend be_ssl_backend
backend be_ssl_backend
mode tcp
server ssl_terminator 127.0.0.1:8443
frontend fe_ssl
bind 0.0.0.0:8443 ssl crt /certs/chain.pem
mode http
timeout client 3600s
maxconn 50000
option httpclose
option tcpka
# BEGIN ACLs
acl is_forbid_domain_ip env(FORBID_DOMAIN_IP) -m str true
acl is_domain hdr(host) -m str "${DOMAIN}"
acl is_blacklisted src -f /usr/local/etc/haproxy/lists/black.lst -f /usr/local/etc/haproxy/lists/external/black.lst
acl is_whitelisted src -f /usr/local/etc/haproxy/lists/white.lst
acl is_bad_path path_beg -i /. /BitKeeper
# END ACLs
# Blacklist & Whitelist
http-request set-var(txn.block) str("BLACKLISTED") if is_blacklisted !is_whitelisted
http-request reject if is_blacklisted !is_whitelisted
# Allow only $DOMAIN accesses, not IP
http-request set-var(txn.block) str("IP_ACCESS") if !is_domain is_forbid_domain_ip !is_whitelisted
http-request reject if !is_domain is_forbid_domain_ip !is_whitelisted
# Bad paths
http-request set-var(txn.block) str("BAD PATH") if is_bad_path
http-request reject if is_bad_path
# Security Headers
#https://cheatsheetseries.owasp.org/cheatsheets/HTTP_Headers_Cheat_Sheet.html
http-response del-header X-Powered-By
http-response del-header Server
http-response set-header Strict-Transport-Security "max-age=63072000; includeSubDomains; preload"
http-response add-header X-Frame-Options DENY
http-response add-header X-XSS-Protection 0
# http-response set-header Content-Security-Policy:script-src https://<scripts domains> (only in devel)
http-response add-header Referrer-Policy no-referrer
http-response add-header X-Content-Type-Options nosniff
# BEGIN CORS
http-response add-header Access-Control-Allow-Origin "${CORS}"
http-response add-header Access-Control-Allow-Headers "Origin, X-Requested-With, Content-Type, Accept, Authorization"
http-response add-header Access-Control-Max-Age 3628800
http-response add-header Access-Control-Allow-Methods "GET, POST, PUT, DELETE"
# END CORS
### END 16_00_fe_secured_begin.cfg ###
### START 16_04_fe_secured_abuse.cfg ###
## Register Abuse
acl is_login_register path_beg /api/v3/user/register
## System Abuse
acl is_db_debug path_beg /debug/db
tcp-request inspect-delay 5s
tcp-request content track-sc0 src table AbuseSystem
# acl err_abuse src,table_http_err_rate(AbuseSystem) ge 25
# acl rate_abuse src,table_http_req_rate(AbuseSystem) ge 100
# use_backend err_limiter if err_abuse
# use_backend rate_limiter if rate_abuse !err_abuse
tcp-request content accept
acl authorized http_auth(AuthUsers)
tcp-request content accept if is_db_debug !authorized WAIT_END
http-request set-var(txn.block) str("ABUSE DB") if { src,table_http_err_rate(AbuseSystem) ge 4 } is_db_debug
http-request deny deny_status 401 if { src,table_http_err_rate(AbuseSystem) ge 4 } is_db_debug
http-request set-var(txn.block) str("ABUSE REGISTER") if { src,table_http_err_rate(AbuseSystem) ge 500 } is_login_register
http-request tarpit deny_status 429 if { src,table_http_err_rate(AbuseSystem) ge 500 } is_login_register
### END 16_04_fe_secured_abuse.cfg ###
### START 16_12_fe_secured_end.cfg ###
acl is_upgrade hdr(Connection) -i upgrade
acl is_websocket hdr(Upgrade) -i websocket
acl is_guacamole_ws path_beg /websocket-tunnel
acl is_guacamole_http path_beg /tunnel
acl is_frontend_dev_ws hdr(Sec-WebSocket-Protocol) -i vite-hmr
acl is_frontend_path path_beg /login or path_beg /migration or path_beg /frontend
acl is_old_frontend_dev_ws path_beg path_beg /sockjs-node/
acl is_api path_beg /api
http-request set-log-level silent if is_websocket
# GUACAMOLE ENDPOINTS
use_backend be_isard-guacamole if is_websocket is_guacamole_ws
use_backend be_isard-guacamole if is_guacamole_http
# AUTHENTICATION ENDPOINTS
use_backend be_isard-authentication if { path_beg /authentication }
# API ENDPOINTS
use_backend be_isard-apiv3 if { path_beg /api/v3 }
use_backend be_isard-apiv3 if is_websocket { path_beg /api/v3/socket.io }
# WEBAPP ENDPOINTS
use_backend be_isard-webapp if { path_beg /isard-admin } or { path_beg /isard-admin/ }
# SCHEDULER ENDPOINTS
use_backend be_isard-scheduler if { path_beg /scheduler }
# ENGINE ENDPOINTS
use_backend be_isard-engine if { path_beg /engine }
# DEFAULT WEBSOCKETS: HTML5 ENDPOINT
use_backend be_isard-websockify if is_websocket !is_frontend_dev_ws !is_old_frontend_dev_ws
# debug backends
use_backend be_isard-db if { path_beg /debug/db }
# use_backend be_isard-video if { path_beg /debug/video }
# graph backends
use_backend be_isard-grafana if { path_beg /monitor } or { path_beg /monitor/ }
# PROMETHEUS BACKEND
use_backend be_isard-prometheus if { path_beg /prometheus } or { path_beg /prometheus/ }
# NEXTCLOUD ENDPOINTS
use_backend be_isard-nc if { path_beg /isard-nc }
# develop backends
# This must be the last use_backend directive
use_backend be_isard-static if { env(DEVELOPMENT) -m str true } { path_beg /assets/ }
use_backend be_isard-frontend-dev if { env(DEVELOPMENT) -m str true } is_frontend_path
use_backend be_isard-old-frontend-dev if { env(DEVELOPMENT) -m str true } !{ path_beg /viewer/ } !{ path_beg /custom/ }
default_backend be_isard-static
### END 16_12_fe_secured_end.cfg ###
### START 20_backends.cfg ###
backend AbuseSystem
stick-table type ip size 1K expire 5m store http_err_rate(30s)
# backend rate_limiter
# mode http
# http-request deny deny_status 429
# backend err_limiter
# mode http
# http-request reject
backend be_isard-engine
server engine isard-engine:5000 check port 5000 inter 5s rise 2 fall 3 resolvers mydns init-addr none
backend be_isard-guacamole
server guacamole isard-guac:4567 check port 4567 inter 5s rise 2 fall 3 resolvers mydns init-addr none
backend be_isard-websockify
server websockify isard-websockify:8080 check port 8080 inter 5s rise 2 fall 3 resolvers mydns init-addr none
backend be_isard-authentication
option forwardfor
server authentication isard-authentication:1313 maxconn 1000 check port 1313 inter 5s rise 2 fall 3 resolvers mydns init-addr none
backend be_isard-static
server static isard-static:80 maxconn 1000 check port 80 inter 5s rise 2 fall 3 resolvers mydns
backend be_isard-nc
server nc-nginx isard-nc-nginx:80 maxconn 1000 check port 80 inter 5s rise 2 fall 3 resolvers mydns init-addr none
backend be_isard-frontend-dev
http-request set-path /frontend%[path] if { path_reg ^/(?!frontend/) }
server frontend-dev isard-frontend-dev:5173 maxconn 1000 check port 5173 inter 5s rise 2 fall 3 resolvers mydns init-addr none
backend be_isard-old-frontend-dev
server frontend-dev isard-old-frontend-dev:8080 maxconn 1000 check port 8080 inter 5s rise 2 fall 3 resolvers mydns init-addr none
backend be_isard-db
acl authorized http_auth(AuthUsers)
http-request auth realm AuthUsers unless authorized
http-request redirect scheme http drop-query append-slash if { path -m str /debug/db }
http-request replace-path /debug/db/(.*) /\1
http-request del-header Authorization
server metrics-db "${RETHINKDB_HOST}":8080 maxconn 10 check port 8080 inter 5s rise 2 fall 3 resolvers mydns init-addr none
backend be_isard-grafana
timeout server 300s
http-request set-header X-JWT-Assertion %[req.cook(isardvdi_session),regsub("^Bearer ","")]
server isard-grafana "${GRAFANA_HOST}":3000 maxconn 10 check port 3000 inter 5s rise 2 fall 3 resolvers mydns init-addr none
backend be_isard-prometheus
# Require a JWT token in the Authorization header
http-request deny content-type 'text/html' string 'Missing Authorization HTTP header' unless { req.hdr(authorization) -m found }
# get header part of the JWT
http-request set-var(txn.alg) http_auth_bearer,jwt_header_query('$.alg')
# get payload part of the JWT
http-request set-var(txn.iss) http_auth_bearer,jwt_payload_query('$.iss')
http-request set-var(txn.kid) http_auth_bearer,jwt_payload_query('$.kid')
http-request set-var(txn.exp) http_auth_bearer,jwt_payload_query('$.exp','int')
http-request set-var(txn.role) http_auth_bearer,jwt_payload_query('$.data.role_id')
# Validate the JWT
http-request deny content-type 'text/html' string 'Unsupported JWT signing algorithm' unless { var(txn.alg) -m str HS256 }
http-request deny content-type 'text/html' string 'Invalid JWT issuer' unless { var(txn.iss) -m str isard-authentication }
http-request deny content-type 'text/html' string 'Invalid JWT Key ID' unless { var(txn.kid) -m str isardvdi }
http-request deny content-type 'text/html' string 'Invalid JWT signature' unless { http_auth_bearer,jwt_verify(txn.alg,"${API_ISARDVDI_SECRET}") -m int 1 }
http-request set-var(txn.now) date()
http-request deny content-type 'text/html' string 'JWT has expired' if { var(txn.exp),sub(txn.now) -m int lt 0 }
# Deny requests that lack sufficient permissions
http-request deny unless { var(txn.role) -m sub admin }
http-request set-path %[path,regsub(^/prometheus/?,/)]
server isard-prometheus isard-prometheus:9090 maxconn 1000 check port 9090 inter 5s rise 2 fall 3 resolvers mydns init-addr none
backend be_isard-webapp
timeout queue 600s
timeout server 600s
timeout connect 600s
server static "${WEBAPP_HOST}":5000 maxconn 100 check port 5000 inter 5s rise 2 fall 3 resolvers mydns init-addr none
backend be_isard-apiv3
option forwardfor
timeout queue 600s
timeout server 600s
timeout connect 600s
http-response set-header Access-Control-Allow-Origin "*"
server isard-api isard-api:5000 maxconn 1000 check port 5000 inter 5s rise 2 fall 3 resolvers mydns init-addr none
backend be_isard-scheduler
option forwardfor
timeout queue 5s
timeout server 10s
timeout connect 5s
http-response set-header Access-Control-Allow-Origin "*"
server isard-scheduler isard-scheduler:5000 maxconn 1000 check port 5000 inter 5s rise 2 fall 3 resolvers mydns init-addr none
### END 20_backends.cfg ###
### START 29_be_defaults.cfg ###
backend be_letsencrypt
server letsencrypt 127.0.0.1:8080
backend redirecthttps-backend
mode http
.if defined(HTTPS_PORT)
http-request redirect location https://%[hdr(host),field(1,:)]:"$HTTPS_PORT"%[capture.req.uri]
.endif
http-request redirect scheme https if !{ ssl_fc }
server localhost:443 127.0.0.1:443 check
backend be_drop
mode http
http-request silent-drop
### END 29_be_defaults.cfg ###
### START 30_stats.cfg ###
frontend prometheus
bind *:9090
http-request use-service prometheus-exporter if { path /metrics }
stats enable
stats uri /stats
stats refresh 10s
# listen stats
# bind 0.0.0.0:8888
# mode http
# stats enable
# option httplog
# stats show-legends
# stats uri /haproxy
# stats realm Haproxy\ Statistics
# stats refresh 5s
# #stats auth staging:Password
# #acl authorized http_auth(AuthUsers)
# #stats http-request auth unless authorized
# timeout connect 5000ms
# timeout client 50000ms
# timeout server 50000ms
### END 30_stats.cfg ###
### START 31_auth.cfg ###
userlist AuthUsers
user admin password hithere
### END 31_auth.cfg ###
Also, I am testing the same service on different setups.
Here is a summary:
On all the setups I have access to right now, it crashes the same way.
Config 1 - Dual AMD EPYC 9654 96-Core (total of 192 C / 384 T)
Config 2 - Single AMD EPYC 9654 96-Core (96 C / 96 T - HT is disabled)
Config 3 - Dual AMD EPYC 9354 32-Core (64 C / 64 T - HT is disabled)
On configs 1 and 2, the error comes out almost instantly and the service quits soon after (mostly within ~20 min). On config 3, the service dies after ~8 hrs.
Lemme know if you need more details.
OK thanks. So it definitely looks like heavy contention. Interestingly, your number of threads is not forced, so you're running 64 threads on all configs, but the result will depend on how the system distributes those threads. On the dual-socket machine, if threads from the same group are spread across the two sockets, it's extremely bad.
Do you know if your load is usually high? I don't know this service, but when I see authentication and portals, it's generally super light for the LB. I suggested to the user above to just pin haproxy to a subset of the CPUs of a single socket with this:
global
nbthread 32
cpu-map 1-32 0-31
It solved the problem there. But if the load is very low you can even use nbthread 1. Note that EPYCs are notorious for having very slow inter-CCD communication, so if your workload allows it, it's best to limit yourself to the first cores of the first CCD only, for example by limiting yourself to 4 threads and CPUs 0-3.
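A minimal sketch of that variant, assuming CPUs 0-3 all sit on the first CCD of the first socket on your machine (which you can sanity-check with lscpu -e), would be:
global
# 4 threads, all restricted to CPUs 0-3
nbthread 4
cpu-map 1-4 0-3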
I'm currently working on adding features to make that much easier to auto-configure.
Okay, I am trying this on all 3 system configs I mentioned. I will get back if any of the servers starts throwing errors.
Just to be sure, should I create a cgroup and run it under that, or would just the config change work?
No, no need to fiddle with cgroups, just change the config. Setting cpu-map will restrict the CPUs the process is allowed to run on. As long as nbthread is lower than or equal to the range you've set, you'll be fine.
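If you want to double-check that the mapping was applied, a quick sketch like this one (run on the host, and assuming a single haproxy process; adjust the pid selection if a master/worker pair is running) will print the affinity of every thread:
# show the CPU affinity list of each haproxy thread
pid=$(pidof -s haproxy)
for tid in /proc/"$pid"/task/*; do
  taskset -cp "${tid##*/}"
done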
Update - No crash till now.
In case you release patches related to this issue, please mention them here; I'll be very happy to test them on my servers and get back to you.
I'm trying to get that done for 3.2. At a minimum I'd like to make it easier to limit the number of threads to one socket, one die, one CCX, a few cores, etc., and set the thread groups accordingly.
Independently of this, I'm still interested in trying to figure out what subsystem triggers the issue for you. We've significantly reduced locking over time and maybe there's a pathologically bad one in your case that can be improved. We know that queuing and LB are still areas that need improvement, but any feedback (either direct backtraces, or some provided by gdb) could be very helpful.
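For example, if it hangs again, attaching gdb from the host and dumping all thread backtraces along the lines of the sketch below (assuming a single haproxy process) would already tell us a lot:
# attach to the running/hung process and dump a full backtrace of every thread
gdb -p "$(pidof -s haproxy)" -batch -ex 'set pagination off' -ex 'thread apply all bt full'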
By the way, since you're using musl, could you please try with USE_BACKTRACE=1? I know it doesn't work with all archs on musl, which is why it's not enabled by default. But it does work for some (I've seen x86 work fine at least). It would provide more exploitable backtraces when the watchdog triggers. That might be something we'd attempt to enable by default on certain combinations.
No difference in the outputs. (I set the env var USE_BACKTRACE=1 and also launched haproxy with USE_BACKTRACE=1 haproxy -f ....)
Lemme know if it's supposed to be done differently, or if you need any other info.
Sorry for not having been clear. I meant to pass that to make when building haproxy, that is:
make TARGET=linux-musl ... USE_BACKTRACE=1
It may very well fail to build, but normally with a modern enough musl on x86 it should work. At least for me it does with a not-so-recent musl (1.2.2):
FATAL: bug condition "one > zero" matched at src/debug.c:852
This was triggered on purpose from the CLI 'debug dev bug' command.
call trace(10):
| 0x55fc6fe5748f [48 8d 05 32 93 4a 00 48]: debug_parse_cli_bug+0x86
| 0x55fc6fde7e4b [85 c0 75 28 48 8b 85 b8]: main-0x98d20
| 0x55fc6fde87de [48 8b 85 38 ff ff ff 48]: main-0x9838d
| 0x55fc6fe99ff5 [8b 05 f5 30 77 00 85 c0]: task_process_applet+0x575
| 0x55fc701eac27 [48 89 45 e8 eb 35 48 8b]: run_tasks_from_lists+0x645
| 0x55fc701eb45e [89 c2 8b 45 dc 29 d0 89]: process_runnable_tasks+0x706
| 0x55fc6fe800f8 [8b 05 36 62 7d 00 83 f8]: run_poll_loop+0x8b
| 0x55fc6fe8088e [48 8b 05 6b 60 79 00 48]: run_thread_poll_loop+0x359
| 0x7f4323254371 [48 89 c7 e8 78 fd ff ff]: ld-musl-x86_64:+0x54371
We also applied the cpu mapping with nbthread and it has not crashed for many days now, so it seems to be fixed by that. Let's wait for the next haproxy version to see if it fixes the underlying issue.
Meanwhile we made this change:
### START 00_begin.cfg ###
global
cpu-map 1-4 0 1 2 3
nbthread 4
daemon
tune.ssl.default-dh-param 2048
log stdout format raw local0
resolvers mydns
nameserver dns1 127.0.0.11:53
And we used this numa.sh script on our server to generate the cpu-map/nbthread lines with bash numa.sh 4, just in case it's useful (ChatGPT wrote it):
#!/bin/bash
# Usage: ./numa.sh [MAX_THREADS]
CONFIG_FILE="haproxy_numa.cfg"
# Get maximum threads from argument (default to 0, meaning no limit)
MAX_THREADS=${1:-0}
# Get NUMA topology
NUM_NODES=$(numactl --hardware | grep "available" | awk '{print $2}')
TOTAL_THREADS=0
# Start the configuration
echo "global" > $CONFIG_FILE
# Loop through each NUMA node
for NODE in $(seq 0 $(($NUM_NODES - 1))); do
# Get CPUs for this node
CPUS=$(numactl --hardware | grep "node $NODE cpus:" | cut -d: -f2 | xargs)
CPU_COUNT=$(echo $CPUS | wc -w)
# Calculate remaining threads if MAX_THREADS is set
REMAINING_THREADS=$((MAX_THREADS - TOTAL_THREADS))
if [ "$MAX_THREADS" -gt 0 ] && [ "$REMAINING_THREADS" -le 0 ]; then
break
fi
# Adjust CPU_COUNT if it exceeds the remaining thread limit
if [ "$MAX_THREADS" -gt 0 ] && [ "$CPU_COUNT" -gt "$REMAINING_THREADS" ]; then
CPU_COUNT=$REMAINING_THREADS
CPUS=$(echo $CPUS | awk '{for (i=1; i<='"$CPU_COUNT"'; i++) printf $i " ";}')
fi
# Define thread range
START_THREAD=$(($TOTAL_THREADS + 1))
END_THREAD=$(($TOTAL_THREADS + CPU_COUNT))
TOTAL_THREADS=$END_THREAD
# Append cpu-map configuration
echo " cpu-map $START_THREAD-$END_THREAD $CPUS" >> $CONFIG_FILE
done
# Add the final nbthread line
echo " nbthread $TOTAL_THREADS" >> $CONFIG_FILE
echo "Generated HAProxy NUMA-aware config:"
cat $CONFIG_FILE
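For illustration, on a hypothetical machine with two NUMA nodes of 8 CPUs each, bash numa.sh 8 would print something like:
Generated HAProxy NUMA-aware config:
global
 cpu-map 1-8 0 1 2 3 4 5 6 7
 nbthread 8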