Zombie sessions remain after reload with srv_queue + mux fallback (HAProxy 2.8.12)
Detailed Description of the Problem
HAProxy Version: 2.8.12
Platform: Linux 3.10.0-1127.19.1.el7.x86_64
Problem:
- The backend server stops responding and the queue count increases.
- Sessions stuck in HEopI state with .exp=<NEVER>, .req.state=MSG_DATA, .res.state=MSG_RPBEFORE
- There are too many stick tables (Maint status).
- The shutdown session command fails to clear them
- Uses srv_queue heavily + POST requests
- Seen during mux fallback (H2 → H1)
- A graceful reload does not clear the old sessions
Observed:
- Zombie sessions remain for days
- No buffer content
- Status remains 502 (server unresponsive)
- FD not cleaned
Note: we set timeout queue 10s in the config.
Expected Behavior
Sessions should time out (we set timeout queue 10s) or be cleaned up during reload.
Steps to Reproduce the Behavior
It's hard to reproduce.
Do you have any idea what may have caused this?
I have not set maxqueue in the configuration.
Do you have an idea how to solve the issue?
I can't.
What is your configuration?
defaults HTTP
#### logging ####
log global
#### error file list ####
errorfile 200 /opt//haproxy/etc/errfile/200.http
errorfile 400 /opt//haproxy/etc/errfile/400.http
errorfile 403 /opt//haproxy/etc/errfile/403.http
errorfile 408 /opt//haproxy/etc/errfile/408.http
errorfile 500 /opt//haproxy/etc/errfile/500.http
errorfile 502 /opt//haproxy/etc/errfile/502.http
#### mode(http or tcp) ####
mode http
#### option ####
option http-keep-alive
option http-ignore-probes
option dontlognull
option log-health-checks
#option dontlog-normal
option contstats
option redispatch
option tcp-smart-accept
option tcp-smart-connect
option abortonclose
option splice-auto
option allbackups
option h1-case-adjust-bogus-client
#### not comply with RFC7230 ####
option accept-invalid-http-request
#### server connection retries ####
retries 2
#### timeout parameter ####
timeout connect 3100
timeout client 65s
timeout server 65s
timeout check 3s
timeout queue 10s
timeout tarpit 10s
timeout http-request 10s
timeout http-keep-alive 60s
#### default proxy max connection ####
maxconn 3100000
#### default balance algorithm ####
balance roundrobin
maxconn 3100000
default-server maxconn 2000 inter 5s fastinter 2s downinter 10s rise 3 fall 2 slowstart 10s resolve-prefer ipv4
bind :443 tfo ssl curves X25519:P-256 crt-list /opt//haproxy/ssl/certs/crt-list.txt alpn h2,http/1.1 tls-ticket-keys /opt//haproxy/ssl/key/ticket.keys
bind quic4@:443 tfo ssl curves X25519:P-256 crt-list /opt//haproxy/ssl/certs/crt-list.txt alpn h3 tls-ticket-keys /opt//haproxy/ssl/key/ticket.keys
Output of haproxy -vv
HAProxy version 2.8.12-quic_v43 2024/11/08 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2 2028.
Known bugs: http://www.haproxy.org/bugs/bugs-2.8.12.html
Running on: Linux 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020 x86_64
Build options :
TARGET = linux-glibc
CPU = native
CC = cc
CFLAGS = -Og -march=native -g -Wall -Wextra -Wundef -Wdeclaration-after-statement -Wfatal-errors -Wtype-limits -fwrapv -Wno-address-of-packed-member -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-clobbered -Wno-missing-field-initializers -Wno-cast-function-type -Wno-string-plus-int -Wno-atomic-alignment -DHAPROXY_TARGET_VERSION=280 -DTLS_TICKETS_NO=4
OPTIONS = USE_OPENSSL=1 USE_LUA=1 USE_ZLIB=1 USE_SYSTEMD=1 USE_QUIC=1 USE_STATIC_PCRE2=1 USE_PCRE2=1 USE_PCRE2_JIT=1
DEBUG = -DDEBUG_STRICT -DDEBUG_MEMORY_POOLS
Feature list : -51DEGREES +ACCEPT4 +BACKTRACE -CLOSEFROM +CPU_AFFINITY +CRYPT_H -DEVICEATLAS +DL -ENGINE +EPOLL -EVPORTS +GETADDRINFO -KQUEUE -LIBATOMIC +LIBCRYPT +LINUX_CAP +LINUX_SPLICE +LINUX_TPROXY +LUA +MATH -MEMORY_PROFILING +NETFILTER +NS -OBSOLETE_LINKER +OPENSSL -OPENSSL_WOLFSSL -OT -PCRE +PCRE2 +PCRE2_JIT -PCRE_JIT +POLL +PRCTL -PROCCTL -PROMEX -PTHREAD_EMULATION +QUIC -QUIC_OPENSSL_COMPAT +RT +SHM_OPEN -SLZ +SSL -STATIC_PCRE +STATIC_PCRE2 +SYSTEMD +TFO +THREAD +THREAD_DUMP +TPROXY -WURFL +ZLIB
Default settings :
bufsize = 16384, maxrewrite = 1024, maxpollevents = 200
Built with multi-threading support (MAX_TGROUPS=16, MAX_THREADS=256, default=48).
Built with OpenSSL version : OpenSSL 1.1.1w+quic 11 Sep 2023
Running on OpenSSL version : OpenSSL 1.1.1w+quic 11 Sep 2023
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.3.5
Built with network namespace support.
Built with Naver SSL Client Hello request capture. version: RB-1.1.2:29059
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE2 version : 10.43 2024-02-16
PCRE2 library supports JIT : yes
Encrypted password support via crypt(3): yes
Built with gcc compiler version 4.8.5 20150623 (Red Hat 4.8.5-39)
Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.
Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
quic : mode=HTTP side=FE mux=QUIC flags=HTX|NO_UPG|FRAMED
h2 : mode=HTTP side=FE|BE mux=H2 flags=HTX|HOL_RISK|NO_UPG
fcgi : mode=HTTP side=BE mux=FCGI flags=HTX|HOL_RISK|NO_UPG
<default> : mode=HTTP side=FE|BE mux=H1 flags=HTX
h1 : mode=HTTP side=FE|BE mux=H1 flags=HTX|NO_UPG
<default> : mode=TCP side=FE|BE mux=PASS flags=
none : mode=TCP side=FE|BE mux=PASS flags=NO_UPG
Available services : none
Available filters :
[BWLIM] bwlim-in
[BWLIM] bwlim-out
[CACHE] cache
[COMP] compression
[FCGI] fcgi-app
[SPOE] spoe
[TRACE] trace
Last Outputs and Backtraces
Additional Information
show fd
357360 : st=0x011922(cL HEopI W:sRa R:sRa) ref=0 gid=1 tmask=0x1000000000 umask=0x0 prmsk=0x0 pwmsk=0x0 owner=0x7fc4a7a8eb40 iocb=0x63928e(sock_conn_iocb) back=1 cflg=0x001c0300 cerr=0 fam=ipv4 lport=16876 sv=246507f2/srv1 mux=H1 ctx=0x7fc4cd88d390 h1c.flg=0x80000d00 .sub=0 .ibuf=0@(nil)+0/0 .obuf=0@(nil)+0/0 .task=0x7fc68a935320 .exp=<NEVER> h1s=0x16ccaa80 h1s.flg=0x4010 .sd.flg=0x5020a01 .req.state=MSG_DATA .res.state=MSG_RPBEFORE .meth=POST status=0 .sd.flg=0x05020a01 .sc.flg=0x00034407 .sc.app=0x7fc5ce40ba50 .subs=(nil) xprt=RAW
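To spot other file descriptors in the same state, a small filter over the `show fd` output may help. The following is only a sketch: `parse_fd_line` and `looks_stuck` are hypothetical helper names, the sample is a shortened copy of the line above, and the match criteria (backend side, mux timer at `<NEVER>`, request in MSG_DATA, response in MSG_RPBEFORE) are an assumption taken from this one dump.

```python
import re

# Shortened copy of the `show fd` line from this report (some fields dropped).
SAMPLE_FD_LINE = (
    "357360 : st=0x011922(cL HEopI W:sRa R:sRa) ref=0 gid=1 back=1 "
    "sv=246507f2/srv1 mux=H1 h1c.flg=0x80000d00 .task=0x7fc68a935320 "
    ".exp=<NEVER> .req.state=MSG_DATA .res.state=MSG_RPBEFORE .meth=POST status=0"
)


def parse_fd_line(line):
    """Split one `show fd` line into (fd, fields); fields maps key -> value."""
    fd_text, _, rest = line.partition(" : ")
    fields = {}
    for key, value in re.findall(r"(\S+?)=(\S+)", rest):
        # Some keys repeat (for example .sd.flg); keep the first occurrence.
        fields.setdefault(key, value)
    return int(fd_text), fields


def looks_stuck(fields):
    """Assumed signature of the stuck state, derived only from this report."""
    return (
        fields.get("back") == "1"
        and fields.get(".exp") == "<NEVER>"
        and fields.get(".req.state") == "MSG_DATA"
        and fields.get(".res.state") == "MSG_RPBEFORE"
    )


fd, fields = parse_fd_line(SAMPLE_FD_LINE)
if looks_stuck(fields):
    print(f"fd {fd} on server {fields.get('sv')} matches the stuck pattern")
```

A healthy POST that is still uploading could match these fields for a short time, so a match is only a hint and should be cross-checked against the session age.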
show sess 0x7fc5ce40ba50
0x7fc5ce40ba50: [04/Apr/2025:11:26:40.806171] id=398211855 proto=tcpv4 source=61.99.76.19:32808
  flags=0x83580a, conn_retries=0, conn_exp=<NEVER> conn_et=0x200 srv_conn=(nil), pend_pos=(nil) waiting=0 epoch=0
  frontend=http (id=14 mode=http), listener=? (id=11) addr=10.10.10.10:443
  backend=246507f2 (id=20 mode=http) addr=10.107.57.150:16876
  server=srv1 (id=2) addr=10.200.200.1:80
  task=0x7fc5d203a420 (state=0x00 nice=0 calls=52120 rate=1 exp=3s tid=36(1/36) age=3d24m)
  txn=0x7fc36ed6c240 flags=0x40000 meth=3 status=502 req.st=MSG_DATA rsp.st=MSG_RPBEFORE req.f=0x0d rsp.f=0x00
  scf=0x7fc3f4a179e0 flags=0x0001c000 state=EST endp=CONN,0x7fc3f63366c0,0x042a1001 sub=0 rex=<NEVER> wex=<NEVER>
      h2s=0x7fc3f63366c0 h2s.id=15 .st=CLO .flg=0x4105 .rxbuf=32768@0x7fc2b9ff28f0+0/32768
      .sc=0x7fc3f4a179e0(.flg=0x0001c000 .app=0x7fc5ce40ba50) .sd=0x7fc3f63367e0(.flg=0x042a1001) .subs=(nil)
      h2c=0x7fc5cf439aa0 h2c.st0=FRH .err=0 .maxid=19 .lastid=-1 .flg=0x70200 .nbst=0 .nbsc=1, .glitches=0 .fctl_cnt=0 .send_cnt=0 .tree_cnt=1 .orph_cnt=0 .sub=0 .dsi=0 .dbuf=0@(nil)+0/0
      .mbuf=[1..1|32],h=[0@(nil)+0/0],t=[0@(nil)+0/0] .task=0x7fc75e010d20 .exp=<NEVER>
      co0=0x7fc4b94c6c80 ctrl=tcpv4 xprt=SSL mux=H2 data=STRM target=LISTENER:0x219f9e0
      flags=0x80040300 fd=39490 fd.state=122 updt=0 fd.tmask=0x1000000000
  scb=0x7fc3f4a17b20 flags=0x00034407 state=CLO endp=CONN,0x16ccaa80,0x05020a01 sub=0 rex=<NEVER> wex=<NEVER>
      h1s=0x16ccaa80 h1s.flg=0x4010 .sd.flg=0x5020a01 .req.state=MSG_DATA .res.state=MSG_RPBEFORE
      .meth=POST status=0 .sd.flg=0x05020a01 .sc.flg=0x00034407 .sc.app=0x7fc5ce40ba50 .subs=(nil)
      h1c=0x7fc4cd88d390 h1c.flg=0x80000d00 .sub=0 .ibuf=0@(nil)+0/0 .obuf=0@(nil)+0/0 .task=0x7fc68a935320 .exp=<NEVER>
      co1=0x7fc4a7a8eb40 ctrl=tcpv4 xprt=RAW mux=H1 data=STRM target=SERVER:0x14001f40
      flags=0x001c0300 fd=357360 fd.state=11922 updt=0 fd.tmask=0x1000000000
  req=0x7fc5ce40ba70 (f=0x848000 an=0x0 pipe=0 tofwd=0 total=1887)
      an_exp=3s buf=0x7fc5ce40ba78 data=(nil) o=0 p=0 i=0 size=0
      htx=0xa16ca0 flags=0x0 size=0 data=0 used=0 wrap=NO extra=0
  res=0x7fc5ce40bac0 (f=0x80008000 an=0x0 pipe=0 tofwd=0 total=5618)
      an_exp=<NEVER> buf=0x7fc5ce40bac8 data=0x7fc2d1725580 o=5618 p=5618 i=27150 size=32768
      htx=0x7fc2d1725580 flags=0x18 size=32720 data=5618 used=6 wrap=NO extra=0
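The fields that make this session look like a zombie are its age (3d24m), the 502 status, and rex/wex being `<NEVER>` on both stream connectors. A rough way to pull those out of a `show sess <id>` dump is sketched below. The helper names and the one-hour default threshold are assumptions for illustration, and the parsing relies only on the field layout visible in this dump.

```python
import re

_UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def age_to_seconds(age):
    """Convert an age such as '3d24m' into seconds."""
    return sum(
        int(count) * _UNIT_SECONDS[unit]
        for count, unit in re.findall(r"(\d+)([dhms])", age)
    )


def summarize_session(dump):
    """Extract age, transaction status and I/O timers from one session dump."""
    age = re.search(r"\bage=([0-9dhms]+)", dump)
    # The first status= in the dump is the transaction status (txn line).
    status = re.search(r"\bstatus=(\d+)", dump)
    timers = re.findall(r"\brex=(\S+) wex=(\S+)", dump)
    return {
        "age_seconds": age_to_seconds(age.group(1)) if age else None,
        "status": int(status.group(1)) if status else None,
        "no_io_timeout": bool(timers)
        and all(rex == wex == "<NEVER>" for rex, wex in timers),
    }


def is_zombie_candidate(summary, max_age_seconds=3600):
    """Old session with no read/write expiration armed (threshold is arbitrary)."""
    age = summary["age_seconds"]
    return age is not None and age > max_age_seconds and summary["no_io_timeout"]
```

For the dump above, `age_to_seconds("3d24m")` gives 260640 seconds, far beyond the 65s client and server timeouts in the configuration.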
Could you test 2.8.14, please? There is a regression in 2.8.12 that may explain your issue. It was fixed by the following commit, shipped in 2.8.14:
commit 67f67566335b3ebdb33ba68f3baed5c4dbae8c9f
Author: Christopher Faulet
Date: Thu Jan 2 11:24:06 2025 +0100
BUG/MEDIUM: stconn: Really report blocked send if sends are blocked by an error
A regression was introduced in commit 5ec25102ed ("BUG/MEDIUM: stconn:
Report blocked send if sends are blocked by an error") when the patch was
backported. Instead of checking if some outgoing data were blocked, the
opposite was performed. The write timeout must be armed is the channel IS
NOT empty.
It should fix the issue #2754 for the 2.8. It is 2.8-specific. there is no
upstream commit ID and no backport is needed.
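The inverted condition described in the commit message can be pictured with a toy model. This is not HAProxy's C code: the function names, parameters and the two-input simplification are all hypothetical. It only illustrates that, with the check reversed, a stream whose sends are blocked by an error while data is still pending would never get a write timeout.

```python
def arm_write_timeout_intended(send_blocked_by_error, pending_output_bytes):
    """Behaviour the commit message asks for: arm when the channel IS NOT empty."""
    return send_blocked_by_error and pending_output_bytes > 0


def arm_write_timeout_regressed(send_blocked_by_error, pending_output_bytes):
    """Inverted check, as the commit message says the faulty backport behaved."""
    return send_blocked_by_error and pending_output_bytes == 0


def truth_table():
    """Return (blocked, pending, intended, regressed) for every input combination."""
    rows = []
    for blocked in (False, True):
        for pending in (0, 5618):  # 5618 is just a non-zero example value
            rows.append(
                (
                    blocked,
                    pending,
                    arm_write_timeout_intended(blocked, pending),
                    arm_write_timeout_regressed(blocked, pending),
                )
            )
    return rows


for row in truth_table():
    print(row)
```

In this model the two versions disagree exactly when sends are blocked: the regressed check arms the timer only when there is nothing left to send.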
@capflam Actually I have no way of reproducing it.
I wonder whether this issue can also cause high CPU usage. CPU usage rose steeply when the problem occurred. After the reload it went back down, but the session is still stuck.