HAProxy fails to process some gRPC responses
Detailed Description of the Problem
HAProxy, placed between a gRPC client and a gRPC server, fails to process some server responses and terminates the H2 session. A session exhibiting the problem was captured with tcpdump: see real.dump.zip (port 3990 - HAProxy, port 3991 - gRPC server). The GOAWAY frame is unexpected.
Some debugging shows it was sent by this code from mux_h2.c:
if (b_full(&h2c->dbuf) && h2c->dfl) {
        /* too large frames */
        h2c_error(h2c, H2_ERR_INTERNAL_ERROR);
        ret = -1;
}
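As a minimal, self-contained illustration of the condition checked above (the types and numbers below are stand-ins, not HAProxy's real h2c/buffer structures): if the demux buffer is completely full while the frame currently being demuxed still has unread payload, the frame can never fit, so the mux flags H2_ERR_INTERNAL_ERROR and the connection is closed with a GOAWAY.

/* Illustrative toy model only; not HAProxy code. The 16384 matches
 * the "bufsize = 16384" default shown in the -vv output below. */
#include <stdio.h>
#include <stdbool.h>

#define DEMUX_BUF_SIZE 16384u

struct toy_h2c {
        unsigned dbuf_len; /* bytes currently stored in the demux buffer */
        unsigned dfl;      /* remaining length of the frame being demuxed */
};

/* Stand-in for b_full(): no room left in the demux buffer. */
static bool toy_b_full(const struct toy_h2c *h2c)
{
        return h2c->dbuf_len >= DEMUX_BUF_SIZE;
}

int main(void)
{
        /* Hypothetical state: the buffer is filled completely while the
         * current frame still has pending bytes (any non-zero value). */
        struct toy_h2c h2c = { .dbuf_len = DEMUX_BUF_SIZE, .dfl = 56 };

        if (toy_b_full(&h2c) && h2c.dfl) {
                /* Same condition as the mux_h2.c excerpt: the frame is
                 * treated as too large and the connection errors out. */
                printf("too large frame: would raise H2_ERR_INTERNAL_ERROR\n");
                return 1;
        }
        printf("frame fits, demux continues\n");
        return 0;
}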
Expected Behavior
The response from the gRPC server appears to be correct and should be handled by HAProxy.
Steps to Reproduce the Behavior
I tried to reproduce the problem with vtest. I did not get exactly the same behavior as in the dump, but I hit the same error from mux_h2.c:
if (b_full(&h2c->dbuf) && h2c->dfl) {
        /* too large frames */
        h2c_error(h2c, H2_ERR_INTERNAL_ERROR);
        ret = -1;
}
varnishtest "A test to reveal h2 buffer bug"
feature ignore_unknown_macro
server s1 {
    rxpri
    stream 0 {
        txsettings
        rxsettings
        txsettings -ack
        rxsettings
        expect settings.ack == true
    } -run
    stream 1 {
        rxreq
        txresp -noadd -idxHdr 8 -litHdr inc plain "content-type" plain "application/grpc" -litHdr inc plain "grpc-accept-encoding" plain "identity, deflate, gzip" -nostrend
        txdata -datalen 226 -nostrend
        txresp -noadd -litHdr inc plain "grpc-status" plain "0"
    } -run
    stream 3 {
        rxreq
        txresp -noadd -idxHdr 8 -idxHdr 64 -idxHdr 63 -nostrend
        txdata -datalen 15964 -nostrend
        txresp -noadd -idxHdr 62
    } -run
    stream 5 {
        rxreq
        txresp -noadd -idxHdr 8 -idxHdr 64 -idxHdr 63 -nostrend
        txdata -datalen 16384 -nostrend
        txresp -noadd -idxHdr 62
    } -run
} -start
haproxy h1 -conf {
    global
        log stderr local0 debug

    defaults
        log global
        mode http
        option httplog
        maxconn 1000
        timeout connect 5s
        timeout client 5s
        timeout server 5s

    frontend fe
        bind "fd@${fe}" proto h2
        default_backend be_servers

    backend be_servers
        server srv1 ${s1_addr}:${s1_port} proto h2
} -start
client c1 -connect ${h1_fe_sock} {
    txpri
    stream 0 {
        txsettings
        rxsettings
        txsettings -ack
        rxsettings
        expect settings.ack == true
    } -run
    stream 1 {
        txreq -method POST -url "/1111"
        rxhdrs
        expect resp.status == 200
        expect resp.http.content-type == "application/grpc"
        expect resp.http.grpc-accept-encoding == "identity, deflate, gzip"
        rxdata -some 1
        rxhdrs
        expect resp.http.grpc-status == 0
    } -run
    stream 3 {
        txreq -method POST -url "/2222"
        rxhdrs
        expect resp.status == 200
        expect resp.http.content-type == "application/grpc"
        expect resp.http.grpc-accept-encoding == "identity, deflate, gzip"
        rxdata -some 2
        rxhdrs
        expect resp.http.grpc-status == 0
    } -start
    stream 5 {
        txreq -method POST -url "/3333"
        rxhdrs
        expect resp.status == 200
        expect resp.http.content-type == "application/grpc"
        expect resp.http.grpc-accept-encoding == "identity, deflate, gzip"
        rxdata -some 2
        rxhdrs
        expect resp.http.grpc-status == 0
    } -start
} -run
Output is here: vtest_output.txt
Do you have any idea what may have caused this?
No response
Do you have an idea how to solve the issue?
No response
What is your configuration?
Linux on WSL2 (Windows Subsystem for Linux).
Output of haproxy -vv
HAProxy version 3.0-dev0 2023/12/05 - https://haproxy.org/
Status: development branch - not safe for use in production.
Known bugs: https://github.com/haproxy/haproxy/issues?q=is:issue+is:open
Running on: Linux 5.15.133.1-microsoft-standard-WSL2 #1 SMP Thu Oct 5 21:02:42 UTC 2023 x86_64
Build options :
TARGET = linux-glibc
CPU = generic
CC = cc
CFLAGS = -O2 -g -Wall -Wextra -Wundef -Wdeclaration-after-statement -Wfatal-errors -Wtype-limits -Wshift-negative-value -Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference -fwrapv -Wno-address-of-packed-member -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-clobbered -Wno-missing-field-initializers -Wno-cast-function-type -Wno-string-plus-int -Wno-atomic-alignment
OPTIONS = USE_OPENSSL= USE_LUA= USE_SYSTEMD= USE_PCRE=
DEBUG = -DDEBUG_STRICT -DDEBUG_MEMORY_POOLS
Feature list : -51DEGREES +ACCEPT4 +BACKTRACE -CLOSEFROM +CPU_AFFINITY +CRYPT_H -DEVICEATLAS +DL -ENGINE +EPOLL -EVPORTS +GETADDRINFO -KQUEUE -LIBATOMIC +LIBCRYPT +LINUX_CAP +LINUX_SPLICE +LINUX_TPROXY -LUA -MATH -MEMORY_PROFILING +NETFILTER +NS -OBSOLETE_LINKER -OPENSSL -OPENSSL_AWSLC -OPENSSL_WOLFSSL -OT -PCRE -PCRE2 -PCRE2_JIT -PCRE_JIT +POLL +PRCTL -PROCCTL -PROMEX -PTHREAD_EMULATION -QUIC -QUIC_OPENSSL_COMPAT +RT +SHM_OPEN +SLZ -SSL -STATIC_PCRE -STATIC_PCRE2 -SYSTEMD +TFO +THREAD +THREAD_DUMP +TPROXY -WURFL -ZLIB
Default settings :
bufsize = 16384, maxrewrite = 1024, maxpollevents = 200
Built with multi-threading support (MAX_TGROUPS=16, MAX_THREADS=256, default=8).
Built with network namespace support.
Built with libslz for stateless compression.
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built without PCRE or PCRE2 support (using libc's regex instead)
Encrypted password support via crypt(3): yes
Built with gcc compiler version 11.4.0
Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.
Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
h2 : mode=HTTP side=FE|BE mux=H2 flags=HTX|HOL_RISK|NO_UPG
fcgi : mode=HTTP side=BE mux=FCGI flags=HTX|HOL_RISK|NO_UPG
<default> : mode=HTTP side=FE|BE mux=H1 flags=HTX
h1 : mode=HTTP side=FE|BE mux=H1 flags=HTX|NO_UPG
<default> : mode=TCP side=FE|BE mux=PASS flags=
none : mode=TCP side=FE|BE mux=PASS flags=NO_UPG
Available services : none
Available filters :
[BWLIM] bwlim-in
[BWLIM] bwlim-out
[CACHE] cache
[COMP] compression
[FCGI] fcgi-app
[SPOE] spoe
[TRACE] trace
Last Outputs and Backtraces
No response
Additional Information
No response
Indeed, there is an issue with trailers parsing when some data are still present in the rxbuf. I suspect all stable versions are affected.
I pushed a fix, but I referenced the wrong issue in the commit message. Here is the fix: 682f73b4fa6d76aa0b5b743fe92777822884772d
We will release 2.9.1 tomorrow. Do you have any way to confirm that the above patch fixes your issue by testing 3.0-DEV?
Hello! Thank you for the fix. It resolves my issue.
Thanks for the feedback.
@capflam I don't know if it is related to the current issue or not, but I get strange behavior with trailing headers:
varnishtest "A test to reveal h2 buffer bug"
feature ignore_unknown_macro
server s1 {
    rxpri
    stream 0 {
        txsettings
        rxsettings
        txsettings -ack
        rxsettings
        expect settings.ack == true
    } -run
    stream 1 {
        rxreq
        txresp -nostrend
        txdata -datalen 16384 -nostrend
        txresp -noadd -litHdr inc plain "grpc-status" plain "0"
    } -run
} -start
haproxy h1 -conf {
    global
        log stderr local0 debug

    defaults
        log global
        mode http
        option httplog
        maxconn 1000
        timeout connect 5s
        timeout client 5s
        timeout server 5s

    frontend fe
        bind "fd@${fe}" proto h2
        default_backend be_servers

    backend be_servers
        server srv1 ${s1_addr}:${s1_port} proto h2
} -start
client c1 -connect ${h1_fe_sock} {
    txpri
    stream 0 {
        txsettings
        rxsettings
        txsettings -ack
        rxsettings
        expect settings.ack == true
    } -run
    stream 1 {
        txreq -method POST -url "/"
        rxhdrs
        expect resp.status == 200
        rxdata -some 2
        rxhdrs
        expect resp.http.grpc-status == 0
    } -run
} -run
The trailing headers are missing, but there is a DATA frame with size 0 and the END_STREAM flag:
*** c1 rx: stream: 1, type: DATA (0), flags: 0x00, size: 15256
*** c1 rx: stream: 1, type: DATA (0), flags: 0x00, size: 1072
** c1.1 === rxhdrs
*** c1 rx: stream: 1, type: DATA (0), flags: 0x01, size: 0
*** c1 flag: END_STREAM
**** c1.1 s1 - no data
Just to be sure, is this an issue with your vtest script, or something you observed with regular clients and servers on your platform?
But it is not related to the trailers and my fix. What happens here is that the server closes the connection quickly after sending the response, and HAProxy detects a connection error while trying to send a WINDOW_UPDATE frame to the server. In this case, the response processing is aborted. I don't know whether this is expected. The last 56 bytes, covering the payload and the trailers, were received and are waiting to be parsed. I must check with Willy. However, it is not a 2.9 regression.
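For illustration, here is a rough sketch of the ordering just described; every name below is hypothetical and it only models the sequence of events, not HAProxy's actual connection handling: the error surfaced while writing the WINDOW_UPDATE aborts processing before the already-received bytes (the last payload chunk and the trailers) are parsed, which matches the empty END_STREAM DATA frame seen in the log above.

/* Hypothetical event-ordering model; not HAProxy code. */
#include <stdio.h>
#include <stdbool.h>

struct toy_conn {
        unsigned pending_rx;  /* bytes received but not yet demuxed */
        bool     peer_closed; /* server closed right after its response */
        bool     error;       /* connection-level error flag */
};

/* Writing a WINDOW_UPDATE to a closed peer surfaces the error. */
static bool toy_send_window_update(struct toy_conn *c)
{
        if (c->peer_closed) {
                c->error = true;
                return false;
        }
        return true;
}

int main(void)
{
        /* 56 pending bytes: the last payload chunk plus the trailers. */
        struct toy_conn c = { .pending_rx = 56, .peer_closed = true };

        if (!toy_send_window_update(&c)) {
                /* Processing is aborted before the pending bytes are
                 * parsed, so the trailers never reach the client side. */
                printf("connection error: %u received bytes dropped unparsed\n",
                       c.pending_rx);
                return 1;
        }
        printf("window update sent; pending bytes will be parsed\n");
        return 0;
}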
Adding a barrier to synchronize the client and the server solves the issue with your vtest. You should also wait for 3 DATA frames on the client side:
varnishtest "A test to reveal h2 buffer bug"
feature ignore_unknown_macro
barrier b1 cond 2 -cyclic
server s1 {
    rxpri
    stream 0 {
        txsettings
        rxsettings
        txsettings -ack
        rxsettings
        expect settings.ack == true
    } -run
    stream 1 {
        rxreq
        txresp -nostrend
        txdata -datalen 16384 -nostrend
        txresp -noadd -litHdr inc plain "grpc-status" plain "0"
    } -run
    barrier b1 sync
} -start
haproxy h1 -conf {
    global
        log stderr local0 debug

    defaults
        log global
        mode http
        option httplog
        maxconn 1000
        timeout connect 5s
        timeout client 5s
        timeout server 5s

    frontend fe
        bind "fd@${fe}" proto h2
        default_backend be_servers

    backend be_servers
        server srv1 ${s1_addr}:${s1_port} proto h2
} -start
client c1 -connect ${h1_fe_sock} {
    txpri
    stream 0 {
        txsettings
        rxsettings
        txsettings -ack
        rxsettings
        expect settings.ack == true
    } -run
    stream 1 {
        txreq -method POST -url "/"
        rxhdrs
        expect resp.status == 200
        rxdata -some 3
        rxhdrs
        expect resp.http.grpc-status == 0
    } -run
    barrier b1 sync
} -run
Just to be sure, is this an issue with your vtest script, or something you observed with regular clients and servers on your platform?
There are no real incidents in production.
I must check with Willy
Should I create a separate GitHub issue for this problem?
Should I create a separate GitHub issue for this problem?
Well, for now, no. Just wait for Willy's answer on the subject. I will let you know. Thanks again. In the meantime, I will still release 2.9.1. It is quite an old problem, so there is no reason to delay the release, especially because it may be quite hard to fix.
I merged a series that should fix your last issue. It will not be backported immediately, to make sure it does not introduce regressions, and probably only to 2.9.