mod_h2
Infinite loop on Windows after graceful restart with proxied websockets
Hi @icing, I have a bug with mod_h2 and proxied websockets when restarting the service (gracefully, I suppose) on Windows. This problem reminds me of https://bz.apache.org/bugzilla/show_bug.cgi?id=65180 but is not fully identical.
httpd 2.4.58 (self-compiled, VS17 x64)
mod_h2 2.0.26
ProtocolsHonorOrder On
Protocols h2 h2c http/1.1 acme-tls/1
H2Upgrade On
H2Direct On
H2Push On
H2EarlyHints On
H2ModernTLSOnly On
H2WebSockets Off   # as unsupported on Windows
H2WindowSize 1460000000
H2StreamMaxMemSize 1460000000
H2MinWorkers 8
H2MaxWorkers 100
H2MaxSessionStreams 100
H2MaxDataFrameLen 0
- I have 3 vhosts configured (grafana, gitlab and mattermost) like this (with different ports...)
ProxyPass / http://127.0.0.1:7000/ upgrade=websocket nocanon retry=0 keepalive=On
ProxyPassReverse / http://127.0.0.1:7000/
Note: I did the test described below while deactivating the nocanon, retry=0 and keepalive=On properties one by one, but that doesn't change anything.
- I start httpd and open my browser tabs that require a websocket connection
- ✅ httpd fork is ~100% idle
- F12 (browser dev tools): 101 Switching Protocols for all of them, and I see the exchanges in the console:
wss://gitlab.mydomain.com/-/cable
wss://mattermost.mydomain.com/api/v4/websocket?connection_id=&sequence_number=0
wss://grafana.mydomain.com/api/live/ws
- I restart httpd (httpd -k restart) while still having my 3 tabs open (websocket connected)
- ⚠️ the newly created httpd fork sits at ~25% sys CPU and stays there
- I attach a debugger and find the 3 threads in an infinite loop with this stack:
ntdll!NtDeviceIoControlFile+0x14
MSWSOCK!MSAFD_WSPPoll+0x30e
MSWSOCK!WSPIoctl+0xed11()
WS2_32!WSAIoctl+0x1c7()
WS2_32!WSAPoll+0x22f
libapr_1!apr_winapi_WSAPoll+0x1b(pollfd * fdArray = 0x1778f464738) [C:\sdk\src\apr\include\arch\win32\apr_arch_misc.h @ 477]
libapr_1!impl_pollset_poll+0xc5(apr_pollset_t * pollset = 0x1778f4646c0, __int64 timeout = 0, int * num = 0x134bfffb78 : 0, apr_pollfd_t * * descriptors = 0x134bfffb60, apr_pollfd_t * * descriptors = 0x134bfffb60) [C:\sdk\src\apr\poll\unix\poll.c @ 263]
mod_http2!mplx_pollset_poll+0x136(h2_mplx * m = 0x1778f44c7d8, __int64 timeout = 0, void * on_ctx = 0x1778f44c160, void (__cdecl*)(void *,h2_stream *) on_stream_input = 0x44894c20244c894c : 0x44894c20244c894c, void (__cdecl*)(void *,h2_stream *) on_stream_output = 0x44894c20244c894c : 0x44894c20244c894c) [C:\sdk\src\mod_h2\mod_http2\h2_mplx.c @ 1210]
mod_http2!h2_mplx_c1_poll+0x40(h2_mplx * m = 0x1778f44c7d8) [C:\sdk\src\mod_h2\mod_http2\h2_mplx.c @ 637]
mod_http2!h2_session_process+0xce3(h2_session * session = 0x1778f44c160, int async = 610044232) [C:\sdk\src\mod_h2\mod_http2\h2_session.c @ 1911]
mod_http2!h2_c1_run+0xaf(conn_rec * c = 0x1778f5004d8) [C:\sdk\src\mod_h2\mod_http2\h2_c1.c @ 132]
mod_http2!h2_c1_hook_process_connection+0x4e3(conn_rec * c = 0x1778f5004d8) [C:\sdk\src\mod_h2\mod_http2\h2_c1.c @ 279]
libhttpd!ap_run_process_connection+0x35(conn_rec * c = 0x1778f5004d8) [C:\sdk\src\httpd\server\connection.c @ 42]
libhttpd!ap_process_connection+0x57(void * csd = 0x1778f5002c0) [C:\sdk\src\httpd\server\connection.c @ 217]
libhttpd!worker_main+0x3a8(void * thread_num_val = 0xc5) [C:\sdk\src\httpd\server\mpm\winnt\child.c @ 846]
KERNEL32!BaseThreadInitThunk+0x1d()
ntdll!RtlUserThreadStart+0x28
...and to be clear about it...
diff --git "a/mod_http2/h2_mplx.c" "b/mod_http2/h2_mplx.c"
index 9f9ce80..1823d5c 100644
--- "a/mod_http2/h2_mplx.c"
+++ "b/mod_http2/h2_mplx.c"
@@ -1207,6 +1207,7 @@ static apr_status_t mplx_pollset_poll(h2_mplx *m, apr_interval_time_t timeout,
H2_MPLX_LEAVE(m);
rv = apr_pollset_poll(m->pollset, timeout >= 0? timeout : -1, &nresults, &results);
+ ap_log_cerror(APLOG_MARK, APLOG_ERR, rv, m->c1, APLOGNO(10310) H2_MPLX_MSG(m, "###### "));
H2_MPLX_ENTER_ALWAYS(m);
if (APR_STATUS_IS_EINTR(rv) && m->shutdown) {
if (!m->aborted) {
Indeed, the log shows an infinite loop after a graceful restart.
So
- did I configure something wrong? (If you need other information about my configuration, I will give it to you)
- if we need to investigate further, tell me what to do
- ⚠️ starting fresh (httpd -k start) with the only change being H2Upgrade Off, I directly fall into the infinite loop
I am a bit confused. Does the browser make WebSocket connections over HTTP/2? It should not, if the server does not support it. Is this another request that produces the infinite loop? Strange.
Can you produce a log with LogLevel http2:trace2 on such a situation?
I am a bit confused
Same for me, because my big-picture understanding is that in this situation h2_switch.c would have to decline and forward to mod_proxy_wstunnel....
Does the browser make WebSocket connections over HTTP/2
no, over HTTP/1.1
2024-04-02 17:52:06.551 proxy-server "GET /grafana/api/live/ws HTTP/1.1" 101 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0" sI:2676 sO:->0:-:- sT:3657 tI:- tO:24711096 ka:0-
Can you produce a log with LogLevel http2:trace2 on such a situation?
I just have this relevant line:
2024-04-02 17:52:06.551648 grafana.mydomain.com 127.0.0.1:50694 ZgwppqjK1zgVxTggMqo8iAAAAMM http2:debug h2_switch.c(92) pid:13140 tid:4048 ka:0 [AH03085: upgrade without HTTP2-Settings declined]
full trace after a graceful restart: http2-trace2.txt
Hi @icing, continuing to dig into it…
To be clear:
- Disabling load_module h2 or changing Protocols h2 http/1.1 > Protocols http/1.1 h2
- ✅ All works fine in http/1.1
- Remove upgrade=websocket in ProxyPass
- ✅ All works fine in http2 except websocket that are not upgraded
But as you said, I'm not sure that the infinite loop originates from a gracefully restarted websocket connection! Since the server-status scoreboard doesn't show the tid in mpm_winnt mode (only the pid of the main process), and since I see no issue in the scoreboard nor any broken connection in the browser's network call stack, I suspect some old connection (from the previous httpd child process, before the graceful restart) is stuck
- either on the client side (browser <> httpd)
- or on the backend side (httpd <> [grafana|mattermost|gitlab]; no connection exception in the logs)
As it's difficult on Windows to trace across a "kill & re-spawn" of the child process, and as I'm not fluent with APR, would it be possible to give me a temporary patch (better than mine... easy ;) to trace which requests are stuck in the infinite loop? The idea is to quickly get a maximum of information (tid, request URI, rtime, etc.) once the loop has spun more than n times (like 1000 or whatever), and then exit.
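Something along these lines is what I have in mind, extending the one-line log from my earlier diff (a rough sketch only: the spin_polls counter, its placement and the 1000 threshold are illustrative additions and not mod_http2 code; the ap_log_cerror() call already records pid, tid and the client connection, which should be enough to identify the spinning session):
/* at file scope in h2_mplx.c: a crude, process-local counter (illustrative only) */
static apr_uint32_t spin_polls;

/* inside mplx_pollset_poll(), around the existing apr_pollset_poll() call */
H2_MPLX_LEAVE(m);
rv = apr_pollset_poll(m->pollset, timeout >= 0? timeout : -1, &nresults, &results);
if (timeout == 0 && ++spin_polls >= 1000) {
    /* zero-timeout polls are the ones the stack trace shows spinning */
    ap_log_cerror(APLOG_MARK, APLOG_ERR, rv, m->c1,
                  H2_MPLX_MSG(m, "###### 1000 zero-timeout polls since last report"));
    spin_polls = 0;
}
H2_MPLX_ENTER_ALWAYS(m);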
If mod_http2 is indeed polling in a loop, as your stacktrace indicates, something weird is going on. The trace is a call from h2_session.c:1911 where the session is in state H2_SESSION_ST_BUSY. When the poll returns, it calls h2_c1_read(session). That call should either get data from the client or move the session to another state.
I think we really need a log from such a case with LogLevel http2:debug to see what state changes the HTTP/2 session does.
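To restate that as a toy model (this is of course not the mod_http2 code, just the invariant spelled out): a BUSY loop that polls with a zero timeout is only safe if every pass either consumes input or moves the session to another state; if neither ever happens, you get exactly the spin your stack trace shows.
#include <stdio.h>

enum state { ST_BUSY, ST_IDLE };

int main(void)
{
    enum state st = ST_BUSY;
    int pending_input = 3;        /* pretend the client sent three chunks */
    unsigned long iterations = 0;

    while (st == ST_BUSY) {
        iterations++;
        if (pending_input > 0)
            pending_input--;      /* the "read" made progress */
        else
            st = ST_IDLE;         /* no input, no streams: must transition */
        /* if neither branch ever ran, iterations would grow without bound
           while the thread burns CPU on zero-timeout polls */
    }
    printf("left BUSY after %lu iterations\n", iterations);
    return 0;
}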
Hi @icing, I'm back on it.
Yesterday I enabled the http2:debug logs and reactivated the configuration Protocols h2 http/1.1 acme-tls/1
Surprise! No infinite loop has been triggered in the last 12 hours (and the usage/throughput is globally the same)
Originally, I only had to wait a few seconds after a graceful restart to see 1 to 4 threads in the infinite-loop state, judging by CPU usage
Changes since last month:
- I don't think that the debug logs inhibit the issue^^
- Updated the drivers of my network card (Intel 82599 10 Gbps: 4.1.251.0 > 4.1.254.0)
- The only notable change is that the Encapsulated Packet Task Offload property is now set to 0... I think it's unrelated
- nghttp2 1.60 > 1.62.1 🤔
I'll keep you posted...
Thanks for the update. This is a mysterious one indeed.
Hi @icing, I think I've caught the loop. Use case:
- start httpd
- launch browser tabs for mattermost, grafana, webmail & mainsite (4 different vhosts proxying different backends, except mainsite)
[AH03079: h2_session(10292-0,INIT,0): started on mattermost:443]
[AH03079: h2_session(10292-1,INIT,0): started on webmail:443]
[AH03079: h2_session(10292-2,INIT,0): started on grafana:443]
[AH03079: h2_session(10292-3,INIT,0): started on mainsite:443]
[AH03079: h2_session(10292-5,INIT,0): started on mattermost:443]
[AH03079: h2_session(10292-6,INIT,0): started on mattermost:443]
[AH03079: h2_session(10292-7,INIT,0): started on grafana:443]
[AH03079: h2_session(10292-8,INIT,0): started on mainsite:443]
- all are well exchanging in h2
- mattermost & grafana have a "101"-upgraded HTTP/1.1 websocket active
mattermost & grafana proxy config:
RequestHeader set X-Forwarded-Proto "https"
RequestHeader set X-Forwarded-Ssl "on"
ProxyPass / http://backend/ upgrade=websocket retry=0 keepalive=On timeout=10
ProxyPassReverse / http://backend/
- a few seconds later, a worker (Windows thread) is at 100% CPU for ~1 minute
- I wait for it to stop looping
- kill browser
- wait 5 minutes
- Server-status: the http/1.1 websocket workers are still in "W" Sending Reply
- Netstat: connections httpd <> mattermost are still in CLOSE_WAIT (a small illustration of why such half-closed sockets keep a zero-timeout poll busy follows after this list)
[httpd.exe] TCP httpd.ip:59709 mattermost.ip:8065 CLOSE_WAIT
[httpd.exe] TCP httpd.ip:59765 mattermost.ip:8065 CLOSE_WAIT
[httpd.exe] TCP httpd.ip:59795 mattermost.ip:8065 CLOSE_WAIT
[httpd.exe] TCP httpd.ip:59798 mattermost.ip:8065 CLOSE_WAIT
[httpd.exe] TCP httpd.ip:59799 mattermost.ip:8065 CLOSE_WAIT
[httpd.exe] TCP httpd.ip:59802 mattermost.ip:8065 CLOSE_WAIT
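For illustration (plain POSIX sockets for brevity, not the Windows/WSAPoll path from my stack trace, and of course not what mod_http2 actually does), here is a small self-contained sketch of why a connection whose peer has already closed keeps a zero-timeout poll loop spinning: the socket stays permanently readable until someone reads the EOF or removes the descriptor.
#include <poll.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sv[2];
    struct pollfd pfd;
    int i, ready = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return 1;
    close(sv[1]);                  /* the peer goes away: sv[0] is now half-closed */

    pfd.fd = sv[0];
    pfd.events = POLLIN;
    for (i = 0; i < 100000; i++) {
        if (poll(&pfd, 1, 0) > 0)  /* zero timeout, as in the stack trace */
            ready++;               /* readable every time: the EOF is still pending */
    }
    printf("dead socket reported ready %d/100000 times\n", ready);
    close(sv[0]);
    return 0;
}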
Here is the full 2-minute httpd-error.log
Hard to read for me but... we see the 4 websocket connections declined: [AH03085: upgrade without HTTP2-Settings declined]
So, a hypothesis that I can't verify... Is there any chance that a pointer to these declined connections is still referenced in the h2_session and produces a "busy" zombie?
I say this after reading the lines below, which correspond almost exactly to the moment when the infinite loop ended:
2024-05-28 08:02:22.967261 - client_ip:60500 - http2:debug h2_session.c(1393) pid:10292 tid:3784 ka:0 [AH03078: h2_session(10292-5,BUSY,0): transit [BUSY] -- input exhausted, no streams --> [IDLE]]
2024-05-28 08:02:23.429743 - client_ip:60500 - http2:debug h2_session.c(1393) pid:10292 tid:3784 ka:0 [AH03078: h2_session(10292-5,BUSY,0): transit [BUSY] -- input exhausted, no streams --> [IDLE]]
2024-05-28 08:02:23.451635 - client_ip:60500 - http2:debug h2_session.c(1393) pid:10292 tid:3784 ka:0 [AH03078: h2_session(10292-5,BUSY,0): transit [BUSY] -- input exhausted, no streams --> [IDLE]]
2024-05-28 08:02:23.527443 - client_ip:60500 - http2:debug h2_session.c(1393) pid:10292 tid:3784 ka:0 [AH03078: h2_session(10292-5,BUSY,0): transit [BUSY] -- input exhausted, no streams --> [IDLE]]
That's the context. If you need more traces, information... just tell me