httpd icon indicating copy to clipboard operation
httpd copied to clipboard

Backport to 2.4.x: mod_h2 to return c1 to mpm_event

Open ylavic opened this issue 1 year ago • 21 comments

*) mod_http2: sync with github module. Add async handling on newer httpd for blocked connections, fix yield handling. Trunk version of patch: https://svn.apache.org/r1918003 https://svn.apache.org/r1918022 https://svn.apache.org/r1918035 https://svn.apache.org/r1918078 https://svn.apache.org/r1918098 https://svn.apache.org/r1918099 https://svn.apache.org/r1918257 https://svn.apache.org/r1918482 https://svn.apache.org/r1918483 https://svn.apache.org/r1918491

ylavic avatar May 28 '24 14:05 ylavic

I'll check this out tomorrow.

icing avatar May 28 '24 14:05 icing

This works excellent. I added some check defines where we can adapt the MPM magic once we have the value for the backport:

diff --git a/modules/http2/h2.h b/modules/http2/h2.h
index f496a6dcb2..ae0e845704 100644
--- a/modules/http2/h2.h
+++ b/modules/http2/h2.h
@@ -49,6 +49,12 @@ struct h2_stream;
 #define H2_USE_WEBSOCKETS       0
 #endif

+#if AP_MODULE_MAGIC_AT_LEAST(20211221, 19)
+#define H2_USE_STATE_PROCESS       1
+#else
+#define H2_USE_STATE_PROCESS       0
+#endif
+
 /**
  * The magic PRIamble of RFC 7540 that is always sent when starting
  * a h2 communication.
diff --git a/modules/http2/h2_c1.c b/modules/http2/h2_c1.c
index 0a3e8395ad..6ba6dffb1b 100644
--- a/modules/http2/h2_c1.c
+++ b/modules/http2/h2_c1.c
@@ -161,7 +161,11 @@ apr_status_t h2_c1_run(conn_rec *c)
                      * the Timeout behaviour instead of a KeepAliveTimeout
                      * See PR 63534.
                      */
+#if H2_USE_STATE_PROCESS
                     c->cs->state = CONN_STATE_PROCESS;
+#else
+                    c->cs->state = CONN_STATE_WRITE_COMPLETION;
+#endif
                     c->cs->sense = CONN_SENSE_WANT_READ;
                 }
                 break;
diff --git a/modules/http2/h2_session.c b/modules/http2/h2_session.c
index 5add2ad2c1..e5809d59ac 100644
--- a/modules/http2/h2_session.c
+++ b/modules/http2/h2_session.c
@@ -1762,7 +1762,7 @@ static void unblock_c1_out(h2_session *session) {
     }
 }

-#if AP_MODULE_MAGIC_AT_LEAST(20211221, 19)
+#if H2_USE_STATE_PROCESS
 static int h2_send_flow_blocked(h2_session *session)
 {
     /* We are completely send blocked if either the connection window
@@ -1954,7 +1954,7 @@ apr_status_t h2_session_process(h2_session *session, int async,
                     break;
                 }
             }
-#if AP_MODULE_MAGIC_AT_LEAST(20211221, 19)
+#if H2_USE_STATE_PROCESS
             else if (async && h2_send_flow_blocked(session)) {
                 /* On a recent HTTPD, we can return to mpm c1 monitoring,
                  * as it does not treat all connections as having KeepAlive

I tested this with overwriting the definition using the following testcase in test_700_load_get.py:

    # test window sizes, connection and stream
    @pytest.mark.parametrize("connbits,streambits", [
        [10, 16],  # 1k connection window, 64k stream windows
        [10, 30],  # 1k connection window, huge stream windows
        [30, 8],  # huge conn window, 256 bytes stream windows
    ])
    def test_h2_700_20(self, env, connbits, streambits):
        assert env.is_live()
        n = 2000
        conns = 100
        parallel = 10
        args = [
            env.h2load,
            '-n', f'{n}', '-t', '1',
            '-c', f'{conns}', '-m', f'{parallel}',
            '-W', f'{connbits}',  # connection window bits
            '-w', f'{streambits}',  # stream window bits
            f'--connect-to=localhost:{env.https_port}',
            f'--base-uri={env.mkurl("https", "test1", "/")}',
            "/data-100k"
        ]
        r = env.run(args)
        self.check_h2load_ok(env, r, n)

This test fails quite reliably on a plain 2.4.x and succeeds with the changes in this PR.

icing avatar May 29 '24 07:05 icing

This works excellent. I added some check defines where we can adapt the MPM magic once we have the value for the backport:

Great! So I bumped the MMN minor in trunk for the new CONN_STATE_'s and added your change, backported here now so we should be good ;)

ylavic avatar May 29 '24 09:05 ylavic

Hm, something (related?) broke in the H2 tests (https://github.com/apache/httpd/actions/runs/9283938584/job/25545210350)..

ylavic avatar May 29 '24 10:05 ylavic

Hm, something (related?) broke in the H2 tests (https://github.com/apache/httpd/actions/runs/9283938584/job/25545210350)..

Hmm, is this a return to the MPM during a graceful shutdown? Does it then close a connection in PROCESS state?

The test case does a graceful restart of the server during a download...

icing avatar May 29 '24 10:05 icing

In case this is too complicated to "simply" fix in the mpm, I can make mod_http2 refrain from returning in case the server is stopping. There'd still be an edge case where the server gracefully shuts down with a connection being in processing already, but...

icing avatar May 30 '24 06:05 icing

@ylavic if mpm_event cannot handle it, how about this:

diff --git a/modules/http2/h2_session.c b/modules/http2/h2_session.c
index e5809d59ac..7618b64f77 100644
--- a/modules/http2/h2_session.c
+++ b/modules/http2/h2_session.c
@@ -1536,6 +1536,7 @@ static void h2_session_ev_ngh2_done(h2_session *session, int arg, const char *ms

 static void h2_session_ev_mpm_stopping(h2_session *session, int arg, const char *msg)
 {
+    session->stopping = TRUE;
     switch (session->state) {
         case H2_SESSION_ST_DONE:
             /* nop */
@@ -1955,12 +1956,15 @@ apr_status_t h2_session_process(h2_session *session, int async,
                 }
             }
 #if H2_USE_STATE_PROCESS
-            else if (async && h2_send_flow_blocked(session)) {
+            else if (async && !session->stopping && h2_send_flow_blocked(session)) {
                 /* On a recent HTTPD, we can return to mpm c1 monitoring,
                  * as it does not treat all connections as having KeepAlive
                  * timing and being purgeable on load.
                  * By returning to the MPM, we do not block a worker
-                 * and async wait for the client send window updates. */
+                 * and async wait for the client send window updates.
+                 * We would also like to do this during a graceful shutdown,
+                 * but the MPM event implementation currentyl does not support
+                 * this. */
                 ap_log_cerror(APLOG_MARK, APLOG_DEBUG, status, c,
                               H2_SSSN_LOG(APLOGNO(10502), session,
                               "BLOCKED, return to mpm c1 monitoring"));
diff --git a/modules/http2/h2_session.h b/modules/http2/h2_session.h
index 2c8f334cce..1d1e334b1c 100644
--- a/modules/http2/h2_session.h
+++ b/modules/http2/h2_session.h
@@ -84,6 +84,7 @@ typedef struct h2_session {

     unsigned int reprioritize  : 1; /* scheduled streams priority changed */
     unsigned int flush         : 1; /* flushing output necessary */
+    unsigned int stopping      : 1; /* MPM is stopping gracefully */
     apr_interval_time_t  wait_us;   /* timeout during BUSY_WAIT state, micro secs */

     struct h2_push_diary *push_diary; /* remember pushes, avoid duplicates */

icing avatar May 31 '24 07:05 icing

Sorry was busy$, I don't see why mpm_event would kill connections in PROCESS state. Let me try to reproduce..

ylavic avatar May 31 '24 10:05 ylavic

Isn't it possible that the connection is rather returned to the MPM in WRITE_COMPLETION=>KEEPALIVE state (not a zero-window test/case it seems)? Though in this case it should behave the same as before, was the test stable/reliable before this change? Also the test in trunk does not fail, but it might be a matter running it multiple times maybe, because 2.4.x is no different in this regard..

ylavic avatar May 31 '24 10:05 ylavic

So I could not reproduce on my laptop even with things like:

diff --git a/test/modules/http2/test_106_shutdown.py b/test/modules/http2/test_106_shutdown.py
index 83e143cef5..7cc6605cbb 100644
--- a/test/modules/http2/test_106_shutdown.py
+++ b/test/modules/http2/test_106_shutdown.py
@@ -54,7 +54,7 @@ class TestShutdown:
         # Create a low limit and only 2 children, so we'll encounter this easily
         conf = H2Conf(env, extras={
             'base': [
-                "ServerLimit 2",
+                "ServerLimit 1",
                 "MaxRequestsPerChild 3"
             ]
         })
@@ -62,11 +62,11 @@ class TestShutdown:
         conf.install()
         assert env.apache_restart() == 0
         url = env.mkurl("https", "test1", "/index.html")
-        for i in range(7):
+        for i in range(100):
             r = env.curl_get(url, options=['-v'])
             # requests should succeed, but rarely connections get closed
             # before the response is received
-            if r.exit_code in [16, 55]:
+            if False and r.exit_code in [16, 55]:
                 # curl send error
                 assert r.response is None
             else:

which has more chances to find the single child leaving at a bad timing.

But I think I see how this can happen. In the test_h2_106_02 case the IDLE c1 connection is returned to the MPM for WRITE_COMPLETION, thus it transitions to KEEPALIVE after the write is complete (or if there is nothing to write) and there it can be killed by the MPM on graceful restart. If a new request arrives in between the listener checking for readability and the kill this is the typical keepalive race condition. I don't think we should kill kept alive connections since they are not consuming much, instead we should set c->keepalive = AP_CONN_CLOSE before processing them the next time so that it's the last time, and let that go gracefully (KeepAliveTimeout is usually small like 5s max, so it shouldn't take to much to much to actually stop because of this). This is what is done in event WIP #294, I could try to upstream that part too..

ylavic avatar May 31 '24 15:05 ylavic

I don't think we should kill kept alive connections since they are not consuming much, instead we should set c->keepalive = AP_CONN_CLOSE before processing them the next time so that it's the last time,

Note that mod_h2 will have to do something to avoid the race condition still, like gracefully stopping the streams and everything. With HTTP/1 it's easier because the "last time" is a single transaction..

ylavic avatar May 31 '24 16:05 ylavic

Not sure if we could/should change mod_status output in 2.4, the options are:

  1. change the output like currently in this PR
  2. mpm_event to count for CONN_STATE_PROCESSING connexions in ps->write_completion
  3. same implicitely by not backporting CONN_STATE_PROCESSING either, i.e. use the WRITE_COMPLETION+clogged workaround

Thoughts?

Also, trunk (hence this PR) do not yet preserve keepalive connections on load (per my previous comment), thoughts on this too?

ylavic avatar Jun 18 '24 15:06 ylavic

  1. change the output like currently in this PR +1

-/-

    CONN_STATE_PROCESSING,          /* Processed by process_connection() hooks, returning
                                     * AGAIN to the MPM in this state will make it wait for
                                     * the connection to be readable or writable according to
                                     * CONN_SENSE_WANT_READ or CONN_SENSE_WANT_WRITE respectively,
                                     * where Timeout applies */

Minor not, it seems easy to misunderstand CONN_STATE_PROCESSING 180 degrees from name alone. Could we capture something more like the comment like CONN_STATE_WAITIO (don't love this either but hopefully it gets the point across)

covener avatar Jun 18 '24 15:06 covener

Minor not, it seems easy to misunderstand CONN_STATE_PROCESSING 180 degrees from name alone. Could we capture something more like the comment like CONN_STATE_WAITIO (don't love this either but hopefully it gets the point across)

Yes good point, the issue I think is that CONN_STATE_PROCESSING is both for when the connection is processed by the modules (set by MPM event before calling ap_run_process_connection()) and for when modules want to wait for IOs in the MPM. So maybe we want to keep CONN_STATE_PROCESSING for the former case only (it wouldn't be shown in mod_status like CONN_STATE_READ_REQUEST_LINE wasn't before) and have a new CONN_STATE_WAITIO or CONN_STATE_POLLED or CONN_STATE_ASYNC_WAITIO (_WAITIO does not look that bad to me btw) which mod_status accounts in its "asyncs"?

ylavic avatar Jun 18 '24 15:06 ylavic

All are fine. I like CONN_STATE_ASYNC_WAITIO best.

icing avatar Jun 18 '24 16:06 icing

That explains the naming! I think two states is best and the explicit ASYNC is makes sense too.

covener avatar Jun 18 '24 16:06 covener

Excellent work, @ylavic ! Looking forward to get that merged.🎉

icing avatar Jun 21 '24 13:06 icing

Thanks! Something to look at maybe though. Between d01f01829f145033dba9eb67c6ecc04a88f1db9c and aa643874dd6a9eb544c87509f3a44ec2da057297 the connection was erroneously closed when returned to MPM, and that seemed to cause a hang in the h2 tests using CONN_STATE_ASYNC_WAITIO. This is fixed by the last commit but I'm wondering what would happen if for instance the connection times out in the MPM (which would also cause an async close, though later). I didn't look at what happened exactly so I can't tell if the hang was in mod_h2 or in the test client. It just surprised me so maybe there is something to look at and/or cleanup from the pre_close_connection hook still should an async close happen, this is actually the ultimate callback for mod_h2 to forget about the h1 and potentially all the related h2s if needed..

ylavic avatar Jun 21 '24 14:06 ylavic

Wait, it seems the issue is in the lingering close itself..

ylavic avatar Jun 21 '24 14:06 ylavic

Which test may hang/timeout in the mpm? Any way to reproduce?

icing avatar Jun 21 '24 14:06 icing

All right, so it was just caused by d01f01829f145033dba9eb67c6ecc04a88f1db9c reaching ap_process_lingering_close() with an unexpected cs->pub.state == CONN_STATE_ASYNC_WAITIO, so the lingering close wasn't nonblocking anymore... It should not happen now after aa643874dd6a9eb544c87509f3a44ec2da057297 but I added the CONN_STATE_IS_LINGERING_CLOSE() macro in b40ccd9ab2fa0e8148b39dee7e44fd5f3f4c515b to avoid any further mistake like this. Should be good now ;)

ylavic avatar Jun 21 '24 15:06 ylavic

Backported in r1919548

ylavic avatar Jul 27 '24 14:07 ylavic