
EPOLLHUP or EPOLLRDHUP during connection establishment leads to a wrong decision that a backend is down

Open gbrdead opened this issue 3 years ago • 51 comments

Detailed Description of the Problem

Sometimes (quite frequently under load) haproxy marks a backend as being down even though the liveness check (httpchk) is successful.

Expected Behavior

Backends should not be marked as down if they aren't really down.

Steps to Reproduce the Behavior

It is easy for me to reproduce it - just start about 10 simultaneous curl clients to pass requests continuously through haproxy at localhost. I have the feeling that this won't be easy for everybody, though...

Do you have any idea what may have caused this?

  1. After connect() is called on the socket that will be used for the liveness check, the socket file descriptor is added to an epoll descriptor (see the sketch after this list). Occasionally (but frequently enough under load) epoll_wait() returns the following two events for the socket fd before the socket gets connected: first EPOLLIN|EPOLLHUP|EPOLLRDHUP, then EPOLLIN|EPOLLRDHUP. These two events are always returned by the same call to epoll_wait() and always in this order (as witnessed by strace).

  2. As a result of EPOLLHUP, _do_poll() in ev_epoll.c calls fd_update_events() with FD_EV_SHUT_W set. In the latter, the condition below the following comment: /* SHUTW reported while FD was active for writes is an error */ is true, and FD_POLL_ERR gets set in fdtab[fd].state. This flag never gets reset.

  3. I have no idea what this EPOLLHUP means for a not-yet-connected socket, but there is a comment in sock_conn_check() in sock.c that suggests this is to be expected. The comment is long and starts with "/* Here we have 2 cases :". Since FD_POLL_ERR is set, connect() gets called again; it returns 0 and the connection to the backend is established.

  4. The liveness check proceeds as usual - the request is sent and the response is received (as witnessed by strace).

  5. Eventually, raw_sock_to_buf() in raw_sock.c gets called. Check its very end, after the label read0, and remember that FD_POLL_ERR is still set in fdtab[conn->handle.fd].state. conn->flags gets CO_FL_ERROR set, which causes the liveness check to be considered failed.
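
Below is a minimal sketch, in generic C rather than haproxy code, of the connect-then-epoll sequence from step 1 during which the spurious wakeups were observed; the helper name and event mask are illustrative assumptions.

#include <fcntl.h>
#include <sys/epoll.h>
#include <sys/socket.h>

/* Register a freshly started non-blocking connect() with epoll; this is
 * the point at which the EPOLLIN|EPOLLHUP|EPOLLRDHUP wakeups were seen. */
static int watch_connect(int ep, int fd, const struct sockaddr *sa, socklen_t salen)
{
    struct epoll_event ev;

    ev.events  = EPOLLIN | EPOLLOUT | EPOLLRDHUP;  /* wait for connect completion */
    ev.data.fd = fd;

    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK);
    connect(fd, sa, salen);               /* expected: -1 with errno == EINPROGRESS */
    return epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);
}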

Do you have an idea how to solve the issue?

I suspect that the trouble comes from FD_POLL_ERR not being reset before the second call to connect(). If I add the following line in sock_conn_check(), right after the comment "/* error present, fall through common error check path */":

    fdtab[fd].state &= ~(FD_POLL_ERR|FD_POLL_HUP);

the issue seems to be solved.

What is your configuration?

global
    maxconn 50000

defaults
    log stdout local0 notice
    mode tcp
    maxconn 50000
    option redispatch
    option tcpka
    balance roundrobin
    default-server inter 2s fall 2 rise 2
    timeout connect 5s
    timeout client 50s
    timeout server 50s
    timeout tunnel 12h
    timeout http-keep-alive 1s
    timeout http-request 15s
    timeout queue 30s
    timeout tarpit 60s

backend bbb
    mode http
    option httpchk GET /liveness
    server ...

Output of haproxy -vv

HAProxy version 2.6.5-987a4e2 2022/09/03 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2 2027.
Known bugs: http://www.haproxy.org/bugs/bugs-2.6.5.html
Running on: Linux 5.4.0-122-generic #138~18.04.1-Ubuntu SMP Fri Jun 24 14:14:03 UTC 2022 x86_64
Build options :
  TARGET  = linux-glibc
  CPU     = generic
  CC      = cc
  CFLAGS  = -O2 -g -Wall -Wextra -Wundef -Wdeclaration-after-statement -Wfatal-errors -Wtype-limits -Wshift-negative-value -Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference -fwrapv -Wno-address-of-packed-member -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-clobbered -Wno-missing-field-initializers -Wno-cast-function-type -Wno-string-plus-int -Wno-atomic-alignment
  OPTIONS = USE_PCRE2=1 USE_PCRE2_JIT=1 USE_STATIC_PCRE2=1 USE_OPENSSL=1 USE_LUA=1 USE_ZLIB=1 USE_DL=1
  DEBUG   = -DDEBUG_STRICT -DDEBUG_MEMORY_POOLS

Feature list : +EPOLL -KQUEUE +NETFILTER -PCRE -PCRE_JIT +PCRE2 +PCRE2_JIT +POLL +THREAD +BACKTRACE -STATIC_PCRE +STATIC_PCRE2 +TPROXY +LINUX_TPROXY +LINUX_SPLICE +LIBCRYPT +CRYPT_H -ENGINE +GETADDRINFO +OPENSSL +LUA +ACCEPT4 -CLOSEFROM +ZLIB -SLZ +CPU_AFFINITY +TFO +NS +DL +RT -DEVICEATLAS -51DEGREES -WURFL -SYSTEMD -OBSOLETE_LINKER +PRCTL -PROCCTL +THREAD_DUMP -EVPORTS -OT -QUIC -PROMEX -MEMORY_PROFILING

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_THREADS=64, default=2).
Built with OpenSSL version : OpenSSL 1.1.1  11 Sep 2018
Running on OpenSSL version : OpenSSL 1.1.1  11 Sep 2018
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.4.4
Built with network namespace support.
Support for malloc_trim() is enabled.
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE2 version : 10.40 2022-04-14
PCRE2 library supports JIT : yes
Encrypted password support via crypt(3): yes
Built with gcc compiler version 7.5.0

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
         h2 : mode=HTTP  side=FE|BE  mux=H2    flags=HTX|HOL_RISK|NO_UPG
       fcgi : mode=HTTP  side=BE     mux=FCGI  flags=HTX|HOL_RISK|NO_UPG
  <default> : mode=HTTP  side=FE|BE  mux=H1    flags=HTX
         h1 : mode=HTTP  side=FE|BE  mux=H1    flags=HTX|NO_UPG
  <default> : mode=TCP   side=FE|BE  mux=PASS  flags=
       none : mode=TCP   side=FE|BE  mux=PASS  flags=NO_UPG

Available services : none

Available filters :
        [CACHE] cache
        [COMP] compression
        [FCGI] fcgi-app
        [SPOE] spoe
        [TRACE] trace

Last Outputs and Backtraces

[WARNING]  (...) : Server bbb is DOWN, reason: Socket error, check duration: 2ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
...
[WARNING]  (...) : Backup Server bbb is DOWN, reason: Socket error, check duration: 2ms. 1 active and 1 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.

Additional Information

No response

gbrdead avatar Sep 13 '22 15:09 gbrdead

Thanks for your report, will have a look at it ASAP, sorry for the delay.

wtarreau avatar Sep 16 '22 08:09 wtarreau

So a few points on this. First, if you get EPOLLHUP, it means the server closed, possibly with a response but not necessarily. Since it happened in response to a connect() and the FD was being polled for writes, it indicates the connect() call didn't succeed, so the request wasn't sent. As such, the response you're getting there is not the response to the health check but a failure reported by the server (or the server simply closed because it had too many connections).

Regardless, haproxy should not return a misleading error here. It should accurately report what happened (and possibly any pending response the server might have sent, which could contain precious information about what the server was not happy with).

Your deep analysis of the sequence is particularly useful because, as you saw, it shows we're abusing FD_POLL_ERR. But that flag must not be cleared (even though doing so in your test was useful to confirm the cause), because a socket cannot be only temporarily in error: once in error, it's dead.

For me the problem here is that POLL_HUP was mapped to FD_EV_SHUT_RW since commit 6b3089856 ("MEDIUM: fd: do not use the FD_POLL_* flags in the pollers anymore") in 2.1, and at first I couldn't find any valid reason why. Oh yes, I've just found the reason. I have a lengthy test log here from the same date as the commit showing that when RDHUP is reported, HUP is reported only if the sending side was previously closed. Even if that's true in practice with data transfers, it's a bad idea to make this shortcut because, as you've seen, connect() does poll for writes but doesn't try to transfer anything, hence it shouldn't have to suffer from this shortcut.
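
For context, the normalization in question looks roughly like this (a simplified sketch of the poller-side mapping, not the literal ev_epoll.c code):

uint n = 0;
if (e & EPOLLIN)    n |= FD_EV_READY_R;
if (e & EPOLLOUT)   n |= FD_EV_READY_W;
if (e & EPOLLRDHUP) n |= FD_EV_SHUT_R;
if (e & EPOLLHUP)   n |= FD_EV_SHUT_RW;   /* the questionable shortcut */
if (e & EPOLLERR)   n |= FD_EV_ERR_RW;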

Individually, none of these operations are fundamentally wrong, as they help extract a maximum of information from what's reported by the poller. But I do think that we should not set POLL_ERR when the output is closed while it's being polled for writes. The first reason is that this is a sticky flag that was not really reported by the lower layer here, so we can be wrong (as was the case here). The second reason is that if it were really reported, we'd expect getsockopt(SO_ERROR) to return an error, and here it does not. As such, I think that the best solution is to undo a part of commit 2aaeee34da, that is, to remove the assignment of FD_POLL_ERR in fd_update_events() below the comment saying "SHUTW reported while FD was active for writes is an error", because this is the one that guesses wrong here. Anyway, its sole purpose was to avoid a few send() errors on aborted connections, and these are totally marginal. And if we decided that we needed to reintroduce it, then we'd adopt distinct poll-for-writes mechanisms, with one indicating we're polling for a connect() instead.
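
For readers following along, this is the standard completion check for a non-blocking connect() that the paragraph refers to (a generic sketch, not the haproxy code path):

int err = 0;
socklen_t errlen = sizeof(err);

/* Once the poller reports the FD writable, SO_ERROR yields the pending
 * errno (0 = success) and, as noted later in this thread, clears it. */
if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &errlen) == 0 && err != 0) {
    /* the pending connect() really failed; err holds the error code */
}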

As such I would appreciate it if you could test with this:

--- a/src/fd.c
+++ b/src/fd.c
@@ -587,10 +587,6 @@ int fd_update_events(int fd, uint evts)
              ((evts & FD_EV_SHUT_R)  ? FD_POLL_HUP : 0) |
              ((evts & FD_EV_ERR_RW)  ? FD_POLL_ERR : 0);
 
-       /* SHUTW reported while FD was active for writes is an error */
-       if ((fdtab[fd].state & FD_EV_ACTIVE_W) && (evts & FD_EV_SHUT_W))
-               new_flags |= FD_POLL_ERR;
-
        /* compute the inactive events reported late that must be stopped */
        must_stop = 0;
        if (unlikely(!fd_active(fd))) {

Thank you!

wtarreau avatar Sep 17 '22 08:09 wtarreau

Thank you for the response, and sorry it took me so long to test your fix proposal.

I forgot to mention that "Socket error" is not the only reason I see. There is one more that is very frequent: "reason: Layer7 invalid response".

Your fix proposal does not fix the "Layer7 invalid response" reason. It fixes only the "Socket error" reason. My pseudo-fix fixes both.

I sometimes get "Layer6 invalid response, info: SSL handshake failure (various errors here)" even with my pseudo-fix. However, these might be real: they are very infrequent, and the way I reproduce the situation loads the CPU of the VM haproxy is running on. I am running haproxy itself plus 10 concurrent curls accessing a backend through haproxy at localhost.

gbrdead avatar Sep 27 '22 13:09 gbrdead

Interesting. That's stranger then, because it would indicate that either the error is set at another point, and/or that there are incorrect checks somewhere. The problem with clearing the error is that it gets lost, which may well result in stuck connections or similar situations where an error event is missed. Errors should never be cleared. I'll have a look at sock_conn_check(); maybe it sets the error while there are still pending data, and it should not do that.

wtarreau avatar Sep 27 '22 15:09 wtarreau

I think I can explain this situation, too. Mind you, I haven't got the time to check the behaviour at runtime, I just analyzed the source. Here is a modification of the original description for the "Layer7 invalid response" case:

  1. Unlike in the original description, let's assume that we receive only the following event: EPOLLIN|EPOLLRDHUP (and that we do not receive the one with EPOLLHUP).

  2. As a result of EPOLLRDHUP, _do_poll() in ev_epoll.c calls fd_update_events() with FD_EV_SHUT_R set and FD_EV_SHUT_W unset. In the latter, FD_POLL_HUP gets set in fdtab[fd].state but FD_POLL_ERR stays unset.

  3. In sock_conn_check() in sock.c we reach the comment: /* error present, fall through common error check path */ (i.e., just like in the original description - we go to wait only if both FD_POLL_HUP and FD_POLL_ERR are unset). This explains why my pseudo-fix works for this case, too.

...

  4. Eventually, raw_sock_to_buf() in raw_sock.c gets called. (fdtab[conn->handle.fd].state & (FD_POLL_ERR|FD_POLL_HUP)) == FD_POLL_HUP is true upon entry to the function (unlike in the original description) and we go to read0, as sketched below. The response from the backend is never read and haproxy assumes an application-level error.
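
The check described in this step can be paraphrased as follows (simplified from the description above, not the verbatim raw_sock.c source):

/* upon entry to raw_sock_to_buf(): HUP reported without ERR */
if ((fdtab[conn->handle.fd].state & (FD_POLL_ERR|FD_POLL_HUP)) == FD_POLL_HUP)
    goto read0;   /* treated as a clean shutdown, so the pending
                   * response is never read */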

gbrdead avatar Sep 27 '22 17:09 gbrdead

Thank you very much for your analysis. I don't like this; I'll have to recheck everything now. It's very hard for me to context-switch like this, but it is needed. I think that the test on RDHUP that sets SHUT_R fires a bit too early. Unless I'm mistaken, we ought to report SHUT_R only after we've read the last block (or when HUP was reported without IN). So this definitely deserves a complete semantic recheck of what each flag promises and does not promise.

wtarreau avatar Sep 27 '22 18:09 wtarreau

Is there any news about a fix? Or is there a workaround, e.g. using a different I/O model (like the one in haproxy 1.8, which worked fine in our scenario)?

gbrdead avatar Nov 07 '22 13:11 gbrdead

Sorry, but we're currently buried alive in haproxyconf preparations; it will be over in a few days.

wtarreau avatar Nov 07 '22 15:11 wtarreau

I'm getting progressively back on it again; it's now at the top of my pile of urgent stuff to analyse before the release, thanks for your patience. It will take a while for me to re-digest the history (context-switching always being a nightmare). But I'd like to understand the whole sequence of events that causes this. Maybe it will require a much deeper fix that will not be suitable in the short term, but it needs to be sorted out.

wtarreau avatar Nov 16 '22 17:11 wtarreau

Hi @gbrdead,

I'm getting a better understanding of what's happening. The FD_POLL_HUP internal status has a non-uniform meaning due to the fact that RDHUP exists on some pollers and not others. As such, it is set whenever a pending shut is reported by EPOLLRDHUP and normalized into FD_EV_SHUT_R. And indeed, in sock_conn_check() HUP is sufficient to declare a connect() error, so if the server returns data and immediately closes, this could trigger RDHUP, then SHUTR, then the connect error. I suspect that this patch would silence it (it's not the fix though):

diff --git a/src/fd.c b/src/fd.c
index f4f1bae81..a4ea96d77 100644
--- a/src/fd.c
+++ b/src/fd.c
@@ -584,7 +584,7 @@ int fd_update_events(int fd, uint evts)
        new_flags =
              ((evts & FD_EV_READY_R) ? FD_POLL_IN  : 0) |
              ((evts & FD_EV_READY_W) ? FD_POLL_OUT : 0) |
-             ((evts & FD_EV_SHUT_R)  ? FD_POLL_HUP : 0) |
+             //((evts & FD_EV_SHUT_R)  ? FD_POLL_HUP : 0) |
              ((evts & FD_EV_ERR_RW)  ? FD_POLL_ERR : 0);

        /* SHUTW reported while FD was active for writes is an error */

Another short-term alternative that should work would be to replace the #ifndef EPOLLRDHUP block in ev_epoll.c around line 32 with #undef EPOLLRDHUP then #define EPOLLRDHUP 0.
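
That alternative would look like this (a sketch of the suggested replacement near the top of src/ev_epoll.c; the exact guarded block is assumed from the description above):

/* instead of the #ifndef EPOLLRDHUP compatibility block: */
#undef EPOLLRDHUP
#define EPOLLRDHUP 0   /* never request nor see EPOLLRDHUP from the kernel */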

But clearly it indicates that the internal FD_POLL_HUP sometimes means POLLHUP and sometimes POLLRDHUP. I need to audit all of it now to understand when it implies a shutw and when not. I'll likely need to add an extra state but I would like to know if I should complement it or replace it.

wtarreau avatar Nov 17 '22 11:11 wtarreau

There is also something else I don't get. The only way I'm seeing for epoll_wait() to report EPOLLHUP on a connect without EPOLLERR is that the port is really closed. Indeed, the only other case is when the server closes without reading the client's data, which will carry both EPOLLERR and EPOLLHUP. However, here, if we have something to send, that will only happen after connect() succeeds, hence I can't see any way that connect() could be informed of destroyed data. And I've been spending quite some time trying to reproduce this, even with servers that send a response and immediately close without reading the request, and I can't figure out a way to get EPOLLHUP while waiting for connect().

It would help me a lot if you could share your strace output.

wtarreau avatar Nov 17 '22 15:11 wtarreau

Not being able to reproduce it I can only guess, but my gut feeling is that in the end the patch below should be sufficient to address the problem and would be trivially backportable:

diff --git a/src/sock.c b/src/sock.c
index 1482c1a4f..2540d7d41 100644
--- a/src/sock.c
+++ b/src/sock.c
@@ -754,7 +754,7 @@ int sock_conn_check(struct connection *conn)
         * soon as we meet either of these delicate situations. Note that
         * SO_ERROR would clear the error after reporting it!
         */
-       if (cur_poller.flags & HAP_POLL_F_ERRHUP) {
+       if (0 && cur_poller.flags & HAP_POLL_F_ERRHUP) {
                /* modern poller, able to report ERR/HUP */
                if ((fdtab[fd].state & (FD_POLL_IN|FD_POLL_ERR|FD_POLL_HUP)) == FD_POLL_IN)
                        goto done;

It's the only place where we use HUP for what it really means (i.e. a shutdown in both directions), which corresponds to a connect() failure. However, I'll really need your strace output, because the only way that code can cause a failure is by having the validating connect() return an unhandled errno: the code above doesn't provoke an error, it just avoids double-checking connect() when it's known there was nothing suspicious. Thus if the code that follows reports the error, it's because errno matches neither EALREADY, EINPROGRESS nor EISCONN.
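
For reference, the double-checking idiom referred to here is the classic one for pending non-blocking connects (a generic sketch, not the haproxy code):

/* Calling connect() again on a pending non-blocking socket reveals how
 * the attempt ended. */
if (connect(fd, addr, addrlen) == -1) {
    if (errno == EALREADY || errno == EINPROGRESS) {
        /* still connecting: keep waiting */
    } else if (errno == EISCONN) {
        /* already established: success */
    } else {
        /* genuine failure, e.g. ECONNREFUSED or ETIMEDOUT */
    }
}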

wtarreau avatar Nov 17 '22 15:11 wtarreau

I can, to some extent, provoke random POLLHUP + POLLERR just from the server, particularly if it doesn't consume the request or closes after disabling lingering. But these are reported on the recv() call via the h1 mux. Actually the whole response is valid, but there's still the CO_FL_ERROR flag on the connection (since the error was really present at the socket layer). There's still some ongoing work to make sure we no longer use the connection errors from the upper layers. For a test I could remove all of them and rely only on the stream's status, but that wasn't sufficient: some error cases were turned into timeouts, so a few places probably still rely on the hard error to interrupt some processing. But let's check the strace output first before going in every direction.

wtarreau avatar Nov 17 '22 17:11 wtarreau

For the specific case where the server responds without draining the client's data, causing random RSTs to be emitted, and random up/downs to happen, the following patch proposed by Christopher addresses it by making sure we offer a chance to the mux to first parse a full response before seeing the RST (it will only see the RST if the response is not complete). For me it totally stabilizes such checks in both 2.6 and 2.7, under load as well as when there is no load.

diff --git a/src/raw_sock.c b/src/raw_sock.c
index e172b1d4e..af95c8283 100644
--- a/src/raw_sock.c
+++ b/src/raw_sock.c
@@ -329,7 +329,7 @@ static size_t raw_sock_to_buf(struct connection *conn, void *xprt_ctx, struct bu
         * of recv()'s return value 0, so we have no way to tell there was
         * an error without checking.
         */
-       if (unlikely(fdtab[conn->handle.fd].state & FD_POLL_ERR))
+       if (unlikely(!done && fdtab[conn->handle.fd].state & FD_POLL_ERR))
                conn->flags |= CO_FL_ERROR | CO_FL_SOCK_RD_SH | CO_FL_SOCK_WR_SH;
        goto leave;
 }

wtarreau avatar Nov 18 '22 09:11 wtarreau

FYI, I pushed the fix to 2.7-dev. It will be shipped with 2.7-dev9.

capflam avatar Nov 18 '22 14:11 capflam

Here are the requested strace logs. I used haproxy 2.6.6.

strace log for the situation I explained on Sep 13: socket_error.strace.log

strace log for the situation I explained on Sep 27: layer7_invalid_response.strace.log

gbrdead avatar Nov 25 '22 11:11 gbrdead

Now about the fix in 2.7-dev10: it fixes only the situation I explained on Sep 27, which is the opposite of the fix proposed by wtarreau on Sep 17. Maybe you need to combine both fixes?

gbrdead avatar Nov 25 '22 11:11 gbrdead

Many thanks Vladimir! Your trace is amazing: it shows epoll_wait() reporting a hangup while this is not true, yet the subsequent send() works and the following recv() works as well, receiving a response. That explains why your patch that masks these flags helps!

I have no idea how this is possible; to me it looks like epoll_wait() is lying. For example, have you seen that the recvfrom() after HUP+RDHUP returns EAGAIN? How is that possible at all? It should return 0, indicating a recv shutdown. So it means we cannot trust these flags at all on your machine!
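
The recv() semantics relied on here are generic POSIX behavior (a sketch for illustration, independent of haproxy):

char buf[4096];
ssize_t r = recv(fd, buf, sizeof(buf), 0);

if (r == 0) {
    /* orderly shutdown: the peer's FIN was seen, as HUP/RDHUP imply */
} else if (r < 0 && errno == EAGAIN) {
    /* no data and no shutdown seen yet, contradicting HUP+RDHUP */
}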

What exact kernel version are you using? The only possibility I'm seeing here is to implement a workaround to ignore certain flags with a broken epoll implementation. I'd like to be able to detect this automatically, but we don't know how to trigger this situation. In the worst case we could imagine a global setting to enable such a workaround.

wtarreau avatar Nov 25 '22 12:11 wtarreau

The kernel version is: Linux 6749a9bb-63c7-4a1e-a968-1c1bd0d9266e 5.4.0-132-generic #148~18.04.1-Ubuntu SMP Mon Oct 24 20:41:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

The VM is running under Cloud Foundry BOSH. The cloud infrastructure is AWS but I have been able to reproduce it on Azure, too (I can also check GCP, if you wish).

gbrdead avatar Nov 25 '22 12:11 gbrdead

Thank you! There's no need to run in a different environment; the syscall is generic and shouldn't cause this. I cannot think of a single explanation for this polling state, but if it happens with such a generic kernel, we may need to implement a workaround.

Would it be possible to run tcpdump between haproxy and one server to see the exact sequence of network exchanges that produces this? Do not hesitate to tell me if it's problematic to share such info. I'm trying to narrow down the issue to its exact root cause (maybe we can even reach a point where we could fix it in the kernel).

wtarreau avatar Nov 25 '22 12:11 wtarreau

Just to make sure not to waste your time: the output of tcpdump -Svvni ethX host $server_ip and tcp port $server_port will be sufficient. Preferably on the working case (with your patch of Sep 13), since it shows the subsequent syscalls working while they should not. Ideally with the output of strace -tt in parallel, but that is not mandatory since we already have your trace :-)

wtarreau avatar Nov 25 '22 13:11 wtarreau

It doesn't happen on command. And if I apply my patch, I can no longer detect when it happens. Is it OK if I do it without the patch? The case in which the response is read (Socket error) is OK, isn't it?

gbrdead avatar Nov 25 '22 15:11 gbrdead

Yeah, that makes sense of course. I thought you could reproduce it easily outside of production. Do not hesitate to contact me privately for the trace if you'd rather not share private info.

wtarreau avatar Nov 25 '22 16:11 wtarreau

Here is the "Socket error" case, both strace and tcpdump are in the same file.

socket_error.log

gbrdead avatar Nov 25 '22 17:11 gbrdead

Thank you, and good idea to have merged them into a single file: reordering them by time shows that epoll_wait() already responds with EPOLLIN|EPOLLHUP|EPOLLRDHUP as soon as the SYN-ACK comes back. The sequence of syscalls is so trivial that it should be reproducible. I do have access to a machine with the same distro and a reasonably similar kernel. I'll run some tests to try to reproduce it and figure out the root cause. I may have to report it to netdev (thanks for having anonymized your traces, that will save me some time).

Now the next step might be to figure out how to detect it at boot and fall back to a compatible mode. Oh, by the way, could you please run with "haproxy -de" (or use noepoll in your global section)? This will disable epoll and fall back to poll(). Both are expected to report the same set of flags. If it never fails with poll(), we'll at least know that the problem is with epoll; otherwise it might be in the network stack.
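
In configuration form, that fallback would look like this (extending the global section quoted at the top of the issue; noepoll is the keyword mentioned above):

global
    maxconn 50000
    noepoll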

wtarreau avatar Nov 25 '22 18:11 wtarreau

I ran haproxy with poll (i.e. with -de) for a whole night and there was not a single reproduction. In the same situation, the problem reproduces with epoll within minutes. So we can safely assume that poll is not affected. In fact, I may even keep using poll. What is the downside, especially compared to haproxy 1.8?

gbrdead avatar Nov 26 '22 07:11 gbrdead

OK great, we're getting closer! The problem with poll() vs epoll() is that poll() starts to consume heavy CPU when you reach thousands of concurrent connections. But at lower loads it's not a problem at all.

wtarreau avatar Nov 26 '22 09:11 wtarreau

Hello @gbrdead

Could you please confirm that the server is a physically different machine here and not just a docker container or similar? I'm trying to set up a reproducer on a similar distro+kernel (for now I'm failing). Also, could you please post the output of "lsmod" on your machine in case something gives me an idea? For example, we could imagine that an iptables rule or module is triggering the issue. Thanks!

wtarreau avatar Nov 28 '22 10:11 wtarreau

The machine that haproxy is running on and the machine that the backend is running on are different virtual machines. I have no way of knowing whether they run on different physical machines but I guess this doesn't matter.

lsmod on the haproxy machine:

Module                  Size  Used by
binfmt_misc            24576  1
netlink_diag           16384  0
unix_diag              16384  0
xt_tcpudp              20480  4
iptable_mangle         16384  1
xt_cgroup              16384  3
bpfilter               24576  0
intel_rapl_msr         20480  0
intel_rapl_common      24576  1 intel_rapl_msr
crct10dif_pclmul       16384  1
crc32_pclmul           16384  0
ghash_clmulni_intel    16384  0
aesni_intel           372736  0
crypto_simd            16384  1 aesni_intel
cryptd                 24576  2 crypto_simd,ghash_clmulni_intel
glue_helper            16384  1 aesni_intel
cirrus                 16384  0
rapl                   20480  0
drm_kms_helper        184320  3 cirrus
drm                   495616  3 drm_kms_helper,cirrus
input_leds             16384  0
fb_sys_fops            16384  1 drm_kms_helper
syscopyarea            16384  1 drm_kms_helper
sysfillrect            16384  1 drm_kms_helper
sysimgblt              16384  1 drm_kms_helper
serio_raw              20480  0
mac_hid                16384  0
sch_fq_codel           20480  3
ip_tables              32768  1 iptable_mangle
x_tables               45056  4 xt_cgroup,xt_tcpudp,ip_tables,iptable_mangle
autofs4                45056  2
psmouse               151552  0
floppy                 81920  0
i2c_piix4              28672  0
ixgbevf                77824  0
pata_acpi              16384  0

gbrdead avatar Nov 28 '22 15:11 gbrdead