Reload seems to sporadically reset running connections
Description of the problem
Please note that this is the same environment as in #869, though a different problem.
Clients sporadically get their TCP connections reset. This happens in a large cluster with ~4800 ingresses. There is a high volume of new incoming connections and also a high volume of long-lived connections (hours). There are sometimes event bursts with respect to ingresses, e.g. many new / deleted / changed ingresses, and then the frequency of the resets increases.
We have captured TCP dumps at various points and thereby found out that it is the haproxy-ingress pod that terminates the connection. As an example, we have a connection opened at 13:59:43 UTC and terminated at 14:00:04 UTC. 100.97.13.19 is the IP of the haproxy-ingress pod and 100.96.227.246 that of the server. We see this:
There is regular traffic up until 13:59:58, then nothing for six seconds, and then a FIN/ACK is sent to the server. The server does not expect this and replies with a RST/ACK. The FIN/ACK is not coming from the client (where we have also captured dumps). The client just gets the last ACK and then the RST/ACK (client IP redacted):

This does not really match any of our timeout settings (see details below), and the connection is also very new.
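For completeness, a capture along these lines is enough to see which side sends the FIN first. This is just a sketch: the interface name and pcap path are placeholders, while the IPs and port are the ones from the example above and from the backend config further down.

# On the node running the haproxy-ingress pod: capture traffic between the
# pod (100.97.13.19) and the backend server (100.96.227.246):
tcpdump -i eth0 -nn -tttt -w /tmp/haproxy-server.pcap \
  'host 100.97.13.19 and host 100.96.227.246 and tcp port 1234'
# Afterwards, list only the connection teardowns (FIN/RST) from the capture:
tcpdump -nn -tttt -r /tmp/haproxy-server.pcap \
  'tcp[tcpflags] & (tcp-fin|tcp-rst) != 0'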
In the logs of haproxy-ingress we see a few updates, ingress events, and then a reload of the embedded haproxy around the time of the last regular packets:
I0202 13:59:43.324320 6 controller.go:317] starting haproxy update id=29569
I0202 13:59:43.324456 6 instance.go:310] old and new configurations match
I0202 13:59:43.324485 6 controller.go:326] finish haproxy update id=29569: parse_ingress=0.063531ms write_maps=0.008127ms total=0.071658ms
...
I0202 13:59:49.324423 6 controller.go:317] starting haproxy update id=29572
I0202 13:59:49.338207 6 instance.go:307] haproxy updated without needing to reload. Commands sent: 3
I0202 13:59:49.338271 6 controller.go:326] finish haproxy update id=29572: parse_ingress=0.206048ms write_maps=0.099833ms write_config=13.445251ms total=13.751132ms
I0202 13:59:49.619040 6 event.go:282] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"default", Name:"f614705f-51e0-4978-8a35-7642d80a5632-0", UID:"b7f2afb4-d9e0-4e80-934b-95dd6059595b", APIVersion:"networking.k8s.io/v1", ResourceVersion:"2061866396", FieldPath:""}): type: 'Normal' reason: 'CREATE' Ingress default/f614705f-51e0-4978-8a35-7642d80a5632-0
...
I0202 13:59:55.589288 6 event.go:282] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"default", Name:"9453efbd-b5f3-4a4f-b883-fa582a451f36", UID:"7ca24a16-6dc5-4d19-8ab0-defed1c27afc", APIVersion:"networking.k8s.io/v1", ResourceVersion:"2061866871", FieldPath:""}): type: 'Normal' reason: 'DELETE' Ingress default/9453efbd-b5f3-4a4f-b883-fa582a451f36
...
I0202 13:59:58.263516 6 controller.go:334] starting haproxy reload id=633
W0202 13:59:58.732242 6 instance.go:501] output from haproxy:
I0202 13:59:58.732511 6 instance.go:337] haproxy successfully reloaded (embedded)
I0202 13:59:58.732564 6 controller.go:338] finish haproxy reload id=633: reload_haproxy=468.960356ms total=468.960356ms
Note that the last regular packet from haproxy to the server is recorded at 13:59:58.718081, and the reload is logged between 13:59:58.263516 and 13:59:58.732564.
Because of the timing, we suspect that something goes wrong during the reload and a haproxy process gets terminated even though it should not (yet).
The pod is under considerable load: ~36 CPUs in use (nbthread is 63), ~40 GiB of memory, and ~280 internal processes running. The number of processes also changes, suggesting that some get terminated.
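For reference, the process count can be watched with something like this. It is only a sketch: the pod name and namespace are placeholders, and it assumes the controller image ships a busybox-style shell and grep.

# Count live haproxy processes in the controller pod every few seconds:
while true; do
  echo "$(date -u +%T): $(kubectl -n ingress exec haproxy-ingress-xxxxx -- \
    sh -c 'grep -lx haproxy /proc/[0-9]*/comm | wc -l') haproxy processes"
  sleep 5
done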
Expected behavior
Connections should be kept alive. The embedded haproxy should not terminate them.
Steps to reproduce the problem
We are working on reproducing it in a small environment, but cannot confirm a reproduction yet. This is what we are doing (a rough sketch follows the list):
- Have a number of backend servers, some ingresses, one haproxy-ingress pod
- Have a loop constantly doing ingress changes to trigger reloads
- Have a set of clients constantly opening connections and keeping them open, doing simple data transmission
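A rough sketch of the churn and client loops we have in mind. Everything here is a placeholder (namespace, ingress/service names, host, port), and whether --class or the kubernetes.io/ingress.class annotation is needed depends on how the controller's ingress class is configured.

# Ingress churn loop to force frequent reloads:
while true; do
  name="stress-$RANDOM"
  kubectl -n default create ingress "$name" --class=haproxy \
    --rule="$name.example.com/app*=some-service:1234"
  sleep 2
  kubectl -n default delete ingress "$name"
done

# Long-lived client connections that send a little data periodically
# (plain TCP for simplicity in the test environment):
for i in $(seq 1 20); do
  ( while true; do printf 'ping\n'; sleep 10; done | nc some-host.example.com 443 ) &
done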
Environment information
HAProxy Ingress version: v0.13.4
Command-line options:
- --default-backend-service=$(POD_NAMESPACE)/ingress-default-backend
- --default-ssl-certificate=$(POD_NAMESPACE)/ingress-default-backend-tls
- --configmap=$(POD_NAMESPACE)/haproxy-configmap-nlb
- --reload-strategy=reusesocket
- --alsologtostderr
- --reload-interval=60s
- --wait-before-update=10s
- --backend-shards=50
From the configmap:
config-frontend: |
maxconn 30000
option clitcpka
option contstats
config-global: |
pp2-never-send-local
health-check-interval: 10s
https-log-format: '%H\ %ci:%cp\ [%t]\ %ft\ %b/%s/%bi:%bp\ %Th/%Tw/%Tc/%Td/%Tt\ %B\
%ts\ %ac/%fc/%bc/%sc/%rc\ %sq/%bq'
max-connections: "30100"
nbthread: "63"
slots-min-free: "0"
timeout-client: 50s
timeout-client-fin: 50s
timeout-tunnel: 24h
Global options:
global
daemon
unix-bind mode 0600
nbthread 63
cpu-map auto:1/1-63 0-62
stats socket /var/run/haproxy/admin.sock level admin expose-fd listeners mode 600
maxconn 30100
tune.ssl.default-dh-param 2048
ssl-default-bind-ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384
ssl-default-bind-ciphersuites TLS_AES_128_GCM_SHA256:TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256
ssl-default-bind-options no-sslv3 no-tlsv10 no-tlsv11 no-tls-tickets
ssl-default-server-ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384
ssl-default-server-ciphersuites TLS_AES_128_GCM_SHA256:TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256
pp2-never-send-local
defaults
log global
maxconn 30100
option redispatch
option http-server-close
option http-keep-alive
timeout client 50s
timeout client-fin 50s
timeout connect 5s
timeout http-keep-alive 1m
timeout http-request 5s
timeout queue 5s
timeout server 50s
timeout server-fin 50s
timeout tunnel 24h
Ingress objects (parts redacted / changed):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
annotations:
ingress.kubernetes.io/config-backend: |
acl network_allowed src 0.0.0.0/0
tcp-request content reject if !network_allowed
ingress.kubernetes.io/maxconn-server: "1000"
ingress.kubernetes.io/proxy-protocol: v2
ingress.kubernetes.io/ssl-passthrough: "true"
kubernetes.io/ingress.class: haproxy
.....
name: some-uid
namespace: default
.....
spec:
rules:
- host: some-uid-********.com
http:
paths:
- backend:
service:
name: some-uid
port:
number: 1234
path: /
pathType: ImplementationSpecific
status:
loadBalancer:
ingress:
- {}
Resulting backend config:
backend default_**********************
mode tcp
balance roundrobin
option srvtcpka
acl network_allowed src 0.0.0.0/0
tcp-request content reject if !network_allowed
server srv001 100.96.227.246:1234 weight 1 maxconn 1000 send-proxy-v2 check inter 10s
Hi, I'm in the process of building an environment to stress test the controller and the proxy, and your scenario and configuration details are perfect for guiding the first tests. A few questions below:
- Would you say the ingress snippet above is similar to the majority of the ingresses in the environment? I'll try to configure my test env to be as similar to yours as possible;
- Is there any chance to update your haproxy version, either by updating the controller or by trying the external deployment? I remember seeing in the past some of the old instances crashing just after the restart; you can see this via the kernel log, dmesg, etc., or maybe in the console of the VM, depending on your distro. A haproxy crash would explain the connection resets you're seeing, and maybe it's already fixed in newer versions.
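Something along these lines on the node should already show a crashed or OOM-killed worker, if that's what's happening. Just a sketch; the exact log source and time window depend on the distro.

# Kernel ring buffer: segfaults and OOM kills usually show up here:
dmesg -T | grep -iE 'haproxy.*(segfault|oom|killed)'
# Or, on systemd-based distros, the kernel journal:
journalctl -k --since "2 hours ago" | grep -i haproxy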
Thanks a lot for the feedback! If I can supply you with more information that would help you with the stress tests, let me know.
Would you say the ingress snippet above is similar to the majority of the ingresses in the environment? I'll try to configure my test env to be as similar to yours as possible;
Yes, basically all of them are of this kind, and there are ~4500 of them. There are also ~200 ingresses for routing HTTPS, and those have a section with:
tls:
- hosts:
- some-uid-********.com
secretName: some-secret-tunnel-cert
but traffic for those is negligible.
Is there any chance to update your haproxy version, either by updating the controller or by trying the external deployment? I remember seeing in the past some of the old instances crashing just after the restart; you can see this via the kernel log, dmesg, etc., or maybe in the console of the VM, depending on your distro. A haproxy crash would explain the connection resets you're seeing, and maybe it's already fixed in newer versions.
In the affected cluster we are running haproxy-ingress v0.13.4 (embedded haproxy 2.3.14), so at best we could go to controller v0.13.6 (embedded haproxy 2.3.17). The external deployment would unfortunately mean a major refactoring for us; I was hoping to find some form of shorter-term mitigation. Looking at the changelog for 2.3, I am not spotting anything that reads like a fixed crash bug in versions 2.3.15 - 2.3.17. Also, I could not find any indication of a crash in the syslogs I have been looking at, but maybe there are more log sources I should have checked. I will keep digging.
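(If we go that route, bumping the controller should just be a matter of updating the image tag, e.g. something like the following; the namespace, deployment, and container names are placeholders for ours:)

kubectl -n ingress set image deployment/haproxy-ingress \
  haproxy-ingress=quay.io/jcmoraisjr/haproxy-ingress:v0.13.6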
Do you think that the reload strategy could have an impact? We are using reusesocket, and switching to native is still under discussion on our end. The latter has the known caveat of a small time window of blocked connections during reload, which is why we are still refraining from it. I just wonder whether the socket handover might be a root cause for the connection resets we are seeing, but this is just speculation on my end; if you have more insights I'd appreciate them.
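For context, my rough understanding of what the two strategies boil down to at the haproxy level (simplified, certainly not the controller's exact invocation; the stats socket path is the one from our global section, the config and pid file paths are placeholders):

# reusesocket: the new process fetches the listening sockets from the old one
# over the stats socket (this relies on "expose-fd listeners"):
haproxy -f /etc/haproxy/haproxy.cfg \
  -x /var/run/haproxy/admin.sock -sf $(cat /var/run/haproxy/haproxy.pid)

# native: the new process binds the ports itself while the old one is soft-stopped,
# hence the small window where new connections may be refused:
haproxy -f /etc/haproxy/haproxy.cfg -sf $(cat /var/run/haproxy/haproxy.pid)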
Thanks for the update. Regarding the proxy version, you also have the option to build a new image yourself, but migrating to v0.13.6 should be safe enough. Regarding possible errors, the proxy instances might also end up OOM-killed, since you reported lots of reloads and old instances remain alive due to long-lived connections; but I think you'd easily find that, either via syslog or by monitoring the host's memory usage. Regarding the reload strategy, I think the problems you could have copying the listener sockets would lead to messages like the ones described in #696, but that's just speculation from my side, and maybe there are more side effects I'm not aware of. Depending on the SLA you have and how resilient your clients are, changing the reload strategy to native might be a reasonable temporary option, just to figure out whether you continue to see the same kind of issue. Maybe in a month or so I'll have some numbers and can also improve something on the controller side, but in the meantime I think it's worth writing up a question and sending it to the haproxy mailing list. Of course the controller logs won't be helpful there, but they can be translated to the moments the proxy was reloaded.
It has been a while, but I would still like to provide a short update on this ticket. From my point of view, setting a reload-interval resolved it. The resets happened in situations where bursts of many ingress changes triggered many reloads (on the order of several per second). Limiting the frequency of reloads made this problem go away.