
hap-ingress in external mode not updating certificates on disk (sporadic)


Setup:

  • Two instances of hap-ingress (3.1.6/3.1.7) running on Debian, serving as (redundant) external ingress controllers for k8s.
  • Identical setup and versions on both instances.
  • A pipeline updates certificates by replacing (updating) TLS secrets issued by Let's Encrypt in k8s.

Problem:

  • Updating a TLS secret in k8s replaces the runtime certificate in HAProxy (good) but not the file on disk (problem).
  • The problem happens only sporadically (i.e. sometimes both hap-ingress instances replace the disk file, sometimes only one does).

Analysis:

This happens only sporadically. I can see that another certificate was successfully updated on disk by both hap-ingress instances just two days earlier. There are no errors or warnings in the hap-ingress logs.

The disk file is replaced only on one instance

Checked with

ls -ltr /opt/haproxy-ingress/config/certs/frontend/
openssl x509 -enddate -noout -in certs/frontend/haproxy-ingress_epg.[snip].pem

The runtime cert is replaced on both instances

Checked with

openssl s_client -connect [snip]:443 -servername epg.[snip] < /dev/null 2>/dev/null | openssl x509 -noout -enddate -subject
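As a cross-check, the runtime state can also be queried directly over the HAProxy runtime API rather than via a TLS handshake. A minimal sketch, assuming the admin socket lives under the configured --runtime-dir (the socket file name here is an assumption):

# List the certificates the running haproxy has loaded, then show details for the one in question.
# The socket name is an assumption; adjust to whatever exists under /tmp/haproxy-ingress.
echo "show ssl cert" | socat stdio /tmp/haproxy-ingress/haproxy-runtime-api.sock
echo "show ssl cert /opt/haproxy-ingress/config/certs/frontend/haproxy-ingress_epg.[snip].pem" | socat stdio /tmp/haproxy-ingress/haproxy-runtime-api.sock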

The (defunct) instance is no longer returning a valid certificate chain (i.e. issuer cert is missing)

Checked with

openssl s_client -connect [snip]:443 -servername epg.[snip] < /dev/null 2>/dev/null

OK instance:

depth=2 C=US, O=Internet Security Research Group, CN=ISRG Root X1
verify return:1
depth=1 C=US, O=Let's Encrypt, CN=R10
verify return:1
depth=0 CN=epg.[snip]
verify return:1
---
Certificate chain
 0 s:CN=epg.[snip]
   i:C=US, O=Let's Encrypt, CN=R10
   a:PKEY: RSA, 2048 (bit); sigalg: sha256WithRSAEncryption
   v:NotBefore: May 21 06:40:13 2025 GMT; NotAfter: Aug 19 06:40:12 2025 GMT
 1 s:C=US, O=Let's Encrypt, CN=R10
   i:C=US, O=Internet Security Research Group, CN=ISRG Root X1
   a:PKEY: RSA, 2048 (bit); sigalg: sha256WithRSAEncryption
   v:NotBefore: Mar 13 00:00:00 2024 GMT; NotAfter: Mar 12 23:59:59 2027 GMT

Defunct instance:

depth=0 CN=epg.[snip]
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 CN=epg.[snip]
verify error:num=21:unable to verify the first certificate
verify return:1
depth=0 CN=epg.[snip]
verify return:1
---
Certificate chain
 0 s:CN=epg.[snip]
   i:C=US, O=Let's Encrypt, CN=R10
   a:PKEY: RSA, 2048 (bit); sigalg: sha256WithRSAEncryption
   v:NotBefore: May 21 06:40:13 2025 GMT; NotAfter: Aug 19 06:40:12 2025 GMT
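One way to quantify the difference between the two instances and the file on disk (a sketch reusing the paths and [snip] placeholders from above):

# Certificates in the served chain: 2 (leaf + intermediate) on the OK instance, 1 on the defunct one.
openssl s_client -connect [snip]:443 -servername epg.[snip] -showcerts < /dev/null 2>/dev/null | grep -c 'BEGIN CERTIFICATE'

# Certificates and validity of the PEM on disk, for comparison.
grep -c 'BEGIN CERTIFICATE' /opt/haproxy-ingress/config/certs/frontend/haproxy-ingress_epg.[snip].pem
openssl x509 -enddate -noout -in /opt/haproxy-ingress/config/certs/frontend/haproxy-ingress_epg.[snip].pem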

Clueless as to how to debug this further. Any help appreciated.

joachimbuechse avatar May 22 '25 07:05 joachimbuechse

Hello @joachimbuechse ,

The certificates are written to disk only if there is a haproxy reload. The runtime certificates are updated in any case, and it seems that part is OK for you. It is possible that on the instance where the certificate was not updated on disk, there was no reload. You could check whether that is the case.

There is also a startup flag, --disable-writing-only-if-reload, that disables this behaviour and always writes the certificates to disk immediately (even if no haproxy reload is happening).
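To check whether a reload happened on the instance that kept the old file, something along these lines should work (a sketch; the journald unit name is an assumption for this external setup, and the 'HAProxy reloaded' message is the controller's reload log line):

# Look for recent reload messages from the controller (unit name is an assumption; adjust to wherever it logs).
journalctl -u haproxy-ingress-controller --since "1 hour ago" | grep 'HAProxy reloaded'

# A reload also spawns fresh haproxy worker processes, so their start times are a useful hint.
ps -o pid,lstart,cmd -C haproxy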

hdurand0710 avatar May 22 '25 07:05 hdurand0710

I've updated the original ticket description, as I missed the most important part: when this situation happens, hap-ingress does not return the full certificate chain but only the site certificate itself, without the issuer certificate.

joachimbuechse avatar May 22 '25 07:05 joachimbuechse

I've tested with --disable-writing-only-if-reload and without; the option is apparently not required for the certificate to be written to disk. I just pushed an older TLS secret. Two instances of hap-ingress, one configured with the option:

[root@[snip]]:/opt/haproxy-ingress/config/certs/frontend # ps -edf | grep haproxy-ingress-controller
root     2021194       1  0 10:04 ?        00:00:01 /usr/local/bin/haproxy-ingress-controller --program=/usr/sbin/haproxy --config-dir=/opt/haproxy-ingress/config --maps-dir=/opt/haproxy-ingress/config/maps --disable-writing-only-if-reload --runtime-dir=/tmp/haproxy-ingress --external --ingress.class=public-ingress --configmap=haproxy-ingress/public-ingress --default-backend-service=default/deny-all --ipv4-bind-address=[snip] --disable-ipv6 --http-bind-port=80 --https-bind-port=443 --stats-bind-port=8404

the other without:

[root@slb-ms08]:/opt/haproxy-ingress/config/certs/frontend # ps -edf | grep ingress
root     1503906       1  0 Apr08 ?        00:21:05 /usr/local/bin/haproxy-ingress-controller --program=/usr/sbin/haproxy --config-dir=/opt/haproxy-ingress/config --maps-dir=/opt/haproxy-ingress/config/maps --runtime-dir=/tmp/haproxy-ingress --external --ingress.class=public-ingress --configmap=haproxy-ingress/public-ingress --default-backend-service=default/deny-all --ipv4-bind-address=[snip] --disable-ipv6 --http-bind-port=80 --https-bind-port=443 --stats-bind-port=8404

Both updated the file in /opt/haproxy-ingress/config/certs/frontend/.

I can't yet say whether the option makes a difference for the reliability of the writes, as this issue only happens sporadically for us.

joachimbuechse avatar May 22 '25 08:05 joachimbuechse

The problem happened again with the --disable-writing-only-if-reload option enabled and hap-ingress 3.1.7.

joachimbuechse avatar Jun 02 '25 13:06 joachimbuechse

This issue may be broader in scope than certificates. My team recently experienced an issue where many services became unreachable across our infrastructure in a way that was consistent with HAProxy reloading deeply stale backend-to-IP mappings from disk. Unfortunately, in our urgency to resolve the production incident, we were unable to collect much supporting data. We'll be following up on that, but hopefully this is useful as a single data point in the interim.

As the incident occurred a few days after an upgrade from 3.0.4 to 3.1.7, we're scrutinizing the parallel filesystem write changes around https://github.com/haproxytech/kubernetes-ingress/commit/1b84fd904b5ff75795f8294552f0a08621f9ef53 and https://github.com/haproxytech/kubernetes-ingress/commit/38f57e8332781ab0e230597655beac4270d88f87 as starting points for our internal investigation. Again, until we have a definitive repro case, there's no data to directly implicate these changes, but they seem like the area where this sort of bug would be most likely to be introduced.
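Until there is a deterministic repro, one low-cost way to catch this in the act might be to watch the certs directory for writes around secret updates, so a secret change that produces no file write stands out (a sketch; requires inotify-tools, path taken from the external setup earlier in the thread):

# Log every completed write or rename in the frontend certs directory with a timestamp.
inotifywait -m -e close_write,moved_to --timefmt '%F %T' --format '%T %w%f %e' /opt/haproxy-ingress/config/certs/frontend/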

jgoldschrafe avatar Jun 30 '25 13:06 jgoldschrafe

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jul 30 '25 14:07 stale[bot]

Sad that this ticket gets no attention; we have rolled back to version 3.0.9 to circumvent the issue.

joachimbuechse avatar Aug 29 '25 17:08 joachimbuechse

@oktalz

Version used before the downgrade: 3.1.11

We are encountering the same issue. Downgrading to 3.0.9 helped.

After a certificate change, the controller reloads correctly:

haproxy-kubernetes-ingress-qlrkz kubernetes-ingress-controller 2025/09/09 15:00:53 INFO    controller/controller.go:216 [transactionID=ada9744a-3f01-4cd8-b973-c81ac9916221] HAProxy reloaded
haproxy-kubernetes-ingress-9ktn2 kubernetes-ingress-controller 2025/09/09 15:00:53 INFO    controller/controller.go:216 [transactionID=68bc9d5e-03c5-4e69-83c2-cccd4feced03] HAProxy reloaded
haproxy-kubernetes-ingress-hxkhg kubernetes-ingress-controller 2025/09/09 15:00:50 INFO    controller/controller.go:216 [transactionID=dfb15282-ee77-44b8-8186-bf7e09347d8f] HAProxy reloaded
haproxy-kubernetes-ingress-slbzv kubernetes-ingress-controller 2025/09/09 15:00:52 INFO    controller/controller.go:216 [transactionID=a6518da7-861b-4d8b-8a1a-d516786170d0] HAProxy reloaded
haproxy-kubernetes-ingress-xt2gb kubernetes-ingress-controller 2025/09/09 15:00:54 INFO    controller/controller.go:216 [transactionID=e0e5b2d0-43fb-468b-abe7-a17622fdf997] HAProxy reloaded

Files on disk:

$ for pod in $(kubectl get pods -n haproxy -o name); do kubectl exec $pod -n haproxy -- ls -lt /etc/haproxy/certs/frontend/test-haproxy_test-cert-secret.pem ; done
-rw-rw-rw-    1 haproxy  haproxy       2696 Sep  9 14:51 /etc/haproxy/certs/frontend/test-haproxy_test-cert-secret.pem
-rw-rw-rw-    1 haproxy  haproxy       2696 Sep  9 14:51 /etc/haproxy/certs/frontend/test-haproxy_test-cert-secret.pem
-rw-rw-rw-    1 haproxy  haproxy       2696 Sep  9 14:51 /etc/haproxy/certs/frontend/test-haproxy_test-cert-secret.pem
-rw-rw-rw-    1 haproxy  haproxy       2700 Sep  9 14:40 /etc/haproxy/certs/frontend/test-haproxy_test-cert-secret.pem
-rw-rw-rw-    1 haproxy  haproxy       2700 Sep  9 14:40 /etc/haproxy/certs/frontend/test-haproxy_test-cert-secret.pem

But the served certificate is still the old one on some haproxy pods:

❯ for pod in $(kubectl get pods -n haproxy -o name); do
  node=$(kubectl get "$pod" -n haproxy -o jsonpath='{.spec.nodeName}')
  echo "=== $pod on node: $node ==="
  kubectl exec -n haproxy "$pod" -- \
    openssl s_client -connect localhost:8443 -servername test.local < /dev/null \
    | openssl x509 -noout -dates
  echo ""  # Blank line for readability
done

=== pod/haproxy-kubernetes-ingress-9ktn2 on node: default-145b0-ywdhn ===
notBefore=Sep  9 12:51:10 2025 GMT
notAfter=Dec  8 12:51:10 2025 GMT

=== pod/haproxy-kubernetes-ingress-hxkhg on node: default-145b0-dhixv ===
notBefore=Sep  9 12:51:10 2025 GMT
notAfter=Dec  8 12:51:10 2025 GMT

=== pod/haproxy-kubernetes-ingress-qlrkz on node: default-145b0-lvqlo ===
notBefore=Sep  9 12:51:10 2025 GMT
notAfter=Dec  8 12:51:10 2025 GMT

=== pod/haproxy-kubernetes-ingress-slbzv on node: default-145b0-anmca ===
notBefore=Sep  9 12:40:41 2025 GMT
notAfter=Dec  8 12:40:41 2025 GMT

=== pod/haproxy-kubernetes-ingress-xt2gb on node: default-145b0-pvoli ===
notBefore=Sep  9 12:40:41 2025 GMT
notAfter=Dec  8 12:40:41 2025 GMT

zadjadr avatar Sep 10 '25 11:09 zadjadr

@zadjadr we will try to figure out what is happening; just because we weren't able to reproduce it does not mean it's working as it should.

I'll keep this open and see if we can do something.

oktalz avatar Sep 10 '25 13:09 oktalz

@oktalz thanks a lot. If there is anything I can try out, test, or provide more info on, please tell me.

zadjadr avatar Sep 10 '25 15:09 zadjadr

(Quoting @jgoldschrafe's earlier comment about the issue possibly being broader than certificates and about the parallel filesystem write changes.)

This was already mentioned here: https://github.com/haproxytech/kubernetes-ingress/issues/702#issuecomment-2891997594

We have much the same issue with backend reloads: if too many pod restarts happen in a short amount of time, the backend ends up with 0 servers.

Looking at haproxy.cfg, everything is correct (targets and pod IPs), but the process itself did not reload the configuration, so it routes to nothing because the runtime backend is empty.

A rollout restart fixes the problem, but we have had a lot of production incidents because of this.

And it has been happening since version 2.x.
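One way to confirm that the running process has diverged from the generated config is to compare the servers listed in haproxy.cfg with the runtime view (a sketch; the backend name is a placeholder and the runtime socket path is an assumption for the containerized setup):

# Servers in the generated configuration (backend name is a placeholder).
sed -n '/^backend my-namespace_my-service_http/,/^backend /p' /etc/haproxy/haproxy.cfg | grep ' server '

# Servers the running worker actually knows about (socket path is an assumption).
echo "show servers state my-namespace_my-service_http" | socat stdio /var/run/haproxy-runtime-api.sock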

mdecalf avatar Sep 12 '25 13:09 mdecalf

Here is how I reproduced it:

  • Have cert-manager or similar installed in the cluster
  • Have a self-signed issuer in the cluster (to not annoy Let's Encrypt)
  • Deploy HAProxy Ingress Controller 3.1.11 (we use a DaemonSet)
  • Deploy a simple test deployment with a service and ingress
  • Create a Certificate object, then delete it and let it recreate/update the secret (manifests below)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello
  namespace: test-haproxy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: "hello"
  template:
    metadata:
      labels:
        app: "hello"
    spec:
      containers:
      - name: hello
        image: hashicorp/http-echo
        args:
          - "-text=Hello from HAProxy!"
        ports:
        - containerPort: 5678
---
apiVersion: v1
kind: Service
metadata:
  name: hello
  namespace: test-haproxy
spec:
  ports:
  - port: 80
    targetPort: 5678
  selector:
    app: "hello"
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hello-ingress
  namespace: test-haproxy
  annotations:
    cert-manager.io/cluster-issuer: selfsigned
spec:
  ingressClassName: haproxy
  tls:
  - hosts:
      - test.local
    secretName: test-cert-secret
  rules:
  - host: test.local
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: hello
            port:
              number: 80

# cert-for-test.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: test-cert
  namespace: test-haproxy
spec:
  secretName: test-cert-secret
  issuerRef:
    name: selfsigned
    kind: Issuer
  dnsNames:
    - test.local
  duration: 1h
  renewBefore: 5m

Check the certificate

Add your LOADBALANCER_IP

watch -n 1 'openssl s_client -connect ${LOADBALANCER_IP}:443 -servername test.local < /dev/null \
  | openssl x509 -noout -dates'

Force certificate rotation

kubectl delete secret test-cert-secret -n test-haproxy 
kubectl delete certificate test-cert -n test-haproxy 
kubectl apply -f cert-for-test.yaml

Check which pods return which certificate

for pod in $(kubectl get pods -n haproxy -o name); do
  node=$(kubectl get "$pod" -n haproxy -o jsonpath='{.spec.nodeName}')
  echo "=== $pod on node: $node ==="
  kubectl exec -n haproxy $pod -- ls -lt /etc/haproxy/certs/frontend/test-haproxy_test-cert-secret.pem

  kubectl exec -n haproxy "$pod" -- \
    openssl s_client -connect localhost:8443 -servername test.local < /dev/null \
    | openssl x509 -noout -dates
  echo ""  # Blank line for readability
done

Get the secret's creation timestamp

kubectl get secret test-cert-secret -n test-haproxy  -o yaml| grep Time
  creationTimestamp: "2025-09-09T12:51:10Z"

zadjadr avatar Sep 15 '25 07:09 zadjadr

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Oct 15 '25 08:10 stale[bot]

We have also experienced this problem since an upgrade in the spring, in one cluster. The problem has happened three times, most recently today. The certificate and the secret created by cert-manager are fine, but haproxy is serving an older one. The workaround is to force a rolling update of the pods; once they are restarted, everything is OK. Merely reloading the configuration is not enough.
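For reference, the rolling-update workaround described above boils down to something like the following (resource kind, name, and namespace are assumptions that depend on how the chart was installed):

# Restart the controller pods one by one and wait for the rollout to complete.
kubectl rollout restart daemonset/haproxy-kubernetes-ingress -n haproxy
kubectl rollout status daemonset/haproxy-kubernetes-ingress -n haproxy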

oekarlsson avatar Oct 15 '25 19:10 oekarlsson

Notice: We are not running in external mode. We installed using the helm chart with some custom configuration.

oekarlsson avatar Oct 16 '25 06:10 oekarlsson

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Nov 15 '25 11:11 stale[bot]