kubernetes-ingress
hap-ingress in external mode not updating certificates on disk (sporadic)
Setup:
- Two instances of hap-ingress (3.1.6/3.1.7) running on Debian, serving as (redundant) external ingress controllers for k8s.
- Identical setup / versions of both instances
- Pipeline updating certificates by replacing (updating) tls secrets issued by letsencrypt in k8s
Problem:
- update of tls secret in k8s leads to replacement of the runtime certificate in haproxy (good) but not of the file on disk (problem).
- the problem happens sporadically only (i.e. sometimes both hapi instances replace the disk file, sometimes only one does)
Analysis:
This happens only sporadically. I can see that another cert was successfully updated on disk by both hapi instances just 2 days before. No errors / warnings in the hapi logs.
The disk file is replaced only on one instance
Checked with
ls -ltr /opt/haproxy-ingress/config/certs/frontend/
openssl x509 -enddate -noout -in certs/frontend/haproxy-ingress_epg.[snip].pem
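For completeness, the on-disk PEM can also be compared directly against what the cluster secret currently holds; a rough sketch (the secret name and namespace are placeholders):
# what the k8s secret says the cert should be
kubectl get secret <tls-secret> -n <namespace> -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl x509 -enddate -noout
# what is actually on disk on the external host
openssl x509 -enddate -noout -in /opt/haproxy-ingress/config/certs/frontend/haproxy-ingress_epg.[snip].pem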
The runtime cert is replaced on both instances
Checked with
openssl s_client -connect [snip]:443 -servername epg.[snip] < /dev/null 2>/dev/null | openssl x509 -noout -enddate -subject
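Besides s_client, the certificate the running haproxy actually loaded can be inspected over the runtime API; a sketch, assuming socat is installed and that the socket lives under the controller's --runtime-dir (the exact socket filename may differ per setup):
# list the certificates known to the running process, then show details (incl. notAfter) for one of them
echo "show ssl cert" | socat stdio /tmp/haproxy-ingress/haproxy-runtime-api.sock
echo "show ssl cert /opt/haproxy-ingress/config/certs/frontend/haproxy-ingress_epg.[snip].pem" \
  | socat stdio /tmp/haproxy-ingress/haproxy-runtime-api.sock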
The (defunct) instance is no longer returning a valid certificate chain (i.e. issuer cert is missing)
Checked with
openssl s_client -connect [snip]:443 -servername epg.[snip] < /dev/null 2>/dev/null
OK instance:
depth=2 C=US, O=Internet Security Research Group, CN=ISRG Root X1
verify return:1
depth=1 C=US, O=Let's Encrypt, CN=R10
verify return:1
depth=0 CN=epg.[snip]
verify return:1
---
Certificate chain
0 s:CN=epg.[snip]
i:C=US, O=Let's Encrypt, CN=R10
a:PKEY: RSA, 2048 (bit); sigalg: sha256WithRSAEncryption
v:NotBefore: May 21 06:40:13 2025 GMT; NotAfter: Aug 19 06:40:12 2025 GMT
1 s:C=US, O=Let's Encrypt, CN=R10
i:C=US, O=Internet Security Research Group, CN=ISRG Root X1
a:PKEY: RSA, 2048 (bit); sigalg: sha256WithRSAEncryption
v:NotBefore: Mar 13 00:00:00 2024 GMT; NotAfter: Mar 12 23:59:59 2027 GMT
Defunct instance:
depth=0 CN=epg.[snip]
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 CN=epg.[snip]
verify error:num=21:unable to verify the first certificate
verify return:1
depth=0 CN=epg.[snip]
verify return:1
---
Certificate chain
0 s:CN=epg.[snip]
i:C=US, O=Let's Encrypt, CN=R10
a:PKEY: RSA, 2048 (bit); sigalg: sha256WithRSAEncryption
v:NotBefore: May 21 06:40:13 2025 GMT; NotAfter: Aug 19 06:40:12 2025 GMT
Clueless as to how to debug this further. Any help appreciated.
Hello @joachimbuechse,
The certificates are written to disk only if there is a haproxy reload. The runtime certificates are updated in any case, and it seems that is OK for you. It's possible that on the instance where the certificate is not updated on disk, there was no reload. You could check if that's the case.
There is also a startup flag --disable-writing-only-if-reload that disables this behaviour and always updates the certificates on disk immediately (even if no reload of haproxy is happening).
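In case it helps others checking the same thing, one way to correlate a missed disk write with a (missing) reload is to compare the controller/haproxy activity with the time the secret was updated; a rough sketch for an external-mode host, assuming systemd/journald logging (the unit name and paths are just examples):
# when did the controller last log a reload? (unit name and log wording may differ)
journalctl -u haproxy-ingress-controller --since "2 hours ago" | grep -i reload
# when was the current haproxy process started, and when was the PEM last written?
ps -o pid,lstart,cmd -C haproxy
stat -c '%y %n' /opt/haproxy-ingress/config/certs/frontend/*.pem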
I've updated the original ticket description as I missed the most important part in the ticket. When this situation happens, hapi does not return a full certificate chain but only the site cert itself without the issuer certificate.
I've tested with --disable-writing-only-if-reload and without. The option is apparently not required for certificate writing. I just pushed an older tls secret. Two instances of hapi, one configured with the option:
[root@[snip]]:/opt/haproxy-ingress/config/certs/frontend # ps -edf | grep haproxy-ingress-controller
root 2021194 1 0 10:04 ? 00:00:01 /usr/local/bin/haproxy-ingress-controller --program=/usr/sbin/haproxy --config-dir=/opt/haproxy-ingress/config --maps-dir=/opt/haproxy-ingress/config/maps --disable-writing-only-if-reload --runtime-dir=/tmp/haproxy-ingress --external --ingress.class=public-ingress --configmap=haproxy-ingress/public-ingress --default-backend-service=default/deny-all --ipv4-bind-address=[snip] --disable-ipv6 --http-bind-port=80 --https-bind-port=443 --stats-bind-port=8404
the other without
[root@slb-ms08]:/opt/haproxy-ingress/config/certs/frontend # ps -edf | grep ingress
root 1503906 1 0 Apr08 ? 00:21:05 /usr/local/bin/haproxy-ingress-controller --program=/usr/sbin/haproxy --config-dir=/opt/haproxy-ingress/config --maps-dir=/opt/haproxy-ingress/config/maps --runtime-dir=/tmp/haproxy-ingress --external --ingress.class=public-ingress --configmap=haproxy-ingress/public-ingress --default-backend-service=default/deny-all --ipv4-bind-address=[snip] --disable-ipv6 --http-bind-port=80 --https-bind-port=443 --stats-bind-port=8404
Both updated the file in /opt/haproxy-ingress/config/certs/frontend/.
I can't yet say if the option makes a difference for the reliability of the disk writes, as this issue only happens sporadically for us.
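Since the failure is sporadic, it may also help to keep a watch on the cert directory so the next miss can be correlated with the controller logs; a sketch, assuming inotify-tools is installed on the Debian hosts (the log path is just an example):
# log every write/rename into the cert directory with a timestamp
inotifywait -m -e close_write,moved_to --timefmt '%F %T' --format '%T %w%f %e' \
  /opt/haproxy-ingress/config/certs/frontend/ \
  | tee -a /var/log/haproxy-ingress-cert-writes.log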
Problem happened again with option --disable-writing-only-if-reload enabled and hap-ingress 3.1.7
This issue may be broader in scope than certificates. My team recently experienced an issue where many services became unreachable across our infrastructure in a way that was consistent with HAProxy reloading deeply stale backend-to-IP mappings from disk. Unfortunately, in our urgency to resolve the production incident, we were unable to collect much supporting data. We'll be following up on that, but hopefully this is useful as a single data point in the interim.
As the incident occurred a few days after an upgrade from 3.0.4 to 3.1.7, we're scrutinizing the parallel filesystem write changes around https://github.com/haproxytech/kubernetes-ingress/commit/1b84fd904b5ff75795f8294552f0a08621f9ef53 and https://github.com/haproxytech/kubernetes-ingress/commit/38f57e8332781ab0e230597655beac4270d88f87 as starting points for our internal investigation. Again, until we have a definitive repro case, there's no data to directly implicate these changes, but they seem like the area where this sort of bug would be most likely to be introduced.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Sad that this ticket gets no attention, we have rolled back to version 3.0.9 to circumvent the issue.
@oktalz
Version used before: 3.1.11
We are encountering the same issue. Downgrading to 3.0.9 helped.
After a certificate change, the controller reloads correctly:
haproxy-kubernetes-ingress-qlrkz kubernetes-ingress-controller 2025/09/09 15:00:53 INFO controller/controller.go:216 [transactionID=ada9744a-3f01-4cd8-b973-c81ac9916221] HAProxy reloaded
haproxy-kubernetes-ingress-9ktn2 kubernetes-ingress-controller 2025/09/09 15:00:53 INFO controller/controller.go:216 [transactionID=68bc9d5e-03c5-4e69-83c2-cccd4feced03] HAProxy reloaded
haproxy-kubernetes-ingress-hxkhg kubernetes-ingress-controller 2025/09/09 15:00:50 INFO controller/controller.go:216 [transactionID=dfb15282-ee77-44b8-8186-bf7e09347d8f] HAProxy reloaded
haproxy-kubernetes-ingress-slbzv kubernetes-ingress-controller 2025/09/09 15:00:52 INFO controller/controller.go:216 [transactionID=a6518da7-861b-4d8b-8a1a-d516786170d0] HAProxy reloaded
haproxy-kubernetes-ingress-xt2gb kubernetes-ingress-controller 2025/09/09 15:00:54 INFO controller/controller.go:216 [transactionID=e0e5b2d0-43fb-468b-abe7-a17622fdf997] HAProxy reloaded
Files on disk:
$ for pod in $(kubectl get pods -n haproxy -o name); do kubectl exec $pod -n haproxy -- ls -lt /etc/haproxy/certs/frontend/test-haproxy_test-cert-secret.pem ; done
-rw-rw-rw- 1 haproxy haproxy 2696 Sep 9 14:51 /etc/haproxy/certs/frontend/test-haproxy_test-cert-secret.pem
-rw-rw-rw- 1 haproxy haproxy 2696 Sep 9 14:51 /etc/haproxy/certs/frontend/test-haproxy_test-cert-secret.pem
-rw-rw-rw- 1 haproxy haproxy 2696 Sep 9 14:51 /etc/haproxy/certs/frontend/test-haproxy_test-cert-secret.pem
-rw-rw-rw- 1 haproxy haproxy 2700 Sep 9 14:40 /etc/haproxy/certs/frontend/test-haproxy_test-cert-secret.pem
-rw-rw-rw- 1 haproxy haproxy 2700 Sep 9 14:40 /etc/haproxy/certs/frontend/test-haproxy_test-cert-secret.pem
But the served certificate is still the old one on some haproxy pods:
❯ for pod in $(kubectl get pods -n haproxy -o name); do
node=$(kubectl get "$pod" -n haproxy -o jsonpath='{.spec.nodeName}')
echo "=== $pod on node: $node ==="
kubectl exec -n haproxy "$pod" -- \
openssl s_client -connect localhost:8443 -servername test.local < /dev/null \
| openssl x509 -noout -dates
echo "" # Blank line for readability
done
=== pod/haproxy-kubernetes-ingress-9ktn2 on node: default-145b0-ywdhn ===
notBefore=Sep 9 12:51:10 2025 GMT
notAfter=Dec 8 12:51:10 2025 GMT
=== pod/haproxy-kubernetes-ingress-hxkhg on node: default-145b0-dhixv ===
notBefore=Sep 9 12:51:10 2025 GMT
notAfter=Dec 8 12:51:10 2025 GMT
=== pod/haproxy-kubernetes-ingress-qlrkz on node: default-145b0-lvqlo ===
notBefore=Sep 9 12:51:10 2025 GMT
notAfter=Dec 8 12:51:10 2025 GMT
=== pod/haproxy-kubernetes-ingress-slbzv on node: default-145b0-anmca ===
notBefore=Sep 9 12:40:41 2025 GMT
notAfter=Dec 8 12:40:41 2025 GMT
=== pod/haproxy-kubernetes-ingress-xt2gb on node: default-145b0-pvoli ===
notBefore=Sep 9 12:40:41 2025 GMT
notAfter=Dec 8 12:40:41 2025 GMT
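A quick way to pin down where the stale data lives is to compare the certificate serial in the Secret, in the on-disk PEM, and in what each pod actually serves; a sketch based on the commands above (the pod name is a placeholder):
# serial of the cert currently stored in the Secret
kubectl get secret test-cert-secret -n test-haproxy -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl x509 -noout -serial
# serial of the PEM on disk inside one controller pod (repeat per pod)
kubectl exec -n haproxy <pod> -- \
  openssl x509 -noout -serial -in /etc/haproxy/certs/frontend/test-haproxy_test-cert-secret.pem
# serial of the cert the running haproxy actually serves
kubectl exec -n haproxy <pod> -- sh -c \
  'openssl s_client -connect localhost:8443 -servername test.local </dev/null 2>/dev/null | openssl x509 -noout -serial'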
@zadjadr we will try to figure out what is happening; just because we weren't able to reproduce it does not mean it's working as it should.
I'll keep this open and try to see if we can do something
@oktalz thanks a lot. If there is anything I can try out, test, or provide more info on, please tell me.
Already mentioned here https://github.com/haproxytech/kubernetes-ingress/issues/702#issuecomment-2891997594
Quite the same issue with backend reloads. If too many pod restarts happen in a short amount of time, the backend ends up with 0 servers.
Looking at the haproxy.cfg, everything is correct with the target and pod IP, but the process itself didn't reload the configuration, so it redirects to nothing with an empty backend.
A rollout restart is enough to fix the problem, but we got a lot of production incidents because of that.
And it has happened since version 2.X.
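For the backend variant of this, one way to confirm that the running process has diverged from the config on disk is to compare the runtime server state with haproxy.cfg; a sketch, assuming socat is available in the pod (backend name and socket path are placeholders that depend on the deployment):
# servers the running haproxy actually knows for a given backend
echo "show servers state <backend>" | socat stdio /var/run/haproxy-runtime-api.sock
# servers declared in the generated config on disk
grep -A 20 "^backend <backend>" /etc/haproxy/haproxy.cfg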
Here is how I reproduced it:
- Have cert-manager or similar installed in the cluster
- Have a selfsigned issuer in the cluster (to not annoy letsencrypt)
- Deploy Haproxy Ingress Controller 3.1.11 (we use a Daemonset)
- Deploy a simple test deployment with service, ingress
- Create a Certificate object & then delete and let it recreate/update the secret
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello
  namespace: test-haproxy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: "hello"
  template:
    metadata:
      labels:
        app: "hello"
    spec:
      containers:
        - name: hello
          image: hashicorp/http-echo
          args:
            - "-text=Hello from HAProxy!"
          ports:
            - containerPort: 5678
---
apiVersion: v1
kind: Service
metadata:
  name: hello
  namespace: test-haproxy
spec:
  ports:
    - port: 80
      targetPort: 5678
  selector:
    app: "hello"
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hello-ingress
  namespace: test-haproxy
  annotations:
    cert-manager.io/cluster-issuer: selfsigned
spec:
  ingressClassName: haproxy
  tls:
    - hosts:
        - test.local
      secretName: test-cert-secret
  rules:
    - host: test.local
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: hello
                port:
                  number: 80
# cert-for-test.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: test-cert
  namespace: test-haproxy
spec:
  secretName: test-cert-secret
  issuerRef:
    name: selfsigned
    kind: Issuer
  dnsNames:
    - test.local
  duration: 1h
  renewBefore: 5m
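To see when cert-manager actually rotated the certificate, the Certificate status can be checked as well (field names as of cert-manager v1):
kubectl get certificate test-cert -n test-haproxy \
  -o jsonpath='{.status.notBefore}{"\n"}{.status.notAfter}{"\n"}{.status.renewalTime}{"\n"}'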
Check the certificate
Add your LOADBALANCER_IP
watch -n 1 'openssl s_client -connect ${LOADBALANCER_IP}:443 -servername test.local < /dev/null \
| openssl x509 -noout -dates'
Force certificate rotation
kubectl delete secret test-cert-secret -n test-haproxy
kubectl delete certificate test-cert -n test-haproxy
kubectl apply -f cert-for-test.yaml
Check which pods return which certificate
for pod in $(kubectl get pods -n haproxy -o name); do
node=$(kubectl get "$pod" -n haproxy -o jsonpath='{.spec.nodeName}')
echo "=== $pod on node: $node ==="
kubectl exec -n haproxy $pod -- ls -lt /etc/haproxy/certs/frontend/test-haproxy_test-cert-secret.pem
kubectl exec -n haproxy "$pod" -- \
openssl s_client -connect localhost:8443 -servername test.local < /dev/null \
| openssl x509 -noout -dates
echo "" # Blank line for readability
done
Get the secret's timestamp
kubectl get secret test-cert-secret -n test-haproxy -o yaml | grep Time
creationTimestamp: "2025-09-09T12:51:10Z"
We have also experienced this problem since an upgrade during the spring in one cluster. The problem has happened three times, the last one today. The certificate and the secret created by cert-manager are all fine, but haproxy is using an older one. The solution is to force a rolling update of the pods. When they are restarted, everything is OK. Only reloading the configuration is not enough.
Notice: We are not running in external mode. We installed using the helm chart with some custom configuration.
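For anyone hitting this in-cluster, the rolling-update workaround mentioned above is a one-liner; the workload name and namespace below match the ones shown earlier in this thread and may differ in your install (it may also be a Deployment rather than a DaemonSet):
kubectl rollout restart daemonset/haproxy-kubernetes-ingress -n haproxy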