Controller not detecting a service
We are having a problem with HAProxy not detecting services within the cluster when routing requests to cert-manager pods. For example, the HAProxy controller returns:
2022/08/16 13:46:12 ERROR ingress/ingress.go:245 Ingress 'development/my-service': service 'development/cm-acme-http-solver-498h4' does not exist
2022/08/16 13:46:12 INFO handler/https.go:123 removing client TLS authentication
And the rule from Ingress is like so:
ingressClassName: external-haproxy
rules:
- host: example.com
  http:
    paths:
    - backend:
        service:
          name: cm-acme-http-solver-498h4
          port:
            number: 8089
      path: /.well-known/path/to/acme-challenge
      pathType: ImplementationSpecific
I've checked haproxy.cfg and I cannot find the internal IP of either the mentioned service or the pod.
We also run split-horizon DNS with two HAProxy deployments - internal and external. I've also found the offending line: https://github.com/haproxytech/kubernetes-ingress/blob/v1.8.3/pkg/ingress/ingress.go#L245
Do you know why this happens? It seems that HAProxy cannot find the service, despite the service existing in the cluster.
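For what it's worth, a lookup straight against the API server (the same thing kubectl get svc / kubectl get endpoints shows) confirms the service is there while the controller is still logging the error. A minimal client-go sketch, assuming a local kubeconfig; the namespace and service name are taken from the log above:

// check_svc.go: sketch only - confirm the service and its endpoints exist
// while the controller is still reporting "service does not exist".
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	ns, name := "development", "cm-acme-http-solver-498h4"

	svc, err := client.CoreV1().Services(ns).Get(context.TODO(), name, metav1.GetOptions{})
	if err != nil {
		fmt.Println("service lookup failed:", err)
		return
	}
	fmt.Println("service exists, ClusterIP:", svc.Spec.ClusterIP)

	ep, err := client.CoreV1().Endpoints(ns).Get(context.TODO(), name, metav1.GetOptions{})
	if err != nil {
		fmt.Println("endpoints lookup failed:", err)
		return
	}
	fmt.Println("endpoint subsets:", len(ep.Subsets))
}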
I just had this issue happen too. I ended up restarting the HAProxy controller deployment, and the freshly started pods were able to see the service and traffic worked. I should note I'm using the latest ingress version, 1.8.3.
Yes, this issue occurred in my environment too after adding a new service. The error message was the same. The only thing that made the new service accessible was restarting the HAProxy controller.
Hi, thanks for reporting. To help in tracking the issue, could you add a description of the environment? Which version of the controller is used, how many services and ingresses are there, how old was the controller, did any event occur in the cluster (like a restart of pods, etc.), has the service been found and then lost, etc. Thanks in advance.
I've got quite a few ingresses (many different hostnames pointing to a similar backend service).
k get ingress | wc -l
298
It would appear I've got 297 ingresses currently. This was for a new backend service, but I also noted that certmanager's ingress was not able to find the service to validate the certificate request.
- Controller installed via Helm, chart kubernetes-ingress-1.22.4
- Application version is 1.8.3
- 3 pods for the external HAProxy; they are around 40 days old, with no pod restarts
- I've checked haproxy.cfg; the service does not appear there
- There are around 55 ingresses split between two HAProxy deployments (one controller for internal/VPC connectivity, one for external connectivity to the cluster)
@ivanmatmati It seems that restarting did the job! @ocdi @nopsenica @ivanmatmati thank you! But why did the controller fail to find the configuration?
We have been experiencing the same issue since version 1.8.x. I mentioned it in https://github.com/haproxytech/kubernetes-ingress/issues/460 originally. Since @ivanmatmati asked me to move to this issue, here are my findings summarized:
- A restart of the controller helps
- We see the same ERROR log lines as @petar-nikolovski-cif posted
- memory usage increases over time (see attached image)
- maybe the error originates from store.go (lines 186-189)
Hi @dschuldt, thanks for that. Actually, the code lines you're pointing to are the consequence rather than the cause. We still need to figure out what is happening. We're planning a long-term run of the controller to be in a better position to observe the events/issues related to this matter. @petar-nikolovski-cif That's the question still to be solved. We'll keep you informed. Thanks.
I'm experiencing the exact same issue with the controller not being able to find certain services. I tried to restart the controller deployment, and also tried to uninstall and install chart version 1.22.4, but it didn't help. I had to downgrade to chart version 1.21.1 for it to start working again. I did not try any of the charts between 1.22.4 and 1.21.1, so I do not know if a newer version than 1.21.1 works. All I know is that 1.22.4 does not work.
On a test cluster I tried reproducing the bug by abusing the cluster - creating 1000 services (admittedly all pointing to the same deployment) and 1000 ingresses. Doing it all at once was quite intense and used lots of CPU, causing the controller to scale up. I think it struggled with updating all the ingress load-balancer statuses, but ultimately it worked.
I deleted all those 1000 services, waited for things to settle, then created smaller batches of services and ingresses. It turns out my config had an HTTP check configured, which, because every backend was hitting the same pod, overloaded that pod and caused it to crash. Oops. Aside from that, the controller appeared to keep everything in sync. It has only been running for 95 minutes, so it's not a full test.
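For reference, the bulk creation can be done with a simple loop against the API; a rough client-go sketch (the namespace, host pattern, ports, ingress class, and the shared "echo" selector here are placeholders, and the details differ from what I actually ran):

// repro_gen.go: sketch only - create many near-identical services and
// ingresses so the controller has plenty of resources to keep in sync.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	ns := "stress-test" // placeholder namespace
	ingressClass := "haproxy"
	pathType := networkingv1.PathTypePrefix

	for i := 0; i < 1000; i++ {
		name := fmt.Sprintf("echo-%04d", i)

		// Every service selects the same single deployment (label app=echo).
		svc := &corev1.Service{
			ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: ns},
			Spec: corev1.ServiceSpec{
				Selector: map[string]string{"app": "echo"},
				Ports:    []corev1.ServicePort{{Port: 80, TargetPort: intstr.FromInt(8080)}},
			},
		}

		// One ingress per service, each with its own hostname.
		ing := &networkingv1.Ingress{
			ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: ns},
			Spec: networkingv1.IngressSpec{
				IngressClassName: &ingressClass,
				Rules: []networkingv1.IngressRule{{
					Host: name + ".example.com",
					IngressRuleValue: networkingv1.IngressRuleValue{
						HTTP: &networkingv1.HTTPIngressRuleValue{
							Paths: []networkingv1.HTTPIngressPath{{
								Path:     "/",
								PathType: &pathType,
								Backend: networkingv1.IngressBackend{
									Service: &networkingv1.IngressServiceBackend{
										Name: name,
										Port: networkingv1.ServiceBackendPort{Number: 80},
									},
								},
							}},
						},
					},
				}},
			},
		}

		if _, err := client.CoreV1().Services(ns).Create(context.TODO(), svc, metav1.CreateOptions{}); err != nil {
			fmt.Println("service create failed:", name, err)
		}
		if _, err := client.NetworkingV1().Ingresses(ns).Create(context.TODO(), ing, metav1.CreateOptions{}); err != nil {
			fmt.Println("ingress create failed:", name, err)
		}
	}
}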
The only thing I was thinking that might cause an issue is perhaps interruptions to the control plane, like when I upgrade it. So I upgraded the control plane to a newer patch version, without upgrading the nodes, but that too was fine.
I kind of wish this had triggered the bug, but I suspect it isn't that easy. If there are any other tests like this that would be helpful, I'm happy to run them.
Thanks @ocdi for your effort and your report. We still have to run the test we're planning to do. This is indeed not a trivial issue. It can involve multiple actors (cluster, resource, and network events) and may require a timing condition as well.
In my case, it had nothing to do with resources or load. There was no real load on the cluster, and I could create working services and ingresses in other namespaces. It was just one particular namespace in which the HAProxy controller could not find services. I tried deleting and recreating the namespace, and also tried creating, in the affected namespace, several services that worked in other namespaces, but HAProxy could still not find them. I do not know if it could have something to do with the name of the namespace, for instance. My namespace was named "vanilla-deploy". Could it have something to do with the characters or the length of the name? I haven't had time to investigate any further.
I am struggling with the same issue.
2022/09/06 05:14:58 ERROR ingress/ingress.go:251 Ingress 'landmark/crate-private': service 'landmark/crate' does not exist
It is also happening for us. We are running the 1.8.3 Docker image, and even after upgrading to the 1.8.4 Docker image, the behavior is the same. As @ocdi said, the workaround when a new service is deployed seems to be: delete the HAProxy pods, delete the service pods, and after that it works. We would love to see a hotfix for this ASAP.
We are facing this issue as well. Is it possible to prioritize this, as it has a high impact (imho)? Cheers
Seeing the same here on 1.8.3. If there's no fix in the works, can we get confirmation for a stable version to revert to in the meantime?
cc @oktalz if you have any thoughts?
The thing that makes this one difficult is that the underlying code dealing with syncing has been there for ages. I had a look, and while I'm not a Go programmer, there is code depending on the Kubernetes informer to get updates from the control plane.
I'm wondering whether this bug happens because the informer hits an error and reconnects, missing data while the error lasts. It is weird that this seems to happen for all HAProxy pods, though the same cause could affect all of them if it is a momentary control-plane reconnection issue (tbh I didn't check all pods, so one may have been fine).
How does the code handle errors with the informer and resync?
Hi @evandam, it's not easy to say because we can't reproduce the issue right now. From reports, it seems that versions before 1.8 could be unaffected, so I'd go for 1.7.x. @ocdi, the controller has a cache resync set by default to 10 minutes. So if any connection error happened with the informers, it shouldn't last more than 10 minutes unless reconnection is impossible. Anyway, even if a problem occurs with the informer connection, the local data we keep in a separate store shouldn't be deleted. That can only happen if the informer gets a delete event ...
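To make the informer/store relationship concrete, here is a generic client-go sketch (this is not the controller's actual code, just an illustration of the mechanism being discussed): the shared informer is created with the 10-minute resync mentioned above, and the only way an entry leaves the local store is through the delete handler, including the tombstone case after a missed watch event.

// informer_sketch.go: generic illustration of an informer feeding a local
// store; not the controller's actual code.
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Local view of services, analogous to a controller keeping its own store.
	local := map[string]*corev1.Service{}

	// 10-minute resync: the informer periodically re-delivers its cached
	// objects to the handlers as Update events; watch errors are retried
	// internally by the reflector.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	inf := factory.Core().V1().Services().Informer()

	inf.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			svc := obj.(*corev1.Service)
			local[svc.Namespace+"/"+svc.Name] = svc
		},
		UpdateFunc: func(_, obj interface{}) {
			svc := obj.(*corev1.Service)
			local[svc.Namespace+"/"+svc.Name] = svc
		},
		// The only place an entry leaves the local store. When the watch
		// missed the actual deletion, the object arrives wrapped in a
		// DeletedFinalStateUnknown tombstone and has to be unwrapped.
		DeleteFunc: func(obj interface{}) {
			svc, ok := obj.(*corev1.Service)
			if !ok {
				tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
				if !ok {
					return
				}
				svc, ok = tombstone.Obj.(*corev1.Service)
				if !ok {
					return
				}
			}
			delete(local, svc.Namespace+"/"+svc.Name)
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	cache.WaitForCacheSync(stop, inf.HasSynced)
	fmt.Println("initial sync done; local store is now maintained by the handlers above")
	<-stop
}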
@mblixter, can you check that you don't have a --namespace-whitelist or --namespace-blacklist command-line parameter for your controller?
So I was doing a bunch of work yesterday involving setting up new sites, which needed SSL certs. The controller pods had been running for about 4 days and were working fine. For no obvious reason, the "service does not exist" error occurred when adding another cert-manager solver ingress.
I saved the logs from both ingress pods and restarted the controller, which immediately solved the cert-manager challenges.
One thing that strikes me is that both pods were missing the services. What weirds me out is that the controller is seeing the new ingresses (presumably it wouldn't otherwise report not finding a service). What could be different between the ingress and the service handling?
In case it helps, these are the args for my controller:
--default-ssl-certificate=default/haproxy-kubernetes-ingress-default-cert
--configmap=default/haproxy-kubernetes-ingress
--http-bind-port=8080
--https-bind-port=8443
--default-backend-service=default/haproxy-kubernetes-ingress-default-backend
--ingress.class=haproxy
--publish-service=default/haproxy-kubernetes-ingress
--log=info
--configmap-errorfiles=default/haproxy-errorfiles-configmap
Same problem here. I'm running 3 different clusters, with the number of services ranging from a dozen up to three hundred. The issue appears on all 3 clusters. Usually it happens within a week after a restart. The affected services seem to be completely random, and only some of them stop working.
I think we can reliably reproduce this issue again and again. If it would help the project for us to demo the reproduction while you collect metrics, dumps, logs and whatnot, we'll happily set aside some time for this.
Ready for such a session @ivanmatmati
Thanks @LarsBingBong, indeed it could greatly help. I'll bring the idea to the team and get back to you if it's OK.
@ivanmatmati please do - we're ready by phone/Zoom/Meet/Teams/Slack, whatever medium we'll have this call over.
@ivanmatmati I'm sorry, I hadn't seen your comment. I can confirm that there are no blacklisted or whitelisted namespaces in my cluster, and no such CLI parameters for the controller.
@LarsBingBong , what's your time zone?
CET
Thanks, cool. Can you contact me on Slack to schedule the appointment?
I've now done so @ivanmatmati
Any updates? This happens really often :(