Controller not detecting a service
We are having a problem with HAProxy not detecting services within the cluster when routing requests to cert-manager pods. For example, the HAProxy controller returns:
2022/08/16 13:46:12 ERROR ingress/ingress.go:245 Ingress 'development/my-service': service 'development/cm-acme-http-solver-498h4' does not exist
2022/08/16 13:46:12 INFO handler/https.go:123 removing client TLS authentication
And the rule from Ingress is like so:
ingressClassName: external-haproxy
rules:
- host: example.com
  http:
    paths:
    - backend:
        service:
          name: cm-acme-http-solver-498h4
          port:
            number: 8089
      path: /.well-known/path/to/acme-challenge
      pathType: ImplementationSpecific
I've checked haproxy.cfg and I cannot find the internal IP of either the mentioned service or the pod.
We also run split-horizon DNS with two HAProxy deployments - internal and external. I've also found the offending line: https://github.com/haproxytech/kubernetes-ingress/blob/v1.8.3/pkg/ingress/ingress.go#L245
Do you know why this happens? It seems that HAProxy cannot find the service, despite the service existing in the cluster.
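For what it's worth, a lookup straight against the API server (the same thing kubectl get svc / kubectl get endpoints shows) confirms the service is there while the controller is still logging the error. A minimal client-go sketch, assuming a local kubeconfig; the namespace and service name are taken from the log above:

// check_svc.go: sketch only - confirm the service and its endpoints exist
// while the controller is still reporting "service does not exist".
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	ns, name := "development", "cm-acme-http-solver-498h4"

	svc, err := client.CoreV1().Services(ns).Get(context.TODO(), name, metav1.GetOptions{})
	if err != nil {
		fmt.Println("service lookup failed:", err)
		return
	}
	fmt.Println("service exists, ClusterIP:", svc.Spec.ClusterIP)

	ep, err := client.CoreV1().Endpoints(ns).Get(context.TODO(), name, metav1.GetOptions{})
	if err != nil {
		fmt.Println("endpoints lookup failed:", err)
		return
	}
	fmt.Println("endpoint subsets:", len(ep.Subsets))
}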
I just had this issue happen too. I ended up restarting the HAProxy controller deployment, and the freshly started pods were able to see the service and traffic worked. I should note I'm using the latest ingress version, 1.8.3.
Yes, this issue occurred in my environment too after adding a new service. The error message was the same. The only thing that made the new service accessible was restarting the HAProxy controller.
Hi, thanks for reporting. To help in tracking the issue, could you add a description of the environment? Which version of the controller is used, how many services and ingresses are there, how old was the controller, did any event occur in the cluster (like a restart of pods, etc.), has the service been found and then lost, etc. Thanks in advance.
I've got quite a few ingresses (many different hostnames pointing to a similar backend service).
k get ingress | wc -l
298
It would appear I've got 297 ingresses currently. This was for a new backend service, but I also noted that certmanager's ingress was not able to find the service to validate the certificate request.
- Controller installed via Helm, chart kubernetes-ingress-1.22.4
- Application version is 1.8.3
- 3 pods for the external HAProxy; they are around 40 days old, with no pod restarts
- I've checked haproxy.cfg; the service does not appear there
- There are around 55 ingresses split between two HAProxy deployments (one controller for internal/VPC connectivity, one for external connectivity to the cluster)
@ivanmatmati It seems that restarting did the job! @ocdi @nopsenica @ivanmatmati thank you! But why did the controller fail to find the configuration?
We have been experiencing the same issue since version 1.8.x. I mentioned it in https://github.com/haproxytech/kubernetes-ingress/issues/460 originally. Since @ivanmatmati asked me to move to this issue, here are my findings summarized:
- A restart of the controller helps
- We see the same ERROR log lines as @petar-nikolovski-cif posted
- memory usage increases over time (see attached image)
- maybe the error originates from store.go (lines 186-189)
Hi @dschuldt, thanks for that. Actually, the code lines you're pointing to are the consequence rather than the cause. We still need to figure out what is happening. We're planning a long-term run of the controller to be in a better position to observe the events/issues related to this matter. @petar-nikolovski-cif That's the question still to be solved. We'll keep you informed. Thanks.
I'm experiencing the exact same issue with the controller not being able to find certain services. I tried to restart the controller deployment, and also tried to uninstall and install chart version 1.22.4, but it didn't help. I had to downgrade to chart version 1.21.1 for it to start working again. I did not try any of the charts between 1.22.4 and 1.21.1, so I do not know if a newer version than 1.21.1 works. All I know is that 1.22.4 does not work.
On a test cluster I tried reproducing the bug by abusing the cluster - creating 1000 services (admittedly all pointing to the same deployment) and 1000 ingresses. Doing it all at once was quite intense and used lots of CPU, causing the controller to scale up. I think it struggled with updating all the ingress load-balancer statuses, but ultimately it worked.
I deleted all those 1000 services, waited for things to settle, then created smaller batches of services and ingresses. It turns out my config had an HTTP check configured, which, because every backend was hitting the same pod, overloaded that pod and caused it to crash. Oops. Aside from that, the controller appeared to keep everything in sync. It has only been running for 95 minutes, so it's not a full test.
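For reference, the bulk creation can be done with a simple loop against the API; a rough client-go sketch (the namespace, host pattern, ports, ingress class, and the shared "echo" selector here are placeholders, and the details differ from what I actually ran):

// repro_gen.go: sketch only - create many near-identical services and
// ingresses so the controller has plenty of resources to keep in sync.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	ns := "stress-test" // placeholder namespace
	ingressClass := "haproxy"
	pathType := networkingv1.PathTypePrefix

	for i := 0; i < 1000; i++ {
		name := fmt.Sprintf("echo-%04d", i)

		// Every service selects the same single deployment (label app=echo).
		svc := &corev1.Service{
			ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: ns},
			Spec: corev1.ServiceSpec{
				Selector: map[string]string{"app": "echo"},
				Ports:    []corev1.ServicePort{{Port: 80, TargetPort: intstr.FromInt(8080)}},
			},
		}

		// One ingress per service, each with its own hostname.
		ing := &networkingv1.Ingress{
			ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: ns},
			Spec: networkingv1.IngressSpec{
				IngressClassName: &ingressClass,
				Rules: []networkingv1.IngressRule{{
					Host: name + ".example.com",
					IngressRuleValue: networkingv1.IngressRuleValue{
						HTTP: &networkingv1.HTTPIngressRuleValue{
							Paths: []networkingv1.HTTPIngressPath{{
								Path:     "/",
								PathType: &pathType,
								Backend: networkingv1.IngressBackend{
									Service: &networkingv1.IngressServiceBackend{
										Name: name,
										Port: networkingv1.ServiceBackendPort{Number: 80},
									},
								},
							}},
						},
					},
				}},
			},
		}

		if _, err := client.CoreV1().Services(ns).Create(context.TODO(), svc, metav1.CreateOptions{}); err != nil {
			fmt.Println("service create failed:", name, err)
		}
		if _, err := client.NetworkingV1().Ingresses(ns).Create(context.TODO(), ing, metav1.CreateOptions{}); err != nil {
			fmt.Println("ingress create failed:", name, err)
		}
	}
}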
The only thing I was thinking that might cause an issue is perhaps interruptions to the control plane, like when I upgrade it. So I upgraded the control plane to a newer patch version, without upgrading the nodes, but that too was fine.
I kind of wish this had triggered the bug, but I suspect it isn't that easy. If there are any other tests like this that would be helpful, I'm happy to run them.
Thanks @ocdi for your effort and your report. We still have to run the test we're planning to do. This is indeed not a trivial issue. It can involve multiple actors (cluster, resource, and network events) and may require a timing condition as well.
In my case, it had nothing to do with resources or load. There was no real load on the cluster, and I could create working services and ingresses in other namespaces. It was just one particular namespace in which the HAProxy controller could not find services. I tried deleting and recreating the namespace, and also tried creating, in the affected namespace, several services that worked in other namespaces, but HAProxy could still not find them. I do not know if it could have something to do with the name of the namespace, for instance. My namespace was named "vanilla-deploy". Could it have something to do with the characters or the length of the name? I haven't had time to investigate any further.
I am struggling with the same issue.
2022/09/06 05:14:58 ERROR ingress/ingress.go:251 Ingress 'landmark/crate-private': service 'landmark/crate' does not exist
It is also happening for us. We are running the 1.8.3 Docker image, and even after upgrading to the 1.8.4 Docker image, the behavior is the same. As @ocdi said, the workaround when a new service is deployed seems to be: delete the HAProxy pods, delete the service pods, and after that it works. We would love to see a hotfix for this ASAP.
We are facing this issue as well. Is it possible to prioritize this, as it has a high impact (imho)? Cheers
Seeing the same here on 1.8.3. If there's no fix in the works, can we get confirmation for a stable version to revert to in the meantime?
cc @oktalz if you have any thoughts?
The thing that makes this one difficult is that the underlying code dealing with syncing has been there for ages. I had a look, and while I'm not a Go programmer, there is code depending on the Kubernetes informer to get updates from the control plane.
I'm wondering whether this bug happens because the informer hits an error and reconnects, missing data while the error lasts. It is weird that this seems to happen for all HAProxy pods, though the same cause could affect all of them if it is a momentary control-plane reconnection issue (tbh I didn't check all pods, so one may have been fine).
How does the code handle errors with the informer and resync?
Hi @evandam, it's not easy to say because we can't reproduce the issue right now. From reports, it seems that versions before 1.8 could be unaffected, so I'd go for 1.7.x. @ocdi, the controller has a cache resync set by default to 10 minutes. So if any connection error happened with the informers, it shouldn't last more than 10 minutes unless reconnection is impossible. Anyway, even if a problem occurs with the informer connection, the local data we keep in a separate store shouldn't be deleted. That can only happen if the informer gets a delete event ...
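To make the informer/store relationship concrete, here is a generic client-go sketch (this is not the controller's actual code, just an illustration of the mechanism being discussed): the shared informer is created with the 10-minute resync mentioned above, and the only way an entry leaves the local store is through the delete handler, including the tombstone case after a missed watch event.

// informer_sketch.go: generic illustration of an informer feeding a local
// store; not the controller's actual code.
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Local view of services, analogous to a controller keeping its own store.
	local := map[string]*corev1.Service{}

	// 10-minute resync: the informer periodically re-delivers its cached
	// objects to the handlers as Update events; watch errors are retried
	// internally by the reflector.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	inf := factory.Core().V1().Services().Informer()

	inf.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			svc := obj.(*corev1.Service)
			local[svc.Namespace+"/"+svc.Name] = svc
		},
		UpdateFunc: func(_, obj interface{}) {
			svc := obj.(*corev1.Service)
			local[svc.Namespace+"/"+svc.Name] = svc
		},
		// The only place an entry leaves the local store. When the watch
		// missed the actual deletion, the object arrives wrapped in a
		// DeletedFinalStateUnknown tombstone and has to be unwrapped.
		DeleteFunc: func(obj interface{}) {
			svc, ok := obj.(*corev1.Service)
			if !ok {
				tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
				if !ok {
					return
				}
				svc, ok = tombstone.Obj.(*corev1.Service)
				if !ok {
					return
				}
			}
			delete(local, svc.Namespace+"/"+svc.Name)
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	cache.WaitForCacheSync(stop, inf.HasSynced)
	fmt.Println("initial sync done; local store is now maintained by the handlers above")
	<-stop
}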
@mblixter, can you check that you don't have a --namespace-whitelist or --namespace-blacklist command-line parameter for your controller?
So I was doing a bunch of work yesterday involving setting up new sites, which needed SSL certs. The controller pods had been running for about 4 days and were working fine. For no obvious reason, the "service does not exist" error occurred when adding another cert-manager solver ingress.
I saved the logs from both ingress pods and restarted the controller, which immediately solved the cert-manager challenges.
One thing that strikes me is that both pods were missing the services. What weirds me out is that the controller is seeing the new ingresses (presumably it wouldn't otherwise report not finding a service). What could be different between the ingress and the service handling?
In case it helps, these are the args for my controller:
--default-ssl-certificate=default/haproxy-kubernetes-ingress-default-cert
--configmap=default/haproxy-kubernetes-ingress
--http-bind-port=8080
--https-bind-port=8443
--default-backend-service=default/haproxy-kubernetes-ingress-default-backend
--ingress.class=haproxy
--publish-service=default/haproxy-kubernetes-ingress
--log=info
--configmap-errorfiles=default/haproxy-errorfiles-configmap
Same problem here. I'm running 3 different clusters, with the number of services ranging from a dozen up to three hundred. The issue appears on all 3 clusters. Usually it happens within a week after a restart. The affected services seem to be completely random, and only some of them stop working.
I think we can reliably reproduce this issue again and again. If it would help the project for us to demo the reproduction while you collect metrics, dumps, logs and whatnot, we'll happily set aside some time for this.
Ready for such a session @ivanmatmati
Thanks @LarsBingBong, indeed it could greatly help. I'll bring the idea to the team and get back to you if it's OK.
@ivanmatmati please do - we're ready by phone/Zoom/Meet/Teams/Slack, whatever medium we'll have this call over.
@ivanmatmati I'm sorry, I hadn't seen your comment. I can confirm that there are no blacklisted or whitelisted namespaces in my cluster, and no such CLI parameters for the controller.
@LarsBingBong , what's your time zone?
CET
Thanks, cool. Can you contact me on Slack to schedule the appointment?
I've now done so @ivanmatmati
Any updates? This happens really often :(