cloud-provider-azure
Ingress can suddenly break health check of the LoadBalancer
What happened:
I created a cluster and an ingress following https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/aks/ingress-internal-ip.md. It used to work before 1.22.6.
On 1.22.6 the ingress is not working, because the load balancer health check protocol is HTTP and the probe path is /healthz on port 80.
But the ingress spec contains:

```yaml
- path: /(.*)
  pathType: Prefix
  backend:
    service:
      name: aks-helloworld
```

which breaks /healthz, since aks-helloworld is served on /.
That basically means that any user with enough permissions to add an ingress can break the whole cluster.
Maybe TCP should be used by default instead of HTTP.
What you expected to happen:
The default configuration works reliably.
How to reproduce it (as minimally and precisely as possible):
Follow https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/aks/ingress-internal-ip.md on version 1.22.6.
- @MartinForReal to have a look
@rumatavz Please change appProtocol to TCP in the Service manifest. As documented at https://kubernetes-sigs.github.io/cloud-provider-azure/topics/loadbalancer/#custom-load-balancer-health-probe, the SLB probe uses HTTP because appProtocol in the Service is http. FYI
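To sketch that suggestion (the service and port names below are placeholders, not taken from the original manifest), setting appProtocol on the Service port would make cloud-provider-azure provision a TCP probe instead of an HTTP probe:

```yaml
# Hypothetical Service sketch: metadata names are placeholders.
# With appProtocol: tcp, the SLB health probe falls back to a plain
# TCP connect check rather than an HTTP GET that the ingress could route away.
apiVersion: v1
kind: Service
metadata:
  name: my-ingress-service
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  ports:
    - name: http
      port: 80
      targetPort: 80
      appProtocol: tcp
```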
I used another workaround: I added the azure-load-balancer-health-probe-protocol annotation, and I don't have problems with this anymore.
But what I'm saying is that the default configuration is dangerous, and other users may find their whole cluster not working because of this issue.
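For reference, the annotation workaround mentioned above might look roughly like this (the service name is a placeholder; the annotation is documented in the cloud-provider-azure load balancer docs linked earlier):

```yaml
# Hypothetical sketch: force the SLB health probe to TCP via annotation,
# regardless of the port's appProtocol.
apiVersion: v1
kind: Service
metadata:
  name: my-ingress-service
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-health-probe-protocol: tcp
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 80
```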
Maybe I'm wrong, but the docs seem to suggest that there is a huge change in behaviour from 1.23 to 1.24, potentially amplifying this issue:

> For clusters >1.24, spec.ports.appProtocol would be used as probe protocol and / would be used as default probe request path (service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path could be used to change to a different request path).

nginx-ingress, for example, runs into this issue in its default configuration for its plain HTTP port. When admins upgrade from Kubernetes 1.23 to 1.24, their nginx-ingress will suddenly stop working, no?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Mark this issue or PR as rotten with `/lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
> Maybe I'm wrong, but the docs seem to suggest that there is a huge change in behaviour from 1.23 to 1.24, potentially amplifying this issue:
>
> For clusters >1.24, spec.ports.appProtocol would be used as probe protocol and / would be used as default probe request path (service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path could be used to change to a different request path).
>
> nginx-ingress for example runs into this issue in its default configuration for its plain HTTP port. When admins upgrade from Kubernetes 1.23 to 1.24, their nginx-ingress will suddenly stop working, no?
I confirm our nginx-ingress broke this morning after an auto update to 1.24... I had to add an annotation in the Helm values for nginx-ingress to configure the health probe path.
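The fix described above might look like this in Helm values (a sketch, assuming the official ingress-nginx chart layout; the annotation is the one documented in the cloud-provider-azure load balancer docs):

```yaml
# Sketch of ingress-nginx Helm values: point the SLB HTTP probe at the
# controller's own /healthz endpoint instead of the default / path,
# so user-defined ingress rules cannot break the health check.
controller:
  service:
    annotations:
      service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: /healthz
```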
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
In response to this:

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with Issue Triage
>
> Please send feedback to sig-contributor-experience at kubernetes/community.
> /close not-planned
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.