cloud-provider-openstack

[occm] LoadBalancer with externalTrafficPolicy=Local does not create health monitors by default

Open kralicky opened this issue 9 months ago • 1 comments

/kind bug

What happened:

The default behavior of OCCM is not to create health monitors for load balancers, unless enabled globally via config file or for individual LoadBalancers using the loadbalancer.openstack.org/enable-health-monitor annotation.

With the default OCCM configuration, if a LoadBalancer service is created with externalTrafficPolicy: Local, the OpenStack LB will be configured incorrectly and with no warning. Importantly, the failure mode can be extremely confusing and difficult to pin down: not every workload behind a LoadBalancer configured this way will necessarily fail, depending on replica count and pod distribution, and some workloads might fail in different ways than others or otherwise behave unexpectedly.

What you expected to happen:

OCCM should either:

  • Error when attempting to reconcile a LoadBalancer service with externalTrafficPolicy: Local if health monitors are disabled, or
  • Always create health monitors when reconciling a LoadBalancer service with externalTrafficPolicy: Local, regardless of the presence of the annotation or the config option.
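The two options above can be sketched as follows. This is a minimal illustration in Python, not OCCM's actual Go code: the Service class and the function names are hypothetical, and treating the annotation as an override of the global create-monitor option is an assumption about precedence. Only the annotation name and the create-monitor option come from this issue.

```python
from dataclasses import dataclass, field

# Annotation name as quoted in this issue.
ENABLE_HEALTH_MONITOR_ANNOTATION = "loadbalancer.openstack.org/enable-health-monitor"


@dataclass
class Service:
    """Hypothetical, minimal stand-in for a Kubernetes LoadBalancer Service."""
    external_traffic_policy: str = "Cluster"
    annotations: dict = field(default_factory=dict)


def monitor_requested(svc: Service, create_monitor: bool) -> bool:
    """Behavior described in the issue: monitors are created only on opt-in.

    Assumption: the per-service annotation, when present, overrides the
    global create-monitor option.
    """
    value = svc.annotations.get(ENABLE_HEALTH_MONITOR_ANNOTATION)
    if value is None:
        return create_monitor
    return value.strip().lower() == "true"


def reconcile_strict(svc: Service, create_monitor: bool) -> bool:
    """Option 1: refuse to reconcile Local services without a health monitor."""
    enabled = monitor_requested(svc, create_monitor)
    if svc.external_traffic_policy == "Local" and not enabled:
        raise ValueError(
            "externalTrafficPolicy: Local requires a health monitor; "
            "enable create-monitor or set the enable-health-monitor annotation"
        )
    return enabled


def reconcile_forced(svc: Service, create_monitor: bool) -> bool:
    """Option 2: always create a health monitor for Local services."""
    if svc.external_traffic_policy == "Local":
        return True
    return monitor_requested(svc, create_monitor)
```

Option 1 surfaces the misconfiguration as an error at reconcile time, while option 2 silently does the right thing but would not work on clouds that cannot create health monitors.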

How to reproduce it:

  1. Create a cluster with more than one worker node and OCCM configured with only global settings and everything else default
  2. Deploy a workload with 1 replica (or fewer replicas than worker nodes) behind a LoadBalancer service with externalTrafficPolicy: Local
  3. Repeatedly attempt to make a request to the service
  4. Observe that only one out of every [# worker nodes] requests goes through
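The outcome of the steps above can be modeled with a small simulation. This is a sketch under stated assumptions: the load balancer distributes requests round-robin over its members, a node without a local pod drops the request (as kube-proxy does for externalTrafficPolicy: Local), and a health monitor removes such nodes from rotation. The function and node names are hypothetical.

```python
from itertools import cycle, islice


def simulate(worker_nodes, nodes_with_pods, requests, health_monitor=False):
    """Return how many of `requests` round-robin requests reach a pod.

    Assumption: with a health monitor, nodes that host no pod for the
    service are marked unhealthy and taken out of rotation.
    """
    members = [
        node for node in worker_nodes
        if not health_monitor or node in nodes_with_pods
    ]
    if not members:
        return 0
    # Requests landing on a node without a local pod are dropped.
    return sum(
        1 for node in islice(cycle(members), requests)
        if node in nodes_with_pods
    )


if __name__ == "__main__":
    nodes = ["worker-a", "worker-b", "worker-c"]
    # One replica on worker-a, no health monitor: 3 of 9 requests succeed.
    print(simulate(nodes, {"worker-a"}, 9))
    # Same layout with a health monitor: all 9 requests succeed.
    print(simulate(nodes, {"worker-a"}, 9, health_monitor=True))
```

With three worker nodes and one replica, the model reproduces the "one out of every [# worker nodes]" success rate from step 4.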

Anything else we need to know?:

Would be happy to contribute a fix for this

Environment:

  • openstack-cloud-controller-manager (or other related binary) version: 1.31.1
  • OpenStack version: yoga
  • Others:

kralicky, Mar 25 '25 21:03

> OpenStack LB will be configured incorrectly and with no warning

could you please provide more details on what exactly is incorrect?

At first glance it's a duplicate of #1770.

See also #2869

kayrus, May 21 '25 07:05

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot, Aug 19 '25 08:08

> could you please provide more details on what exactly is incorrect?

externalTrafficPolicy: Local only makes sense in conjunction with health monitors to determine which nodes are healthy and can receive traffic. If there are no health monitors, traffic is routed to any endpoint, so some requests might fail if they happen to be routed to an unhealthy node. Routing traffic to unhealthy nodes may be a desired behavior in some scenarios, but it should be done intentionally by the LB based on available health information.

kralicky, Aug 19 '25 20:08

The create-monitor OCCM config option was added on purpose. Some OpenStack-based cloud providers don't support health monitors. Therefore, if you want to use externalTrafficPolicy: Local, you should enable this option in the OCCM config or in service annotations. This is also mentioned in the docs:

create-monitor Indicates whether or not to create a health monitor for the service load balancer. A health monitor required for services that declare externalTrafficPolicy: Local. Default: false

kayrus, Aug 28 '25 12:08

Yes, however having it default to false and requiring it to be set manually when externalTrafficPolicy=Local is a bit confusing. It is easy to miss this in the docs, and if you do, it can be very difficult to narrow down the issue when things don't work. I think it would make sense to have create-monitor be enabled by default for services that have externalTrafficPolicy=Local.

kralicky, Aug 28 '25 16:08

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot, Sep 27 '25 16:09

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot, Oct 27 '25 16:10

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

k8s-ci-robot, Oct 27 '25 16:10