[occm] LoadBalancer with externalTrafficPolicy=Local does not create health monitors by default
/kind bug
What happened:
The default behavior of OCCM is not to create health monitors for load balancers, unless they are enabled globally via the config file or for individual LoadBalancers using the `loadbalancer.openstack.org/enable-health-monitor` annotation.
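For reference, the per-service opt-in looks roughly like this (a minimal sketch; the service name, selector, and ports are hypothetical):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app  # hypothetical name
  annotations:
    # Per-service opt-in when create-monitor is not enabled globally:
    loadbalancer.openstack.org/enable-health-monitor: "true"
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
```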
With the default OCCM configuration, if a LoadBalancer service is created with `externalTrafficPolicy: Local`, the OpenStack LB will be configured incorrectly and with no warning. Importantly, the failure mode can be extremely confusing and difficult to pin down: depending on replica count and pod distribution, not every workload behind a LoadBalancer configured this way will necessarily fail, and some workloads might fail in different ways than others or otherwise exhibit unexpected behavior.
What you expected to happen:
OCCM should either:
- Error when attempting to reconcile a LoadBalancer service with `externalTrafficPolicy: Local` if health monitors are disabled, or
- Always create health monitors when reconciling a LoadBalancer service with `externalTrafficPolicy: Local`, regardless of the presence of the annotation or the config option.
How to reproduce it:
- Create a cluster with more than one worker node and OCCM left at its default configuration (health monitors not enabled globally or per-service)
- Deploy a workload with 1 replica (or, more generally, fewer replicas than worker nodes) behind a LoadBalancer service with `externalTrafficPolicy: Local`, e.g. the manifests sketched below
- Repeatedly attempt to make a request to the service
- Observe that only one out of every [# worker nodes] requests goes through
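A minimal reproduction sketch, assuming any small HTTP echo server (the names and the agnhost test image are illustrative, not prescribed by OCCM):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo  # hypothetical
spec:
  replicas: 1  # fewer replicas than worker nodes
  selector:
    matchLabels:
      app: echo
  template:
    metadata:
      labels:
        app: echo
    spec:
      containers:
        - name: echo
          image: registry.k8s.io/e2e-test-images/agnhost:2.39
          args: ["netexec", "--http-port=8080"]  # simple HTTP responder
---
apiVersion: v1
kind: Service
metadata:
  name: echo
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: echo
  ports:
    - port: 80
      targetPort: 8080
```

Once the LB address is assigned, a loop like `for i in $(seq 10); do curl -sm2 http://<lb-ip>/; done` should show most requests hanging or failing, because the LB also forwards to nodes that have no local endpoint.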
Anything else we need to know?:
Would be happy to contribute a fix for this
Environment:
- `openstack-cloud-controller-manager` (or other related binary) version: 1.31.1
- OpenStack version: yoga
- Others:
> OpenStack LB will be configured incorrectly and with no warning
could you please provide more details on what exactly is incorrect?
At first glance it's a duplicate of #1770
See also #2869
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
> could you please provide more details on what exactly is incorrect?
`externalTrafficPolicy: Local` only makes sense in conjunction with health monitors, which determine which nodes are healthy and can receive traffic. If there are no health monitors, traffic is routed to any node, so some requests might fail if they happen to be routed to a node without a healthy local endpoint.
Routing traffic to unhealthy nodes may be a desired behavior in some scenarios, but it should be done intentionally by the LB based on available health information.
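For context, Kubernetes already exposes the signal a monitor needs: for `externalTrafficPolicy: Local` services it allocates `spec.healthCheckNodePort`, and kube-proxy serves `/healthz` on that port, returning 200 only on nodes that host a ready local endpoint. A sketch of the relevant service fields (the port number is illustrative; it is allocated from the NodePort range):

```yaml
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  # Allocated by Kubernetes; kube-proxy answers GET /healthz on this port
  # with 200 only on nodes with a ready local endpoint, 503 otherwise.
  healthCheckNodePort: 32156  # illustrative value
```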
The `create-monitor` OCCM config option was added on purpose: some OpenStack-based cloud providers don't support health monitors. Therefore, if you want to use `externalTrafficPolicy: Local`, you should enable this option in the OCCM config or in service annotations. This is also mentioned in the docs:
> `create-monitor`: Indicates whether or not to create a health monitor for the service load balancer. A health monitor is required for services that declare `externalTrafficPolicy: Local`. Default: `false`
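For completeness, the global opt-in lives in the `[LoadBalancer]` section of the OCCM cloud config; a minimal sketch (the tuning values are illustrative, not recommendations):

```ini
[LoadBalancer]
create-monitor = true
# Optional knobs documented alongside create-monitor:
monitor-delay = 5s
monitor-timeout = 3s
monitor-max-retries = 1
```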
Yes, however having it default to `false` and requiring it to be set manually when `externalTrafficPolicy=Local` is used is a bit confusing. It is easy to miss this in the docs, and if you do, it can be very difficult to narrow down the issue when things don't work. I think it would make sense to have `create-monitor` be enabled by default for services that have `externalTrafficPolicy=Local`.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.