KEP-4631: LoadBalancer Service Status Improvements, initial proposal
- One-line PR description: Initial proposal for KEP-4631
- Issue link: https://github.com/kubernetes/enhancements/issues/4631
- Other comments:
While updating the e2e load balancer tests after the final removal of
in-tree cloud providers, we have run into three problems:
1. The tests have hard-coded timeouts (that sometimes differ per
cloud provider) for deciding how long to wait for the cloud
provider to update the service. It would make much more sense for
the cloud provider to just provide information about its status
on the Service object, so the tests could just monitor that.
2. The tests recognize that not all cloud providers can implement
all load balancer features, but in the past this was handled by
hard-coding the information into the individual tests (e.g.,
`e2eskipper.SkipUnlessProviderIs("gce", "gke", "aws")`). These
skip rules no longer work in the providerless tree, and this
approach doesn't scale anyway. OTOH, we don't want to have to
provide a separate `Feature:` tag for each load balancer
subfeature, or have each cloud provider have to maintain their
own set of `-ginkgo.skip` rules. It would be better if the e2e
tests themselves could just figure out, somehow, whether they
were running under a cloud provider that intends to implement the
feature they are testing, or a cloud provider that doesn't.
3. In some cases, because the existing tests were only run on
certain clouds, it is not clear what the expected semantics are
on other clouds. For example, since `IPMode: Proxy` load
balancers can't preserve the client source IP in the way that
`ExternalTrafficPolicy: Local` expects, should they refuse to
provision a load balancer at all, or should they provision a load
balancer that fails to preserve the source IP?
This KEP proposes new additions to `service.Status.LoadBalancer` and
`service.Status.Conditions` to allow cloud providers to better
communicate the status of load balancer support and provisioning, and
new guidelines on how cloud providers should handle load balancers for
services that they cannot fully support.
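To make the idea concrete, here is a minimal Go sketch of how both sides of this could look, assuming a hypothetical `LoadBalancerProvisioning` condition type (one of the names discussed later in this thread); the condition name and its True-while-provisioning semantics are illustrative, not a settled API:

```go
package lbstatus

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// Provider side: record that load balancer provisioning is underway.
// "LoadBalancerProvisioning" is a hypothetical condition type; the KEP
// would define the real names and semantics.
func markProvisioning(svc *corev1.Service) {
	meta.SetStatusCondition(&svc.Status.Conditions, metav1.Condition{
		Type:    "LoadBalancerProvisioning",
		Status:  metav1.ConditionTrue,
		Reason:  "Provisioning",
		Message: "the cloud provider is creating the load balancer",
	})
}

// Test side: wait for the provider to report that provisioning has
// finished, instead of sleeping for a hard-coded, per-cloud timeout.
func waitForProvisioned(ctx context.Context, c kubernetes.Interface, ns, name string) error {
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, 5*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			svc, err := c.CoreV1().Services(ns).Get(ctx, name, metav1.GetOptions{})
			if err != nil {
				return false, err
			}
			// Done once the provider has flipped the condition back to False.
			return meta.IsStatusConditionFalse(svc.Status.Conditions, "LoadBalancerProvisioning"), nil
		})
}
```

With something along these lines, the hard-coded per-cloud timeouts from problem 1 collapse into a single generic wait on the Service's own status.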
/assign @aojea @thockin
Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with `/test all`.
/sig cloud-provider /sig testing
Updates:
- Updated for review comments, started filling in the rest of the template ("Test Plan", "Graduation", "Version Skew Strategy", PRR, etc.)
- Removed `UNRESOLVED auto-skipping`: Antonio agrees it's reasonable to have "tri-state" tests (pass/fail/skip) as long as the semantics are explicit. In some cases this will require rewriting the tests to put the "objectionable" bits at the start, so we can always skip immediately after the initial LB provisioning if the cloud doesn't support the feature (see the sketch after this list).
- The version skew section made me think about the behavior when a cloud provider is updated and finds itself in a cluster with pre-existing "bad" load balancers, and whether we should maybe add an explicit "unsupported" or "broken" condition.
- Added a summary section to the top of the "Expected Behavior When the Cloud Provider Doesn't Know That It Can't Implement a Load Balancer" section, which I think is the big open question before this becomes `implementable`.
- Added detailed notes to the e2e Test Plan section clarifying what skips will be needed for which tests. Also, while writing out the Test Plan section, I realized that to avoid regressing coverage, we're basically going to have to fork the LB tests, so we can have one hacky GCE-and-kind-only set and one KEP-4631 set, which will initially be Alpha but will eventually replace the hacky ones.
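As a rough sketch of that tri-state pattern: `skipIfNotServing` below is a hypothetical helper, and `LoadBalancerServing` is one of the condition names floated later in this thread, not a finalized API. The test provisions the load balancer as usual, calls the helper immediately afterward, and skips rather than fails when the cloud provider has reported that it can't serve the Service.

```go
package e2e

import (
	"github.com/onsi/ginkgo/v2"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// skipIfNotServing implements the "tri-state" (pass/fail/skip) pattern:
// call it right after the initial LB provisioning, before any assertions
// the cloud might legitimately not support. "LoadBalancerServing" is a
// hypothetical condition type used only for illustration.
func skipIfNotServing(svc *corev1.Service) {
	cond := meta.FindStatusCondition(svc.Status.Conditions, "LoadBalancerServing")
	if cond != nil && cond.Status == metav1.ConditionFalse {
		ginkgo.Skip("cloud provider does not support this feature: " + cond.Message)
	}
}
```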
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the PR is closed
You can:
- Mark this PR as fresh with `/remove-lifecycle stale`
- Close this PR with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Based on this comment from Tim, I'm assuming the work will happen for 1.32. Please don't forget to fill in the PRR, and its file (see https://github.com/kubernetes/enhancements/blob/master/keps/prod-readiness/template/nnnn.yaml) for this KEP. Feel free to assign me for that approval.
OK, big update, mostly resolving the "when the cloud provider can't implement a feature" question. I went with `service.spec.requiredLoadBalancerFeatures`, which I think works best overall.
(I considered dropping that and trying to push through just the `LoadBalancerProvisioning` and `LoadBalancerServing` conditions, and saving the "feature discovery" part for later, but it's convenient for clients to be able to assume that if the cloud provider sets the provisioning/serving conditions, then that means it also supports `RequiredLoadBalancerFeatures`. If we did just the conditions now, then we'd have to add another condition (or something) later to allow the cloud provider to signal its support for feature discovery. Though we could still do that...)
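To illustrate what I have in mind (a hedged sketch only: the Go field shape, the feature-name string, and the `checkRequiredFeatures` helper are all made up for this example, not a settled API):

```go
package cloudprovider

import "fmt"

// ServiceSpecFragment sketches the shape of the proposed field:
// a list of features the load balancer must provide. A cloud provider
// that cannot provide one of them would refuse to provision, rather
// than silently provisioning a broken load balancer.
type ServiceSpecFragment struct {
	RequiredLoadBalancerFeatures []string `json:"requiredLoadBalancerFeatures,omitempty"`
}

// supportedFeatures would be declared by each cloud provider; the
// feature name here is a made-up example.
var supportedFeatures = map[string]bool{
	"PreserveClientIP": true,
}

// checkRequiredFeatures is a hypothetical provider-side gate, run
// before provisioning begins.
func checkRequiredFeatures(required []string) error {
	for _, f := range required {
		if !supportedFeatures[f] {
			return fmt.Errorf("required load balancer feature %q is not supported by this cloud provider", f)
		}
	}
	return nil
}
```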
A reminder that the PRR file is required for this PR. /hold to make sure we don't merge without it.
Yeah, I haven't done the PRR part of the KEP yet because we hadn't actually decided on the scope of the work yet. (The enhancements freeze snuck up on me and I'm assuming I've missed it for 1.32.)
There's still time, today and tomorrow. The freeze is 19:00 PDT Thursday 10th October 2024 :wink:
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: danwinship, thockin
The full list of commands accepted by this bot can be found here.
The pull request process is described here.
- ~~keps/sig-network/OWNERS~~ [danwinship,thockin]
Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.
I let this slip, and now it seems too late, but I am OK with the vast majority of this proposal.
/approve
/hold
The higher risk here is adding fields that nobody implements. @danwinship and I talked offline, and Dan has a good idea for simplifying this proposal that can still achieve the goals and solve the issues detailed here.