
KEP-4631: LoadBalancer Service Status Improvements, initial proposal

danwinship opened this pull request 9 months ago

  • One-line PR description: Initial proposal for KEP-4631
  • Issue link: https://github.com/kubernetes/enhancements/issues/4631
  • Other comments:
While updating the e2e load balancer tests after the final removal of
in-tree cloud providers, we have run into three problems:

  1. The tests have hard-coded timeouts (that sometimes differ per
     cloud provider) for deciding how long to wait for the cloud
     provider to update the service. It would make much more sense for
     the cloud provider to just provide information about its status
     on the Service object, so the tests could just monitor that.

  2. The tests recognize that not all cloud providers can implement
     all load balancer features, but in the past this was handled by
     hard-coding the information into the individual tests (e.g.,
     `e2eskipper.SkipUnlessProviderIs("gce", "gke", "aws")`). These
     skip rules no longer work in the providerless tree, and this
     approach doesn't scale anyway. OTOH, we don't want to have to
     provide a separate `Feature:` tag for each load balancer
     subfeature, or have each cloud provider have to maintain their
     own set of `-ginkgo.skip` rules. It would be better if the e2e
     tests themselves could just figure out, somehow, whether they
     were running under a cloud provider that intends to implement the
     feature they are testing, or a cloud provider that doesn't.

  3. In some cases, because the existing tests were only run on
     certain clouds, it is not clear what the expected semantics are
     on other clouds. For example, since `IPMode: Proxy` load
     balancers can't preserve the client source IP in the way that
     `ExternalTrafficPolicy: Local` expects, should they refuse to
     provision a load balancer at all, or should they provision a load
     balancer that fails to preserve the source IP?

This KEP proposes new additions to `service.Status.LoadBalancer` and
`service.Status.Conditions` to allow cloud providers to better
communicate the status of load balancer support and provisioning, and
new guidelines on how cloud providers should handle load balancers for
services that they cannot fully support.
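To make the proposal concrete, a client consuming such conditions might look like the following sketch. The condition name `LoadBalancerProvisioning` is one of the conditions discussed in this KEP, but the exact types and semantics are defined by the KEP text, not this snippet:

```go
package main

import "fmt"

// Condition is a simplified stand-in for the metav1.Condition entries
// the KEP proposes adding to service.Status.Conditions.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

type ServiceStatus struct {
	Conditions []Condition
}

// provisioned reports whether the hypothetical
// "LoadBalancerProvisioning" condition has reached status "True",
// which is what an e2e test would watch for instead of using a
// hard-coded timeout.
func provisioned(st ServiceStatus) bool {
	for _, c := range st.Conditions {
		if c.Type == "LoadBalancerProvisioning" && c.Status == "True" {
			return true
		}
	}
	return false
}

func main() {
	st := ServiceStatus{Conditions: []Condition{
		{Type: "LoadBalancerProvisioning", Status: "True"},
	}}
	fmt.Println(provisioned(st))
}
```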

/assign @aojea @thockin

danwinship avatar May 13 '24 12:05 danwinship

Skipping CI for Draft Pull Request. If you want CI signal for your change, please convert it to an actual PR. You can still manually trigger a test run with /test all

k8s-ci-robot avatar May 13 '24 12:05 k8s-ci-robot

/sig cloud-provider
/sig testing

danwinship avatar May 13 '24 12:05 danwinship

Updates:

  • Updated for review comments, started filling in the rest of the template ("Test Plan", "Graduation", "Version Skew Strategy", PRR, etc)
  • Removed UNRESOLVED auto-skipping: Antonio agrees it's reasonable to have "tri-state" tests (pass/fail/skip) as long as the semantics are explicit. In some cases this will require rewriting the tests to put the "objectionable" bits at the start, so we can always skip immediately after the initial LB provisioning if the cloud doesn't support the feature.
  • The version skew section made me think about the behavior when a cloud provider is updated and finds itself in a cluster with pre-existing "bad" load balancers, and whether we should maybe add an explicit "unsupported" or "broken" condition.
  • Added a summary section to the top of the "Expected Behavior When the Cloud Provider Doesn't Know That It Can't Implement a Load Balancer" section, which I think is the big open question before this becomes implementable.
  • Added detailed notes to the e2e Test Plan section clarifying what skips will be needed for which tests. Also, while writing out the Test Plan section, I realized that to avoid regressing coverage, we're basically going to have to fork the LB tests, so we can have one hacky GCE-and-kind-only set, and one KEP-4631 set, which will initially be Alpha, but which will eventually replace the hacky ones.
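The "tri-state" test flow described above can be sketched as follows. This is an illustrative outline only, not the real e2e framework API; the names are hypothetical:

```go
package main

import "fmt"

// Outcome models the tri-state result Antonio agreed to: a test
// passes, fails, or skips itself once it learns the cloud provider
// doesn't support the feature under test.
type Outcome int

const (
	Pass Outcome = iota
	Fail
	Skip
)

func (o Outcome) String() string {
	switch o {
	case Pass:
		return "pass"
	case Fail:
		return "fail"
	default:
		return "skip"
	}
}

// runFeatureTest puts the "objectionable" support check at the start,
// so the test can skip immediately after the initial LB provisioning
// when the cloud doesn't support the feature. (Hypothetical helper.)
func runFeatureTest(providerSupports bool, featureWorks func() bool) Outcome {
	if !providerSupports {
		return Skip
	}
	if featureWorks() {
		return Pass
	}
	return Fail
}

func main() {
	fmt.Println(runFeatureTest(false, func() bool { return true }))
	fmt.Println(runFeatureTest(true, func() bool { return true }))
}
```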

danwinship avatar May 24 '24 14:05 danwinship

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Sep 04 '24 19:09 k8s-triage-robot

Based on this comment from Tim, I'm assuming the work will happen for 1.32. Please don't forget to fill in the PRR and its file (see https://github.com/kubernetes/enhancements/blob/master/keps/prod-readiness/template/nnnn.yaml) for this KEP. Feel free to assign me for that approval.

soltysh avatar Oct 01 '24 11:10 soltysh

OK, big update, mostly resolving the "when the cloud provider can't implement a feature" question. I went with `service.spec.requiredLoadBalancerFeatures`, which I think works best overall.
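The matching logic this field implies is simple: the cloud provider compares the features the Service declares against the set it can implement. The field name comes from the KEP discussion, but the feature names and helper below are made up for illustration:

```go
package main

import "fmt"

// ServiceSpec sketches the proposed spec field; only the field name
// requiredLoadBalancerFeatures is taken from the KEP discussion.
type ServiceSpec struct {
	RequiredLoadBalancerFeatures []string
}

// supports reports whether a provider implementing the given feature
// set can satisfy the Service's declared requirements. If not, the
// provider can decline to provision (and report why via conditions)
// rather than silently provisioning a broken load balancer.
func supports(provider map[string]bool, spec ServiceSpec) bool {
	for _, f := range spec.RequiredLoadBalancerFeatures {
		if !provider[f] {
			return false
		}
	}
	return true
}

func main() {
	// "SourceIPPreservation" is a hypothetical feature name, standing
	// in for the ExternalTrafficPolicy: Local case from problem 3.
	provider := map[string]bool{"SourceIPPreservation": true}
	spec := ServiceSpec{RequiredLoadBalancerFeatures: []string{"SourceIPPreservation"}}
	fmt.Println(supports(provider, spec))
}
```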

danwinship avatar Oct 02 '24 15:10 danwinship

resolving the "when the cloud provider can't implement a feature" question

(I considered dropping that and trying to push through just the LoadBalancerProvisioning and LoadBalancerServing conditions, and saving the "feature discovery" part for later, but it's convenient for clients to be able to assume that if the cloud provider sets the provisioning/serving conditions, then that means it also supports RequiredLoadBalancerFeatures. If we did just the conditions now, then we'd have to add another condition (or something) later to allow the cloud provider to signal its support for feature discovery. Though we could still do that...)
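The coupling described in the comment above amounts to a single inference rule: a provider that sets the provisioning/serving conditions is assumed to also honor `RequiredLoadBalancerFeatures`, so no separate discovery signal is needed. A minimal sketch of that rule (hypothetical helper names):

```go
package main

import "fmt"

// supportsFeatureDiscovery encodes the assumption discussed above: if
// the cloud provider sets both the LoadBalancerProvisioning and
// LoadBalancerServing conditions, a client may infer that it also
// supports RequiredLoadBalancerFeatures.
func supportsFeatureDiscovery(conditionTypes []string) bool {
	seen := map[string]bool{}
	for _, t := range conditionTypes {
		seen[t] = true
	}
	return seen["LoadBalancerProvisioning"] && seen["LoadBalancerServing"]
}

func main() {
	fmt.Println(supportsFeatureDiscovery([]string{
		"LoadBalancerProvisioning", "LoadBalancerServing",
	}))
	fmt.Println(supportsFeatureDiscovery([]string{
		"LoadBalancerProvisioning",
	}))
}
```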

danwinship avatar Oct 02 '24 22:10 danwinship

A reminder that the PRR file is required for this PR.

/hold to make sure we don't merge this without it

soltysh avatar Oct 08 '24 12:10 soltysh

Yeah, I haven't done the PRR part of the KEP yet because we hadn't actually decided on the scope of the work yet. (The enhancements freeze snuck up on me, and I'm assuming I've missed it for 1.32.)

danwinship avatar Oct 08 '24 13:10 danwinship

Yeah, I haven't done the PRR part of the KEP yet because we hadn't actually decided on the scope of the work yet. (The enhancements freeze snuck up on me, and I'm assuming I've missed it for 1.32.)

There's still time, today and tomorrow. The freeze is 19:00 PDT Thursday 10th October 2024 :wink:

soltysh avatar Oct 09 '24 10:10 soltysh

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot avatar Oct 10 '24 23:10 k8s-ci-robot

I let this slip, and now it seems too late, but I am OK with the vast majority of this proposal.

/approve

/hold

The higher risk here is adding fields that nobody implements. @danwinship and I talked offline, and Dan has a good idea to simplify this proposal that can achieve the goals and solve the issues detailed here.

aojea avatar Oct 11 '24 12:10 aojea