serving
Fix marking activator endpoints populated into sks condition accurately
Fixes #15278
- Mark activator endpoints as populated in the SKS condition accurately.
While reconciling an SKS, the reconciler reconciles the public Endpoints and, as the last step, marks ActivatorEndpointsPopulated in the status conditions. If there are no backends, or if we are in Proxy mode, the activator is considered to back this revision. In fact, however, if no activator is found, the mode is forced to "Serve" and the activator is not put in the path. Would it be better to exclude this scenario when marking the "ActivatorEndpointsPopulated" condition?
// Otherwise check how long SKS was in proxy mode.
// Compute the difference between time we've been proxying with the timeout.
// If it's positive, that's the time we need to sleep, if negative -- we
// can scale to zero.
pf := sks.Status.ProxyFor()
to := cfgAS.ScaleToZeroGracePeriod - pf
if to <= 0 {
    logger.Info("Fast path scaling to 0, in proxy mode for: ", pf)
    return desiredScale, true
}
The condition of type "ActivatorEndpointsPopulated" in the SKS status is used to determine whether the scale-to-zero grace period has elapsed. If there are no activator endpoints, would it be better not to scale down to 0?
Release Note
NONE
The committers listed above are authorized under a signed CLA.
- :white_check_mark: login: yenniechen / name: yennie (37fe895664cfedd57ee8bb75e852bd199c399d69, ba729e4dd14bf538fe67b9cdc75e0357ae7ba95a, 075925ff67aaa32de3fe0efd3e00ab52988818ac, 8c77066d4c0ddd58fd8d707303a06a53ca78be70)
Welcome @yingyueshi! It looks like this is your first PR to knative/serving 🎉
Hi @yingyueshi. Thanks for your PR.
I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
/ok-to-test
So basically we never updated the mode to Serve and kept using the SKS's initial mode (Proxy), hence the problem. The unit test fails because we never expected the SKS status condition to be False:
Extra status update for networking.internal.knative.dev/v1alpha1, Kind=ServerlessService/force/serve (-extra, +prevState):
...
&v1alpha1.ServerlessService{
TypeMeta: {},
Conditions: v1.Conditions{
{
Type: "ActivatorEndpointsPopulated",
- Status: "False",
+ Status: "True",
Severity: "Info",
... // 1 ignored and 2 identical fields
},
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: yenniechen Once this PR has been reviewed and has the lgtm label, please assign dsimansk for approval. For more information see the Kubernetes Code Review Process.
I think there is another case not covered:
{
    Name: "keep in serve mode, no endpoints",
    Key: "force/proxy",
    Objects: []runtime.Object{
        SKS("force", "proxy", markHappy, WithPubService, WithPrivateService, WithDeployRef("bar")),
        deploy("force", "bar"),
        svcpub("force", "proxy"),
        svcpriv("force", "proxy"),
        endpointspub("force", "proxy", withFilteredPorts(networking.BackendHTTPPort)),
        endpointspriv("force", "proxy" /* revision has no endpoints, force proxy mode */),
        activatorEndpoints(),
    },
    WantStatusUpdates: []clientgotesting.UpdateActionImpl{{
        Object: SKS("force", "proxy", WithPubService, WithPrivateService, WithDeployRef("bar"),
            // Changes from above.
            markNoEndpoints),
    }},
},
This will pass by setting:
logger.go:146: 2024-05-31T16:22:16.718+0300 DEBUG serverlessservice/reconciler.go:333 Updating status with: v1alpha1.ServerlessServiceStatus{
Status: v1.Status{
ObservedGeneration: 0,
Conditions: v1.Conditions{
{
Type: "ActivatorEndpointsPopulated",
- Status: "False",
+ Status: "True",
I see we accept that condition in one similar test, "OnCreate-no-activator-eps-proxy", where the activator does not have endpoints. Is there a reason for that? What does it mean when we don't have any endpoints?
So basically we never updated the mode to Serve and kept using the SKS's initial mode (Proxy), hence the problem. The unit test fails because we never expected the SKS status condition to be False:
Extra status update for networking.internal.knative.dev/v1alpha1, Kind=ServerlessService/force/serve (-extra, +prevState):
...
&v1alpha1.ServerlessService{
TypeMeta: {},
Conditions: v1.Conditions{
{
Type: "ActivatorEndpointsPopulated",
- Status: "False",
+ Status: "True",
Severity: "Info",
... // 1 ignored and 2 identical fields
},
You are right. The condition above (type "ActivatorEndpointsPopulated") is used to determine whether the scale-to-zero grace period has elapsed. If there are no activator endpoints, would it be better not to scale down to 0?
// Otherwise check how long SKS was in proxy mode.
// Compute the difference between time we've been proxying with the timeout.
// If it's positive, that's the time we need to sleep, if negative -- we
// can scale to zero.
pf := sks.Status.ProxyFor()
to := cfgAS.ScaleToZeroGracePeriod - pf
if to <= 0 {
    logger.Info("Fast path scaling to 0, in proxy mode for: ", pf)
    return desiredScale, true
}
// ProxyFor returns how long it has been since Activator was moved
// to the request path.
func (sss *ServerlessServiceStatus) ProxyFor() time.Duration {
    cond := sss.GetCondition(ActivatorEndpointsPopulated)
    if cond == nil || cond.Status != corev1.ConditionTrue {
        return 0
    }
    return time.Since(cond.LastTransitionTime.Inner.Time)
}
If there are no activator endpoints but the revision has endpoints, is it better not to scale to zero when there is no traffic? If we scale to zero at that point, we may drop some traffic, because neither activator endpoints nor revision endpoints would be present in the public Service.
When there are neither activator endpoints nor revision endpoints, there are already zero endpoints, so there is nothing left to scale to zero. Whether or not we mark the activator endpoints as populated seems to have no effect in that case. Would it be better for the condition to simply reflect whether the activator is actually in the path?
In addition, I searched the code globally and found that this SKS condition (type "ActivatorEndpointsPopulated") is only used to compute the scale-to-zero grace period.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 84.78%. Comparing base (c2d0af1) to head (8c77066). Report is 334 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #15279 +/- ##
==========================================
+ Coverage 84.11% 84.78% +0.66%
==========================================
Files 213 218 +5
Lines 16783 13480 -3303
==========================================
- Hits 14117 11429 -2688
+ Misses 2315 1685 -630
- Partials 351 366 +15
For the case where TBC != -1, there is logic that will not allow the scale-down:
if resolveTBC(ctx, pa) != -1 {
    // if TBC is -1 activator is guaranteed to already be in the path.
    // Otherwise, probe to make sure Activator is in path.
    r, err = ks.activatorProbe(pa, ks.transport)
    logger.Infof("Probing activator = %v, err = %v", r, err)
}
If r is false, then ProxyFor() is never called. For the case where TBC = -1, though, I think we will scale down to zero while the activator might not be around and the endpoints are populated (without this PR). I think we should not scale down if the activator is not in the path (when desiredScale is 0), right? @dprotaso wdyth?
I am also wondering about the answer. For the case where TBC = -1 and there are no ready activator endpoints, with this PR the scale-to-0 handling will be requeued until the activator becomes ready. Once the activator has ready endpoints, the SKS controller will observe the change and mark the activator endpoints as populated on the SKS, and then the revision pods will be scaled to 0. How should this be designed?
@dprotaso wdyth?
@dprotaso gentle ping.
I think we should not scale down if activator is not in the path right (when desiredScale is 0), @dprotaso wdyth?
Yeah I agree
I think we should update the PR so that we are only marking ActivatorEndpointsPopulated when we are actually wiring the activator in the data path with ready endpoints.
Also thanks for the investigation everyone - it helped me be more informed on the path to take here.
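One way to read the suggestion above, as a rough sketch only: the condition is marked only when the activator is both wanted in the path and actually has ready endpoints. The function name, the `proxyMode` flag, and the endpoint counts are hypothetical and chosen for illustration; the real reconciler works on Endpoints objects and SKS status helpers that are not reproduced here.

```go
package main

import "fmt"

// shouldMarkActivatorPopulated is a hypothetical helper. The activator is
// wanted in the data path when the SKS is in proxy mode or the revision has
// no ready endpoints, and it is only really wired in when it has ready
// endpoints of its own.
func shouldMarkActivatorPopulated(proxyMode bool, readyRevisionEps, readyActivatorEps int) bool {
	wantActivator := proxyMode || readyRevisionEps == 0
	return wantActivator && readyActivatorEps > 0
}

func main() {
	fmt.Println(shouldMarkActivatorPopulated(true, 0, 3))  // true: proxying through a ready activator
	fmt.Println(shouldMarkActivatorPopulated(true, 0, 0))  // false: no activator endpoints to wire in
	fmt.Println(shouldMarkActivatorPopulated(false, 2, 3)) // false: the revision serves directly
	fmt.Println(shouldMarkActivatorPopulated(false, 0, 3)) // true: no backends, the activator backs the revision
}
```

With a rule of this shape, the scale-to-zero grace period would only start counting once the activator is actually able to receive traffic.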
hey @yenniechen just following up
So do you have any suggestions? I thought this PR would solve this problem.
@dprotaso
This Pull Request is stale because it has been open for 90 days with
no activity. It will automatically close after 30 more days of
inactivity. Reopen with /reopen. Mark as fresh by adding the
comment /remove-lifecycle stale.