Fix marking activator endpoints populated into sks condition accurately

Open yenniechen opened this issue 1 year ago • 20 comments

Fixes #15278

  • marking activator endpoints populated into sks condition accurately

While reconciling an SKS, the reconciler reconciles the public Endpoints and, as its last step, marks ActivatorEndpointsPopulated in the status conditions. If there are no backends, or if the SKS is in proxy mode, the activator is considered to back the revision. In fact, though, if no activator endpoints are found, the mode is forced to "Serve" and the activator is not put in the path. Would it be better to exclude this scenario when marking the "ActivatorEndpointsPopulated" condition?

 // Otherwise check how long SKS was in proxy mode.
 // Compute the difference between time we've been proxying with the timeout.
 // If it's positive, that's the time we need to sleep, if negative -- we
 // can scale to zero.
 pf := sks.Status.ProxyFor()
 to := cfgAS.ScaleToZeroGracePeriod - pf
 if to <= 0 {
 	logger.Info("Fast path scaling to 0, in proxy mode for: ", pf)
 	return desiredScale, true
 }

The condition of type "ActivatorEndpointsPopulated" in the SKS status is used to determine whether the grace period for scaling to 0 has elapsed. If there are no endpoints, would it be best not to scale down to 0?

Release Note

NONE

yenniechen avatar May 30 '24 02:05 yenniechen

CLA Signed

The committers listed above are authorized under a signed CLA.

  • :white_check_mark: login: yenniechen / name: yennie (37fe895664cfedd57ee8bb75e852bd199c399d69, ba729e4dd14bf538fe67b9cdc75e0357ae7ba95a, 075925ff67aaa32de3fe0efd3e00ab52988818ac, 8c77066d4c0ddd58fd8d707303a06a53ca78be70)

Welcome @yingyueshi! It looks like this is your first PR to knative/serving 🎉

knative-prow[bot] avatar May 30 '24 02:05 knative-prow[bot]

Hi @yingyueshi. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

knative-prow[bot] avatar May 30 '24 02:05 knative-prow[bot]

/ok-to-test

dprotaso avatar May 30 '24 19:05 dprotaso

So basically we never updated the mode to serve and kept using only the initial SKS mode (proxy), hence the problem. The unit test fails because we never checked that the SKS status field is false:

Extra status update for networking.internal.knative.dev/v1alpha1, Kind=ServerlessService/force/serve (-extra, +prevState):
...
          &v1alpha1.ServerlessService{
          	TypeMeta:   {},
          			Conditions: v1.Conditions{
          				{
          					Type:     "ActivatorEndpointsPopulated",
        - 					Status:   "False",
        + 					Status:   "True",
          					Severity: "Info",
          					... // 1 ignored and 2 identical fields
          				},

skonto avatar May 31 '24 11:05 skonto

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: yenniechen

Once this PR has been reviewed and has the lgtm label, please assign dsimansk for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

knative-prow[bot] avatar May 31 '24 13:05 knative-prow[bot]

I think there is another case not covered:

 {
		Name: "keep in serve mode, no endpoints",
		Key:  "force/proxy",
		Objects: []runtime.Object{
			SKS("force", "proxy", markHappy, WithPubService, WithPrivateService, WithDeployRef("bar")),
			deploy("force", "bar"),
			svcpub("force", "proxy"),
			svcpriv("force", "proxy"),
			endpointspub("force", "proxy", withFilteredPorts(networking.BackendHTTPPort)),
			endpointspriv("force", "proxy" /* revision has no endpoints, force proxy mode */),
			activatorEndpoints(),
		},

		WantStatusUpdates: []clientgotesting.UpdateActionImpl{{
			Object: SKS("force", "proxy", WithPubService, WithPrivateService, WithDeployRef("bar"),
				// Changes from above.
				markNoEndpoints),
		}},
	},

This will pass by setting:

    logger.go:146: 2024-05-31T16:22:16.718+0300	DEBUG	serverlessservice/reconciler.go:333	Updating status with:   v1alpha1.ServerlessServiceStatus{
          	Status: v1.Status{
          		ObservedGeneration: 0,
          		Conditions: v1.Conditions{
          			{
          				Type:               "ActivatorEndpointsPopulated",
        - 				Status:             "False",
        + 				Status:             "True",

I see we accept that condition in a similar test, "OnCreate-no-activator-eps-proxy", where the activator does not have endpoints. Is there any reason for that? What does it mean when we don't have any endpoints?

skonto avatar May 31 '24 13:05 skonto

So basically we never updated the mode to serve and kept using only the initial SKS mode (proxy), hence the problem. The unit test fails because we never checked that the SKS status field is false:

Extra status update for networking.internal.knative.dev/v1alpha1, Kind=ServerlessService/force/serve (-extra, +prevState):
...
          &v1alpha1.ServerlessService{
          	TypeMeta:   {},
          			Conditions: v1.Conditions{
          				{
          					Type:     "ActivatorEndpointsPopulated",
        - 					Status:   "False",
        + 					Status:   "True",
          					Severity: "Info",
          					... // 1 ignored and 2 identical fields
          				},

You are right. The condition above (type "ActivatorEndpointsPopulated") is used to determine whether the grace period for scaling to 0 has elapsed. If there are no endpoints, would it be best not to scale down to 0?

 // Otherwise check how long SKS was in proxy mode.
 // Compute the difference between time we've been proxying with the timeout.
 // If it's positive, that's the time we need to sleep, if negative -- we
 // can scale to zero.
 pf := sks.Status.ProxyFor()
 to := cfgAS.ScaleToZeroGracePeriod - pf
 if to <= 0 {
 	logger.Info("Fast path scaling to 0, in proxy mode for: ", pf)
 	return desiredScale, true
 }

 // ProxyFor returns how long it has been since Activator was moved
 // to the request path.
 func (sss *ServerlessServiceStatus) ProxyFor() time.Duration {
 	cond := sss.GetCondition(ActivatorEndpointsPopulated)
 	if cond == nil || cond.Status != corev1.ConditionTrue {
 		return 0
 	}
 	return time.Since(cond.LastTransitionTime.Inner.Time)
 }

If there are no activator endpoints but there are revision endpoints, would it be better not to scale to zero while there is no traffic? If we scale to zero at that point, we may drop some traffic, because neither activator endpoints nor revision endpoints are added to the public Service.

In the case where there are neither activator endpoints nor revision endpoints, there are zero endpoints and nothing is left to scale to zero. Whether or not we mark the activator as populated seems to have no effect. Would it be better for the marking to simply follow the fact of whether the activator is really in the path?

In addition, I searched the code globally and found that this SKS condition (type "ActivatorEndpointsPopulated") is only used to compute the grace period for scaling to zero.

yenniechen avatar May 31 '24 14:05 yenniechen

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 84.78%. Comparing base (c2d0af1) to head (8c77066). Report is 334 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15279      +/-   ##
==========================================
+ Coverage   84.11%   84.78%   +0.66%     
==========================================
  Files         213      218       +5     
  Lines       16783    13480    -3303     
==========================================
- Hits        14117    11429    -2688     
+ Misses       2315     1685     -630     
- Partials      351      366      +15     


codecov[bot] avatar Jun 02 '24 23:06 codecov[bot]

For the case where TBC != -1 there is logic that will not allow the scale down:

		if resolveTBC(ctx, pa) != -1 {
			// if TBC is -1 activator is guaranteed to already be in the path.
			// Otherwise, probe to make sure Activator is in path.
			r, err = ks.activatorProbe(pa, ks.transport)
			logger.Infof("Probing activator = %v, err = %v", r, err)
		}

If r is false then it is not going to call ProxyFor(). For the case where TBC = -1, though, I think we will scale down to zero while the activator might not be around and endpoints are populated (without this PR). I think we should not scale down if activator is not in the path right (when desiredScale is 0), @dprotaso wdyth?

skonto avatar Jun 03 '24 11:06 skonto

I am also wondering about the answer. For the case TBC = -1 with no ready activator endpoints, with this PR the scale-to-0 handling will be requeued until the activator becomes ready. Once the activator has ready endpoints, the SKS controller will observe the change and mark the activator as populated on the SKS, and then the revision pods will be scaled to 0. How should it be designed?

yenniechen avatar Jun 04 '24 08:06 yenniechen

@dprotaso wdyth?

skonto avatar Jun 14 '24 07:06 skonto

@dprotaso gentle ping.

skonto avatar Jun 19 '24 07:06 skonto

I think we should not scale down if activator is not in the path right (when desiredScale is 0), @dprotaso wdyth?

Yeah I agree

dprotaso avatar Jun 25 '24 19:06 dprotaso

I think we should update the PR so that we are only marking ActivatorEndpointsPopulated when we are actually wiring the activator in the data path with ready endpoints.
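
A rough sketch of that rule, assuming the mode selection described in the PR description (no revision backends or proxy mode puts the activator in the path, but no activator endpoints forces "Serve"). The `mode` type and the `effectiveMode` and `activatorPopulated` functions are hypothetical and only illustrate the decision, not the reconciler implementation.

```go
package main

import "fmt"

// mode is a hypothetical stand-in for the SKS operation mode.
type mode string

const (
	modeServe mode = "Serve"
	modeProxy mode = "Proxy"
)

// effectiveMode returns the mode that ends up wired into the public
// Service, given the requested mode and the number of ready endpoints.
func effectiveMode(requested mode, revisionReady, activatorReady int) mode {
	if activatorReady == 0 {
		// No activator found: the mode is forced to Serve.
		return modeServe
	}
	if revisionReady == 0 {
		// No backends: the activator has to back the revision.
		return modeProxy
	}
	return requested
}

// activatorPopulated reports whether the condition should be marked true:
// only when the activator is actually in the data path with ready endpoints.
func activatorPopulated(requested mode, revisionReady, activatorReady int) bool {
	return effectiveMode(requested, revisionReady, activatorReady) == modeProxy
}

func main() {
	fmt.Println(activatorPopulated(modeProxy, 0, 0)) // no activator endpoints
	fmt.Println(activatorPopulated(modeProxy, 0, 3)) // activator backs the revision
	fmt.Println(activatorPopulated(modeServe, 2, 3)) // revision serves directly
}
```

Deriving the condition from the effective mode rather than the requested one keeps it consistent with what is wired into the public Service.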

dprotaso avatar Jun 25 '24 19:06 dprotaso

Also thanks for the investigation everyone - it helped me be more informed on the path to take here.

dprotaso avatar Jun 25 '24 19:06 dprotaso

hey @yenniechen just following up

dprotaso avatar Jul 09 '24 21:07 dprotaso

hey @yenniechen just following up

So do you have any suggestions? I thought this PR would solve this problem.

yenniechen avatar Jul 12 '24 04:07 yenniechen

@dprotaso

yenniechen avatar Jul 15 '24 11:07 yenniechen

This Pull Request is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen with /reopen. Mark as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] avatar Oct 14 '24 01:10 github-actions[bot]