Distinguish zero concurrency from slow/failed scraping when bucketing
Describe the feature
Currently we do not differentiate between a scrape that actually reports zero concurrency from a replica and simply not having data for a particular bucket. This is fine when the network is fast and the autoscaler is not overloaded, because we will have data roughly every second, but on a slow or overloaded network (or e.g. with a resource-constrained host => slow queue-proxy responses to scrapes) it could cause issues: when we average over the bucket we could think we have lower load than we do, and scale down (or fail to scale up) replicas incorrectly.
(This is somewhat related to https://github.com/knative/serving/issues/8377 in that if we introduce a work pool there's a greater danger of things backed up in the queue not getting stats every second).
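To make the skew concrete, here is a minimal sketch (the numbers and the simple averaging are made up for illustration; this is not the actual bucketing code): if missed scrapes are folded into the windowed average as zeros, the average drops even though the real load never changed.
package main

import "fmt"

// Hypothetical per-second concurrency buckets for one averaging window.
// A value of -1 marks a second where the scrape never came back (slow
// network, overloaded autoscaler, resource-constrained host, ...).
var window = []float64{50, 50, -1, -1, 50, 50}

func main() {
	sum := 0.0
	observed := 0
	for _, c := range window {
		if c < 0 {
			// Current behaviour as described above: a missing bucket
			// contributes 0 to the average, exactly like real idleness.
			continue
		}
		sum += c
		observed++
	}
	fmt.Printf("average treating gaps as zero: %.1f\n", sum/float64(len(window))) // 33.3
	fmt.Printf("average over observed buckets: %.1f\n", sum/float64(observed))    // 50.0
}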
cc @markusthoemmes @vagababov for thoughts
Actually it's the same as the @duglin issue we revisited earlier :)
Want to find it and dupe?
Done.
FWIW I think this isn't totally the same as https://github.com/knative/serving/issues/8390. In https://github.com/knative/serving/issues/8390 @duglin is scraping at the correct rate, but the pockets of zero concurrency lead us to end up with fewer replicas than we actually need for peak load (because the simulated workload is a GitHub trigger firing multiple parallel events every 10 seconds or so, and we average over the full window). That one can potentially be fixed by the max-vs-average flag we've been informally chatting about: I'll pull out a top-level issue for that now.
This one, I think, is slightly different. When the network is slow, or blips, we can get zero scaling data for a few seconds (or longer), and our current behaviour is to treat any gaps in data as if we'd actually seen concurrency zero. This means that if the scraper loses connectivity to the pods, or the network is temporarily congested, we can start to rapidly scale down the workload as fast as max-scale-down-rate will let us. The 'max' switch described above, which would potentially help bursty loads, would cope with this slightly better, but I think it's a cross-cutting problem we should solve in both cases: for example by assuming the rolling average rather than 0 when we miss a scrape.
Edit: Spun out https://github.com/knative/serving/issues/9092.
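A minimal sketch of the "assume the rolling average rather than 0" idea (the helper name and numbers are invented; this is not Knative's aggregation code): when a scrape is missed, back-fill the bucket with the average of the data we do have instead of zero.
package main

import "fmt"

// recordScrape appends one per-second bucket to the window. ok=false means
// the scrape failed or timed out; instead of recording 0, back-fill with the
// rolling average of the buckets we already have.
func recordScrape(window []float64, value float64, ok bool) []float64 {
	if !ok && len(window) > 0 {
		avg := 0.0
		for _, v := range window {
			avg += v
		}
		return append(window, avg/float64(len(window)))
	}
	if !ok {
		return append(window, 0) // no history yet; nothing better to assume
	}
	return append(window, value)
}

func main() {
	w := []float64{50, 50}
	w = recordScrape(w, 0, false) // missed scrape: back-filled with 50, not 0
	w = recordScrape(w, 55, true) // successful scrape recorded as-is
	fmt.Println(w)                // [50 50 50 55]
}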
Since #8390 was closed, I want to add my testcase to this one, because I'm still seeing odd behavior even now that #9092 has been merged.
Script:
#!/bin/bash
set -e
PAUSE=${1:-10}   # seconds to sleep between bursts
echo "Erasing old service"
kn service delete echo > /dev/null 2>&1 && sleep 5
URL=`kn service create echo --image duglin/echo --concurrency-limit=1 | tail -1`
echo "Service: $URL"
echo "Pause: $PAUSE"
for i in `seq 1 20` ; do
  echo -n "$i : Running 50... "
  # Fire 50 concurrent requests (each sleeps 10s server-side) and report
  # the wall-clock time for the whole burst.
  (time (
    for i in `seq 1 50` ; do
      curl -s ${URL}?sleep=10 > /dev/null &
    done
    wait )
  ) 2>&1 | grep real | tr '\n' '\0'
  echo -n " # pods: "
  kubectl get pods | grep 'echo.*Running' | wc -l
  sleep $PAUSE
done
kn service delete echo
And the output I see today:
$ ./bug
Erasing old service
Service: http://echo-default.kndev.us-south.containers.appdomain.cloud
Pause: 10
1 : Running 50... real 0m19.686s # pods: 72
2 : Running 50... real 0m10.168s # pods: 72
3 : Running 50... real 0m10.177s # pods: 72
4 : Running 50... real 0m10.181s # pods: 48
5 : Running 50... real 0m10.186s # pods: 72
6 : Running 50... real 0m10.162s # pods: 72
7 : Running 50... real 0m10.183s # pods: 36
8 : Running 50... real 0m10.190s # pods: 72
9 : Running 50... real 0m10.174s # pods: 72
10 : Running 50... real 0m10.192s # pods: 36
11 : Running 50... real 0m10.204s # pods: 72
12 : Running 50... real 0m10.171s # pods: 72
13 : Running 50... real 0m10.191s # pods: 36
14 : Running 50... real 0m20.176s # pods: 72
15 : Running 50... real 0m10.192s # pods: 72
16 : Running 50... real 0m10.189s # pods: 72
17 : Running 50... real 0m20.139s # pods: 72
18 : Running 50... real 0m10.186s # pods: 72
19 : Running 50... real 0m10.213s # pods: 72
20 : Running 50... real 0m10.188s # pods: 39
Service 'echo' successfully deleted in namespace 'default'.
Notice how the number of pods isn't consistent, and its dipping below 50 doesn't seem right. But the occasional 2x latency is obviously the biggest concern.
Using:
URL=`kn service create echo --image duglin/echo --concurrency-limit=1 \
--annotation-revision autoscaling.knative.dev/scaleDownDelay=120s \
--annotation-revision autoscaling.knative.dev/window=6s \
helped w.r.t. latency - it was around 10 seconds consistently. However, I had 72 pods the entire time, which just doesn't seem right when I only have 50 requests. Yes, I know that TU (70%) is probably why I get an extra 22 pods, but from a user's POV it's hard to explain. I wonder if we need to make it clearer that this "utilization" isn't just per pod but across all pods, and really should be looked at as a kind of "over-provisioning" flag. Then it's clear that anything other than 100% means they're asking for "extra" unused space. And this space is calculated across all pods, not just within one.
Meaning, (# of requests) / (CC * TU%) == # of pods they should see
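For concreteness, a back-of-the-envelope version of that calculation (a sketch only; it ignores panic mode, burst capacity, min/max scale and anything else the real autoscaler considers):
package main

import (
	"fmt"
	"math"
)

// expectedPods is a rough steady-state estimate: in-flight requests divided
// by the usable concurrency per pod (CC scaled down by target utilization).
func expectedPods(inflight, cc, tu float64) int {
	return int(math.Ceil(inflight / (cc * tu)))
}

func main() {
	fmt.Println(expectedPods(50, 1, 0.70)) // 72 -- matches the run above
	fmt.Println(expectedPods(50, 1, 1.00)) // 50 -- the TU=100 runs below
}
With TU=100% the usable concurrency per pod equals CC, which lines up with the roughly 50 pods seen in the runs below.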
Run with TU=95%? :)
Just FYI:
$ cat bug
#!/bin/bash
set -e
PAUSE=${PAUSE:-20}   # seconds to sleep between bursts
COUNT=${COUNT:-20}   # number of bursts
SDD=${SDD:-6s}       # scale-down delay
TU=${TU:-100}        # target utilization percentage
WIN=${WIN:-6s}       # autoscaling window
CC=${CC:-1}          # container concurrency
echo "Erasing old service"
kn service delete echo > /dev/null 2>&1 && sleep 5
URL=`kn service create echo --image duglin/echo --concurrency-limit=$CC \
  --annotation-revision autoscaling.knative.dev/scaleDownDelay=$SDD \
  --annotation-revision autoscaling.knative.dev/targetUtilizationPercentage=$TU \
  --annotation-revision autoscaling.knative.dev/window=$WIN \
  | tail -1`
echo "Service: $URL"
echo "PAUSE: $PAUSE"
echo "CC: $CC"
echo "SDD: $SDD"
echo "TU: $TU"
echo "WIN: $WIN"
for i in `seq 1 $COUNT` ; do
  echo -n "$i : Running 50... "
  (time (
    for i in `seq 1 50` ; do
      curl -s ${URL}?sleep=10 > /dev/null &
    done
    wait )
  ) 2>&1 | grep real | tr '\n' '\0'
  echo -n " # pods: "
  kubectl get pods | grep 'echo.*Running' | wc -l
  sleep $PAUSE
done
kn service delete echo
$ PAUSE=20 ./bug
Erasing old service
Service: http://echo-default.kndev.us-south.containers.appdomain.cloud
PAUSE: 20
CC: 1
SDD: 6s
TU: 100
WIN: 6s
1 : Running 50... real 0m18.245s # pods: 50
2 : Running 50... real 0m20.091s # pods: 49
3 : Running 50... real 0m20.108s # pods: 50
4 : Running 50... real 0m20.143s # pods: 50
5 : Running 50... real 0m19.347s # pods: 50
6 : Running 50... real 0m20.105s # pods: 49
7 : Running 50... real 0m20.155s # pods: 50
8 : Running 50... real 0m20.155s # pods: 50
9 : Running 50... real 0m20.163s # pods: 50
10 : Running 50... real 0m19.596s # pods: 50
11 : Running 50... real 0m20.126s # pods: 50
12 : Running 50... real 0m19.307s # pods: 50
13 : Running 50... real 0m20.131s # pods: 50
14 : Running 50... real 0m20.097s # pods: 50
15 : Running 50... real 0m19.604s # pods: 50
16 : Running 50... real 0m20.141s # pods: 50
17 : Running 50... real 0m20.132s # pods: 50
18 : Running 50... real 0m20.116s # pods: 49
19 : Running 50... real 0m20.181s # pods: 50
20 : Running 50... real 0m20.179s # pods: 49
$ PAUSE=10 ./bug
Erasing old service
Service: http://echo-default.kndev.us-south.containers.appdomain.cloud
PAUSE: 10
CC: 1
SDD: 6s
TU: 100
WIN: 6s
1 : Running 50... real 0m19.904s # pods: 50
2 : Running 50... real 0m10.185s # pods: 29
3 : Running 50... real 0m10.182s # pods: 50
4 : Running 50... real 0m20.176s # pods: 50
5 : Running 50... real 0m10.180s # pods: 50
6 : Running 50... real 0m10.168s # pods: 25
7 : Running 50... real 0m20.164s # pods: 34
8 : Running 50... real 0m19.300s # pods: 50
9 : Running 50... real 0m10.193s # pods: 50
10 : Running 50... real 0m10.180s # pods: 50
11 : Running 50... real 0m20.177s # pods: 50
12 : Running 50... real 0m10.174s # pods: 49
13 : Running 50... real 0m10.187s # pods: 26
14 : Running 50... real 0m10.167s # pods: 50
15 : Running 50... real 0m20.191s # pods: 32
16 : Running 50... real 0m18.220s # pods: 50
17 : Running 50... real 0m10.186s # pods: 50
18 : Running 50... real 0m10.174s # pods: 50
19 : Running 50... real 0m20.141s # pods: 50
20 : Running 50... real 0m10.190s # pods: 35
Service 'echo' successfully deleted in namespace 'default'.
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
/reopen /remove-lifecycle stale
@vagababov @julz
Is this still an issue? Would this be a "good first issue" in the autoscaling area?
/triage needs-user-input
(I'll also point out that this bug timed out, so if it's a major issue, we may need to reconsider our priorities. If it's not a major issue, we may want to consider allowing it to time out again.)
Is this still an issue? Would this be a "good first issue" in the autoscaling area?
Unfortunately not. What sounds easy is actually a bit tricky because of how we do metric aggregation. Having said that, it's possible the new pluggable aggregation stuff @vagababov has added may make this more tractable 🤔.
/remove-triage needs-user-input /triage accepted
I'm not sure that Victor is going to land anything here; is this still an issue, and what priority?
I do think it's a legit issue ("we do not distinguish failed/slow scrapes from zero concurrency, and we should, or we'll potentially scale down due to network blips") that needs more work to progress than a "good first issue" should. If we had a 'this is something someone who wants something meaty could work on' tag, I'd add that to this.
/help
@evankanderson: This request has been marked as needing help from a contributor.
/assign
/unassign
/assign
After researching some possible approaches, here is what I found as possible options:
1. If the metrics cannot be scraped, avoid scaling decisions overall. This could be a new mechanism (waiting until the window refreshes with new data before making any scaling decisions) or simulated by putting calculations into the window, instead of 0, that would result in the same number of pods (this may affect later scaling decisions when data becomes available again).
2. Use the rolling average instead of 0 (seems to be discussed above).
3. The default scale-down-delay option exists and could mitigate this issue if we want to adjust the default value.
4. If pods fail to scrape, use the highest recently calculated concurrency or average.
5. If a pod fails to scrape, calculate the average from the statistics that have been gathered. In the case of a scale down, assume the maximum concurrency/rps specified on the service for those pods (unsure what number to use when no maximum concurrency/rps is specified, open to suggestion). In the case of a scale up, assume 0. After this, recompute. If the scaling direction changes, avoid scaling decisions. This is how the Kubernetes HPA handles autoscaling when metrics are unavailable on pods. (A rough sketch of this recompute step follows below.)
Please let me know your thoughts and which approach you think is best. Also, depending on the approach, how should it interact with the option flag that uses the maximum for autoscaler decisions?
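To make option 5 concrete, here is the rough sketch mentioned above (loosely modeled on the HPA's treatment of missing pod metrics; the function and its simple concurrency math are illustrative, not Knative code):
package main

import (
	"fmt"
	"math"
)

// desiredScale computes a provisional direction from the pods that did
// report, then redoes the calculation with pessimistic values substituted
// for the silent pods and holds in place if the direction flips.
func desiredScale(reported []float64, silentPods, current int, ccPerPod float64) int {
	sum := 0.0
	for _, c := range reported {
		sum += c
	}
	provisional := int(math.Ceil(sum / ccPerPod))

	assumed := 0.0 // scale up (or hold): assume silent pods are idle
	if provisional < current {
		assumed = ccPerPod // scale down: assume silent pods are fully busy
	}
	rechecked := int(math.Ceil((sum + assumed*float64(silentPods)) / ccPerPod))

	if (provisional < current) != (rechecked < current) {
		return current // pessimistic recompute disagrees on direction: do nothing
	}
	return rechecked
}

func main() {
	// 10 pods, CC=1: 6 report concurrency 0.2 each, 4 are unreachable.
	fmt.Println(desiredScale([]float64{0.2, 0.2, 0.2, 0.2, 0.2, 0.2}, 4, 10, 1)) // 6
}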
Hey, I haven't thought about this deeply, but it might be worth thinking about the various options in a few different scenarios:
Scenario A
A small number of hosts on the cluster are overloaded and/or have poor reachability from the autoscaler. In this case, the autoscaler knows about N replicas, but only receives reports from M < N replicas. What reasonable assumptions can we make about the remaining N - M replicas?
Scenario B
The autoscaler itself or the cluster network is temporarily overloaded or cut off. In this case, the autoscaler knows about N replicas, but receives 0 reports. IMO, in this case, the best behavior is to "freeze in place" and do no harm by avoiding scaling the service up or down. What happens when we exit the communications blackout? Will we temporarily be in scenario A?
It's possible there is also a scenario C, which is similar to scenario A except that the unavailable scaling metrics move around the cluster more frequently: i.e. rather than having (L, M, N, O, P) where L and M are consistently unreachable for metrics, one cycle is missing L and O, the next cycle is missing M and P, then N and O, etc. I don't know if this affects the desired behavior or available information in any sort of useful way.
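For scenario A, one possible (purely illustrative) assumption is to extrapolate from the M replicas that did report rather than counting the other N - M as idle; the sketch below does exactly that, with scenario B falling out as the degenerate case where nothing reported.
package main

import "fmt"

// estimateTotal extrapolates total concurrency from the M pods that reported
// to all N pods, instead of counting the N - M silent pods as zero. It bakes
// in the assumption that silent pods look like reporting ones, which is
// exactly the assumption scenario A is asking about.
func estimateTotal(reports []float64, totalPods int) float64 {
	if len(reports) == 0 {
		// Scenario B: no reports at all. Extrapolation has nothing to work
		// with; "freezing in place" (reusing the previous estimate) is likely
		// safer than returning 0 here.
		return 0
	}
	sum := 0.0
	for _, r := range reports {
		sum += r
	}
	return sum / float64(len(reports)) * float64(totalPods)
}

func main() {
	// N = 10 pods, only M = 7 reachable, each reporting concurrency 1.0.
	fmt.Println(estimateTotal([]float64{1, 1, 1, 1, 1, 1, 1}, 10)) // 10, not 7
}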
@evankanderson thank you for the thorough response! After reading through and thinking about these scenarios, here is what I deduced for each of the options above:
1. In scenario A, if we avoided making scaling decisions whenever some pods had poor reachability, then with a large number of replicas or hosts and intermittent network issues the autoscaler could end up avoiding scaling far too often.
2. This seems like the safest option, since it is predictable, will stop rapid scale up and scale down, and will keep the replica count relatively stable. However, similar to 1, if we have many pods and some have poor reachability (scenario A) we could end up stagnating. Maybe adding a ratio of failed vs. successful scrapes could help avoid constantly staying in place when only a few pods fail to scrape (this might require changes to how we calculate the number of pods to scale to)?
3. If pods in scenario A or C constantly fail to scrape for a period, the same problem is present anyway.
4. This is similar to 2, but it errs on the side of keeping more replicas (similar to the use-max option).
5. This is a more conservative option. It also has a chance of freezing in place, but less so the more pods we get metrics for (which is good, since that acts like a built-in ratio). It also seems like a good option since a similar approach works for the Kubernetes HPA.
After this analysis, I believe the best candidates are options 2 and 5. Please let me know your thoughts.