Knative scaling envelope
/area scaling /area monitoring /area networking
Main question: Are there any official or anecdotal accounts of how knative scales?
Background
We have been running Knative Serving in production for around a year now, and I've been wondering how it will continue to scale as the number of Knative services in our cluster grows. We currently have ~1000 services, with old revisions aggressively garbage collected as new ones are deployed. I've searched for documentation on how Knative scales under different conditions but couldn't find any. Are there any official or anecdotal accounts of how Knative scales? I'd be particularly interested in anything known with regard to:
- Number of services/revisions
- Amount of traffic distributed amongst those services
- Anything else other users would find helpful
From my initial observations, most of the internal components perform well with the default resource allocations. Each new service appears to add to the memory footprint of the components (controller/activator/autoscaler/etc.), and the growth looks linear. For our workloads I'd expect we could scale another 2-3x before hitting memory issues, and even those should be mitigated by giving the components more resources.
I guess something like this could be useful.
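To put some rough numbers behind the linear-memory observation, here is a minimal sketch of how it could be sampled. It assumes the Python kubernetes client, cluster access, and metrics-server, and it simply treats every pod in the knative-serving namespace as a control-plane component; recording the pair (service count, per-component memory) over time should show whether the growth really is linear.

```python
from kubernetes import client, config

config.load_kube_config()
custom = client.CustomObjectsApi()

# Count Knative Services across the cluster (serving.knative.dev/v1).
ksvcs = custom.list_cluster_custom_object(
    group="serving.knative.dev", version="v1", plural="services")
print(f"knative services: {len(ksvcs['items'])}")

# Current memory usage of the Serving control-plane pods via metrics.k8s.io
# (requires metrics-server). Plot this against the service count over time.
metrics = custom.list_namespaced_custom_object(
    group="metrics.k8s.io", version="v1beta1",
    namespace="knative-serving", plural="pods")
for pod in metrics["items"]:
    for container in pod["containers"]:
        print(pod["metadata"]["name"], container["name"],
              container["usage"]["memory"])
```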
@howespt: The label(s) area/scaling cannot be applied, because the repository doesn't have them.
https://knative.dev/docs/serving/autoscaling/kpa-specific/#scale-up-rate
https://knative.dev/docs/serving/load-balancing/target-burst-capacity/
Maybe these help?
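For reference, both of those knobs are cluster-level settings in the config-autoscaler ConfigMap in the knative-serving namespace. A quick way to check what a cluster currently uses, sketched with the Python kubernetes client (keys that are absent mean the built-in defaults apply):

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Both settings live in the config-autoscaler ConfigMap (knative-serving ns).
cm = core.read_namespaced_config_map("config-autoscaler", "knative-serving")
data = cm.data or {}
for key in ("max-scale-up-rate", "target-burst-capacity"):
    print(key, "=", data.get(key, "<default>"))
```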
@jwcesign I think the question was more around whether there are any metrics on how many unique services Knative can handle, not how a single service can scale to multiple instances.
I've never seen anything like this. @pastjean
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
This issue or pull request is stale because it has been open for 90 days with no activity.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
/lifecycle stale