
Bound Concurrency of Pod Scraping (Probably with a Work Pool)

Open julz opened this issue 4 years ago • 29 comments

Describe the feature

We've discussed this a few times in slack but it seems worth having an issue for it and I didn't spot one, so here one is :).

As of right now we spawn pod scrapes in parallel for ~every revision, with no real bound on the number we'll attempt at once across revisions. This means we can potentially have an extremely large number of active outgoing scrapes, causing more contention than needed, exceeding keep-alive connection pool sizes, buffer pools, etc. In our larger environments, we see large numbers of connections waiting on DNS and GC when revisions*pods gets large.

We should probably introduce some sort of work pool for the scrapers.

julz avatar Jun 19 '20 08:06 julz
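To make the proposal concrete, here is a minimal sketch of the kind of bounded work pool being suggested; the type names, fields, and pool size are illustrative assumptions, not the actual scraper API:

```go
package scrape

import "sync"

// scrapeTask identifies one pod scrape; the fields are illustrative.
type scrapeTask struct {
	revision string
	podIP    string
}

// workPool runs scrape tasks on a fixed number of workers, bounding the
// total number of in-flight scrapes across all revisions.
type workPool struct {
	tasks chan scrapeTask
	wg    sync.WaitGroup
}

func newWorkPool(workers int, scrape func(scrapeTask)) *workPool {
	p := &workPool{tasks: make(chan scrapeTask, workers)}
	for i := 0; i < workers; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for t := range p.tasks {
				scrape(t) // at most `workers` scrapes run concurrently
			}
		}()
	}
	return p
}

// Submit blocks once all workers are busy and the buffer is full,
// applying backpressure instead of spawning unbounded goroutines.
func (p *workPool) Submit(t scrapeTask) { p.tasks <- t }

// Close stops accepting work and waits for in-flight scrapes to finish.
func (p *workPool) Close() {
	close(p.tasks)
	p.wg.Wait()
}
```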

In the same vein, we should likely switch the probes to a workqueue-based solution (like the net-XXX probers) to be able to more easily control the lifecycle of the respective probes.

markusthoemmes avatar Jun 19 '20 08:06 markusthoemmes
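For reference, a rough sketch of what a workqueue-driven scrape loop could look like, using k8s.io/client-go's workqueue in the spirit described above; the function names and the scrapeRevision callback are assumptions for illustration, not the existing code:

```go
package scrape

import (
	"k8s.io/client-go/util/workqueue"
)

// processNextScrape pops one revision key off the queue, scrapes it, and
// re-enqueues it with rate limiting on failure. Running N of these loops
// bounds scrape concurrency to N, and a revision can simply be dropped from
// the queue to stop scraping it.
func processNextScrape(q workqueue.RateLimitingInterface, scrapeRevision func(key string) error) bool {
	key, shutdown := q.Get()
	if shutdown {
		return false
	}
	defer q.Done(key)

	if err := scrapeRevision(key.(string)); err != nil {
		q.AddRateLimited(key) // retry with backoff
		return true
	}
	q.Forget(key) // clear rate-limit history on success
	return true
}
```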

For color: revisions*pods is exactly 16*revisions as the pod count goes towards infinity, i.e. it's bounded in the number of pods per revision, but not in the number of revisions.

vagababov avatar Jun 19 '20 17:06 vagababov

> In the same vein, we should likely switch the probes to a workqueue-based solution (like the net-XXX probers) to be able to more easily control the lifecycle of the respective probes.

Here I'm a bit on the fence, given that we probably have requests queueing up in the activator that we want to start forwarding as fast as possible. And until the pod is probed it receives no requests... so 🤷

vagababov avatar Jun 19 '20 17:06 vagababov

Not sure we're talking about the same thing, @vagababov? What does the activator have to do with pod probing in the autoscaler? :thinking:

markusthoemmes avatar Jun 19 '20 17:06 markusthoemmes

We only probe during scale to zero. If we consider 500 revisions all going to zero at the same time, it's unfortunate, but it's much more of a corner case than just having 500 revisions that we need to scrape :)

vagababov avatar Jun 19 '20 18:06 vagababov

But we do a lot of probing in activator :)

vagababov avatar Jun 19 '20 18:06 vagababov

Okay, we clarified offline: In my comment about switching to workqueue based stuff I meant metric scraping, not probing. Sorry for the confusion.

markusthoemmes avatar Jun 19 '20 18:06 markusthoemmes

@julz @vagababov @markusthoemmes I could help here if possible.

skonto avatar Jul 09 '20 10:07 skonto

Yep, just know that this might be a somewhat bigger refactor of the current scraper code.

I think we first need to agree that we actually want work-pooling here as it comes with the risk of us setting the concurrency too low.

markusthoemmes avatar Jul 09 '20 11:07 markusthoemmes

so how do we want to proceed on this? Would a feature proposal doc to hash out the details and any foreseeable edge cases be the best next step?

julz avatar Jul 10 '20 07:07 julz

I believe so @julz

markusthoemmes avatar Jul 10 '20 07:07 markusthoemmes

cool - @skonto happy to work together on that if you like?

julz avatar Jul 10 '20 07:07 julz

@julz sure let's do it :) I will start a doc and we can iterate on it. I will ping you.

skonto avatar Jul 10 '20 22:07 skonto

Note that this has to be a configurable feature, since a user might wish to throw CPU at the AS to ensure enough resources to support more revisions. Finally, I'd like us to think about this within the scope of the shared autoscaler work that @yanweiguo is doing. As in, will creating more autoscalers solve this problem?

vagababov avatar Jul 13 '20 15:07 vagababov

@vagababov, as you probably noticed, we (cc @julz @markusthoemmes) had a long discussion on Slack (autoscaling channel). Creating more autoscalers is absolutely one of the options we discussed, but still, within the context of one instance there should be room for improvement. I will try to summarize here; what can be improved needs verification in any case.

The total max cost is max_scraped_pods_per_revision*number_of_revisions, where the first factor is currently 16 (so, for example, 500 revisions can mean up to 8,000 concurrent scrapes).

We could either lower the 16, or limit the two factors at any given time, e.g. control how much scraping happens or how many revisions are served concurrently.

Besides limiting concurrency, using a workqueue for scraping has some other benefits; for example, as @markusthoemmes mentioned:

> That'd enable us to easily enable/disable scraping for a specific revision, which in turn enables us to dynamically disable scraping if the activator is put into the routing path.

But there are also challenges, as mentioned in #8610. Note though that we already use a workqueue for probing.

In addition, the current max of 16 pods scraped per revision could be lowered if we had a smarter algorithm than one relying on the normal distribution. Maybe a time-series algorithm could do better here with fewer pods used in scraping.

The key question, as discussed, is whether we should limit the number of concurrently served revisions or how much scraping we do per revision at any time. For the latter, a related issue is this one. In that issue the max keep-alive connection count was increased, but that could lead to running out of file descriptors if we keep increasing it.

IMHO, from a UX and operations perspective we could make both the number of revisions served and the workqueue concurrency (or other options, if we go down that path) configurable. In analogy to how HTTP servers limit requests, we can set a global maximum per AS instance anyway, with a good default. We could also think of more advanced concepts like QoS and priorities per group of revisions (not sure if that concept exists at the moment).

skonto avatar Jul 13 '20 22:07 skonto
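One hedged way to cap both factors at once, i.e. a global concurrency limit and a per-revision limit, is a pair of weighted semaphores; the limits and function names below are illustrative only:

```go
package scrape

import (
	"context"

	"golang.org/x/sync/semaphore"
)

// globalSem caps scrapes across all revisions; 256 is a placeholder value.
var globalSem = semaphore.NewWeighted(256)

// scrapeRevisionPods scrapes up to perRevision pods for one revision (the
// role today's "16" plays) while also respecting the autoscaler-wide limit,
// so revisions*pods can no longer grow without bound.
func scrapeRevisionPods(ctx context.Context, pods []string, perRevision int64, scrapePod func(string)) error {
	revSem := semaphore.NewWeighted(perRevision)
	for _, pod := range pods {
		if err := globalSem.Acquire(ctx, 1); err != nil {
			return err
		}
		if err := revSem.Acquire(ctx, 1); err != nil {
			globalSem.Release(1)
			return err
		}
		go func(p string) {
			defer globalSem.Release(1)
			defer revSem.Release(1)
			scrapePod(p)
		}(pod)
	}
	// Acquiring the full weight waits for this revision's scrapes to drain.
	return revSem.Acquire(ctx, perRevision)
}
```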

@vagababov @julz should we move forward with a workqueue approach, build a first version, and take it from there?

skonto avatar Jul 20 '20 14:07 skonto

We might first want to think about the guarding measures we want to take. That would make accepting/denying this a no-brainer imo.

How do we want to measure success/failure, or rather better/worse, once we do this?

markusthoemmes avatar Jul 20 '20 14:07 markusthoemmes

@markusthoemmes from my POV, the definition of success would be to add configuration capabilities while keeping the same performance as in the unbounded case, and to avoid corner cases like the one described in #8610.

skonto avatar Jul 20 '20 14:07 skonto

Hm, but wouldn't we want to show that implementing this generates an advantage over the existing way of doing things (which doesn't require hand-tuning either)?

We might want to produce a micro-benchmark that gives us an idea of the scale needed to reap benefits, and of the scale at which things are pushed so far that they start to degrade.

markusthoemmes avatar Jul 20 '20 15:07 markusthoemmes

The way I see this is that the workqueue imposes a rate-limiting strategy in order to optimize revision-scraping throughput and latency. The current approach, from an operations perspective, implicitly requires hand-tuning if you have to increase your connection pool size in large environments, and btw there is no control over bursty traffic. Setting a configurable limit brings its own value, in my view, if it protects you from failures, keeps the system stable, and helps pending work, e.g. revision processing, make progress. For the latter we had the question of what matters first: revision progress or the scraping process in general.
So yes, we also need to show how the workqueue will help us without increasing connection pools in large environments, because if at the end of the day the unbounded approach makes progress and just queues up work, then who cares. However, I suspect there are already issues, e.g. GC, as stated in the description.

skonto avatar Jul 20 '20 17:07 skonto
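To make the "rate-limiting strategy" above concrete, here is one possible way to construct a limiter for such a queue, combining per-item exponential backoff with an overall scrapes-per-second cap; the numbers are placeholders, not proposed defaults:

```go
package scrape

import (
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

// newScrapeQueue builds a rate-limited queue for scrape keys: failed items
// back off exponentially (5ms up to 30s), and the queue as a whole is capped
// at 100 scrapes/s with bursts of 200, regardless of how many revisions exist.
func newScrapeQueue() workqueue.RateLimitingInterface {
	limiter := workqueue.NewMaxOfRateLimiter(
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 30*time.Second),
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(100), 200)},
	)
	return workqueue.NewRateLimitingQueue(limiter)
}
```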

I would also want to see how StatefulSet AS scale-out works, since I don't want us to complicate the system if there's no necessity to.

vagababov avatar Jul 20 '20 17:07 vagababov

@vagababov reading these benchmarks from the ants lib shows that unbounded goroutines (independently of scraping) are most likely not the best choice beyond a certain concurrency level, e.g. 100K tasks in that benchmark. So I think we need a better approach anyway, because I suspect that although scaling out would solve the problem, an optimized instance could lead to fewer replicas being needed.

skonto avatar Jul 21 '20 10:07 skonto

100K goroutines is probably well above what we support per task in any case...

vagababov avatar Jul 21 '20 20:07 vagababov

@julz describes an issue that occurs in large environments, so we need to clarify how many pods are being scraped when the issues occur. Btw, in theory, if we follow the guidelines for building large clusters, we could have thousands of pods, with a limit of 150K. However, the benchmarks I mentioned depend on the processing being done, which in their case is a dummy function; I don't think the threshold is the same with a realistic workload. In the latter case the cost per scraping request could be much higher in milliseconds, and it could use more resources because of HTTP connections in exceeded keep-alive pools, etc. I think a benchmark could shed more light on this, as also suggested by @markusthoemmes (assuming it's realistic enough).

skonto avatar Jul 21 '20 21:07 skonto

Yeah, let's write a Golang benchmark (I don't think an E2E one is necessary) and see if we can find a break-even point between just spawning new goroutines and pooling them for our workloads.

markusthoemmes avatar Jul 22 '20 05:07 markusthoemmes
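A rough shape such a micro-benchmark could take, comparing unbounded goroutine spawning against a fixed worker pool; the simulated scrape is a placeholder, and a realistic version would issue actual HTTP scrapes:

```go
package scrape

import (
	"sync"
	"testing"
	"time"
)

// simulateScrape stands in for one pod scrape; a realistic benchmark would
// perform a real HTTP request instead of sleeping.
func simulateScrape() { time.Sleep(100 * time.Microsecond) }

func BenchmarkUnboundedGoroutines(b *testing.B) {
	var wg sync.WaitGroup
	for i := 0; i < b.N; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			simulateScrape()
		}()
	}
	wg.Wait()
}

func BenchmarkWorkerPool(b *testing.B) {
	const workers = 256 // placeholder pool size
	tasks := make(chan struct{}, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range tasks {
				simulateScrape()
			}
		}()
	}
	for i := 0; i < b.N; i++ {
		tasks <- struct{}{}
	}
	close(tasks)
	wg.Wait()
}
```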

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] avatar Oct 21 '20 01:10 github-actions[bot]

/reopen

/lifecycle frozen

nader-ziada avatar May 17 '22 18:05 nader-ziada

@nader-ziada: Reopened this issue.

In response to this:

/reopen

/lifecycle frozen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

knative-prow[bot] avatar May 17 '22 18:05 knative-prow[bot]

/triage accepted
/milestone Icebox

dprotaso avatar Aug 02 '22 18:08 dprotaso