
Support CPU/GPU consumption metrics in addition to request count

Open gurvindersingh opened this issue 5 years ago • 4 comments

Thanks for releasing this useful tool :)

What would you like to be added? Currently it seems that the only metric supported is request count. Is there any plan to monitor consumption of CPU and GPU resources, in addition to request count, when deciding whether a given pod is idle or not?

Why is this needed? A user might simply submit a job that then runs for hours before they check on its status again, as is common in ML model training. Taking CPU/GPU consumption into account would avoid killing the pod while the analysis is still running.

gurvindersingh avatar Dec 13 '18 17:12 gurvindersingh

We've always assumed that in the future, scaling decisions could be made on a variety of metrics. That being said, Osiris is presently little more than a proof of concept, and this feature isn't among those I imagine making the cut for a minimum viable product.

I would expect a road map to be forthcoming in January.

krancour avatar Dec 13 '18 20:12 krancour

Thanks @krancour for the info. Do you plan to collect these extra metrics from Prometheus, from Heapster, or using some other custom method?

gurvindersingh avatar Dec 14 '18 19:12 gurvindersingh
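(For context: the kind of per-pod CPU/memory figures being asked about are exposed by the Kubernetes Metrics API, metrics.k8s.io, served by metrics-server, which superseded Heapster. The sketch below is purely illustrative and is not part of Osiris; GPU usage is not exposed by metrics-server at all and would typically come from a vendor exporter scraped by something like Prometheus.)

```go
// Illustrative only: reading per-pod CPU/memory usage from the Kubernetes
// Metrics API (metrics.k8s.io). Not part of Osiris.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/rest"
	metricsclient "k8s.io/metrics/pkg/client/clientset/versioned"
)

func main() {
	// Assumes this runs in-cluster with RBAC permission to read metrics.k8s.io
	// and that metrics-server is installed in the cluster.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	mc, err := metricsclient.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// List current CPU/memory usage for all pods in the "default" namespace.
	// (Older client versions take List without a context argument.)
	podMetrics, err := mc.MetricsV1beta1().PodMetricses("default").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pm := range podMetrics.Items {
		for _, c := range pm.Containers {
			fmt.Printf("%s/%s cpu=%s mem=%s\n",
				pm.Name, c.Name, c.Usage.Cpu().String(), c.Usage.Memory().String())
		}
	}
}
```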

We originally used Prometheus for collecting request counts, but eventually decided that a dependency on Prometheus, configured a certain way, possibly in addition to a Prometheus instance you might already be running for other purposes, was an unacceptably high barrier to entry for something that's supposed to help you bring resource utilization down. (You have to ask yourself where the break-even point is: how many workloads do you have to scale to zero to justify the extra components you have to run to make that possible?) We eventually decided that Prometheus was, perhaps, overkill for the one metric we're currently collecting, and we cut it out completely.

In the future, making scaling decisions based on something other than request count will likely require that we re-examine how we intend to collect those metrics. tbh, when we start putting a road map together, I'm not even sure how high a priority that will be, as I can guess that things like support for HTTPS, HTTP/2, and other protocols (which you have also asked about) will probably emerge as more pressing concerns.

krancour avatar Dec 14 '18 19:12 krancour
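(As a rough illustration of what collecting request counts in-process, without Prometheus, can look like: a sidecar-style reverse proxy that increments a counter and records the time of the last request it forwards. This is a minimal sketch, not Osiris's actual implementation, and the addresses and interval are arbitrary.)

```go
// Minimal sketch (not Osiris's implementation) of counting requests in-process
// with a reverse proxy, instead of scraping a metrics store like Prometheus.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
	"time"
)

func main() {
	// Hypothetical backend address; in a sidecar this would be the app container.
	backend, err := url.Parse("http://127.0.0.1:8080")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(backend)

	var requestCount uint64
	var lastRequest atomic.Value
	lastRequest.Store(time.Now())

	// Wrap the proxy so every forwarded request bumps the counter and
	// refreshes the last-request timestamp.
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		atomic.AddUint64(&requestCount, 1)
		lastRequest.Store(time.Now())
		proxy.ServeHTTP(w, r)
	})

	// Periodically report the count; a real implementation would expose this
	// to whatever component makes the scale-to-zero decision.
	go func() {
		for range time.Tick(30 * time.Second) {
			log.Printf("requests so far: %d, last request: %v",
				atomic.LoadUint64(&requestCount), lastRequest.Load())
		}
	}()

	log.Fatal(http.ListenAndServe(":8000", handler))
}
```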

Looking at the issue queue here and considering closing this...

I think the scope of this project is now well understood to be pretty narrowly confined to scaling to/from zero in response, specifically, to HTTP/S requests or the absence thereof. For workloads serving HTTP/S requests, arbitrary metrics like CPU or memory pressure are not as reliable an indicator of active vs. idle as simply monitoring the traffic to the pods, as we are already doing.

krancour avatar May 21 '19 17:05 krancour
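(To make that scope concrete: the decision being described reduces to "has this workload seen any traffic within some idle window?", rather than inspecting CPU or GPU pressure. A minimal sketch of such a check follows; again, this is not Osiris's actual logic, and the window and timestamps are arbitrary.)

```go
// Illustrative idle check based purely on observed traffic, not resource usage.
// Not Osiris's actual decision logic.
package main

import (
	"fmt"
	"time"
)

// idle reports whether a workload should be considered idle, given the time of
// the last proxied request and a configurable idle window.
func idle(lastRequest time.Time, window time.Duration) bool {
	return time.Since(lastRequest) >= window
}

func main() {
	// e.g. the value tracked by the request-counting proxy sketched above.
	lastRequest := time.Now().Add(-10 * time.Minute)
	if idle(lastRequest, 5*time.Minute) {
		fmt.Println("no recent traffic: candidate for scaling to zero")
	} else {
		fmt.Println("recent traffic observed: keep running")
	}
}
```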