metering-operator

Ephemeral pods

Open · mmariani opened this issue 6 years ago · 1 comment

It is my understanding that a pod's metrics are not retained after it dies, so jobs that run for short periods (i.e. less than the scraping interval) are likely to fly under the radar of any current metering effort. By contrast, public cloud platforms are able to provide precise accounting and throttling of resources (CPU credits), although I suppose that comes at the expense of running customized hypervisors.

Today, a customer of a container platform can run a workload as a large number of small jobs and have its resource usage underestimated, am I right? Do you know whether this is a limitation Kubernetes may overcome, and whether there is any documented effort in that direction? I am aware this may depend on the container runtime as well.

Thanks

mmariani · Mar 22 '19 10:03

This is mostly an artifact of how you do monitoring. We use Prometheus, so it's pull based and suffers from many of the issues you describe. Longer term, this is something that needs to be addressed by the monitoring stack no matter what, for its own purposes and for ours. We're aware of this issue and are working with the OpenShift monitoring team to discuss how Prometheus can handle it better.

In past discussions we've come up with a few things that might help:

  • You can configure Prometheus to scrape more frequently, though doing this for everything can degrade performance, so:
  • You can dedicate specific nodes to running ephemeral workloads and configure Prometheus to scrape just those nodes more frequently (see the sketch below).
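As a rough illustration (not something this repo ships), a dedicated scrape job scoped to those nodes might look like the following; the node-name pattern, intervals, and job names are assumptions:

```yaml
# prometheus.yml fragment (illustrative): a second scrape job with a tighter
# interval, restricted to pods running on nodes reserved for ephemeral work.
scrape_configs:
  - job_name: kubernetes-pods        # everything else, normal cadence
    scrape_interval: 30s
    kubernetes_sd_configs:
      - role: pod

  - job_name: ephemeral-pods         # short-lived jobs, aggressive cadence
    scrape_interval: 5s
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods scheduled onto the dedicated ephemeral-workload nodes
      # (assumes those nodes follow a naming convention like ephemeral-*).
      - source_labels: [__meta_kubernetes_pod_node_name]
        regex: ephemeral-.*
        action: keep
```

The point is just to confine the aggressive scraping to the handful of nodes that host short-lived jobs, rather than paying that cost cluster-wide.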

I've also tried to think of ways to use Pushgateway here, but haven't come up with anything I'd consider workable.
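For context, the basic push mechanics would look something like the sketch below; the Pushgateway address, metric name, and image are illustrative assumptions, not anything this project provides:

```yaml
# Illustrative only: a short-lived Job pushes a final sample to a Pushgateway
# right before exiting, so the data outlives the pod.
apiVersion: batch/v1
kind: Job
metadata:
  name: short-task
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: curlimages/curl:8.8.0
          command:
            - sh
            - -c
            - |
              # ... do the actual short-lived work here ...
              # then push a sample keyed by the job name:
              echo 'short_task_runs_total 1' | curl --data-binary @- \
                http://pushgateway.monitoring.svc:9091/metrics/job/short-task
```

The catch, as far as I can tell, is that this only captures whatever the job pushes about itself; the CPU and memory usage samples that metering cares about come from the kubelet/cAdvisor, which the workload can't push on its own behalf.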

chancez · Mar 22 '19 17:03