prometheus-engine
Succeeded Pods are not ignored
Hi, when using PodMonitoring to monitor the Pods of Jobs (especially CronJobs), the Pods are not ignored once the Job finishes and the Pod enters the 'Succeeded' phase.
This results in the following error message:
caller=log.go:124 component="scrape manager" scrape_pool=PodMonitoring/uplift/tkimporter-monitoring/9090 level=debug target=http://10.212.54.208:9090/metrics msg="Scrape failed" err="Get \"http://10.212.54.208:9090/metrics\": context deadline exceeded"
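For context, a minimal PodMonitoring that runs into this looks roughly like the following (namespace, name, and port are taken from the scrape pool in the log above; the selector and interval are placeholders, and the apiVersion may differ depending on the operator version):

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: tkimporter-monitoring
  namespace: uplift
spec:
  # Illustrative selector - it matches whatever labels the CronJob's
  # pod template carries.
  selector:
    matchLabels:
      app: tkimporter
  endpoints:
  - port: 9090
    interval: 30s
```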
Best Regards, Felix
Thanks for reporting. CronJobs were not on our radar, as I assumed people would mostly use the Pushgateway (PGW) for them, but ignoring succeeded pods seems valid in general.
We can generally filter by phase.
Besides Succeeded, would you also expect pods in the Failed phase to be ignored?
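For context, in terms of the underlying Prometheus configuration this kind of filtering is usually expressed as a relabel rule on the pod phase meta label exposed by Kubernetes service discovery. A rough sketch (not the exact config prometheus-engine generates):

```yaml
scrape_configs:
- job_name: PodMonitoring/uplift/tkimporter-monitoring/9090
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  # Drop targets whose pod has already terminated.
  - source_labels: [__meta_kubernetes_pod_phase]
    regex: Succeeded|Failed
    action: drop
```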
I looked into the documentation: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase
It seems to me that only pods in the Running phase should be considered able to provide PodMonitoring metrics. All other phases should be ignored, from my POV.
Just thought I'd drop a quick breadcrumb here to link to how prometheus-operator handled this:
Issue: https://github.com/prometheus-operator/prometheus-operator/issues/4816
PR: https://github.com/prometheus-operator/prometheus-operator/pull/5049
in case the googleapis CRs want to mirror the coreos CRs.
Bump?
Would be great to have this; we just got a load of corrupted metrics because of this issue.
Hi @f0o,
We can take a look at supporting this. I am curious - how were metrics ingested with pods not Running? Presumably the containers wouldn't be available to scrape metrics from.
But I may be misunderstanding.
@pintohutch The issue comes when a pod reuses the same IP as one that is no longer running. The scraper then just uses the same labels as the previously discovered target (since the target still exists), but the data it gets is from a whole different pod.
This happens relatively often if you use Spot-Containers and they shift around a lot.
I see this about every two months on GKE Autopilot with Spot-Containers and only using 5 workloads total, so not even a lot of pods really.
Gotcha - thanks for the context @f0o. We'll take a look at fixing.
Worth noting that there is a financial incentive here: I added a batch CronJob that runs kubectl commands to clean up all succeeded pods, and my monitoring bill was reduced by 53%...
Even if scraping succeeded pods didn't cause metric issues, it definitely burns money for nothing.
This bug should be re-prioritized, as it is burning customer funds, possibly without them even knowing.
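For anyone needing the same stopgap in the meantime, the cleanup boils down to something like this (schedule, image, and service account name are illustrative; the service account needs RBAC to list and delete pods):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cleanup-succeeded-pods
spec:
  schedule: "0 * * * *"  # hourly; pick whatever cadence fits
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: pod-cleanup  # needs list/delete on pods
          restartPolicy: Never
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - kubectl delete pods --all-namespaces --field-selector=status.phase=Succeeded
```

Deleting the finished pods removes the stale scrape targets, which is where the savings come from.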
Whoop - sorry for the delayed response here @f0o. Putting something together now
This is now fixed and available in GKE