
Succeeded Pods are not ignored


Hi, when using PodMonitoring to monitor the Pods of Jobs (especially CronJobs), the pods are not ignored once the Job finishes and the Pod enters the 'Succeeded' phase.

This results in the following error message:

caller=log.go:124 component="scrape manager" scrape_pool=PodMonitoring/uplift/tkimporter-monitoring/9090 level=debug target=http://10.212.54.208:9090/metrics msg="Scrape failed" err="Get \"http://10.212.54.208:9090/metrics\": context deadline exceeded"

Best Regards, Felix

fkollmann avatar Feb 17 '22 22:02 fkollmann

Thanks for reporting. CronJobs were not on our radar, as I assumed people would mostly use the Pushgateway (PGW) for them, but ignoring succeeded pods seems valid in general.

We can generally filter by phase. Besides Succeeded, would you also expect pods in the Failed state to be ignored?

fabxc avatar Feb 18 '22 07:02 fabxc

I looked into the documentation: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase

It seems to me that only pods in the Running phase should be considered able to serve PodMonitoring metrics. All other phases should be ignored, from my POV.

fkollmann avatar Feb 18 '22 08:02 fkollmann
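For concreteness, a minimal sketch of what such a phase filter could look like on the operator side, written in Go against the Kubernetes API types. The package and function names are illustrative only, not taken from prometheus-engine:

package scrapefilter

import (
	corev1 "k8s.io/api/core/v1"
)

// shouldScrape reports whether a pod is a sensible scrape target based on its
// lifecycle phase alone: only Running pods can serve a /metrics endpoint, while
// Pending pods are not up yet and Succeeded/Failed/Unknown pods have already
// terminated or are unreachable.
func shouldScrape(pod *corev1.Pod) bool {
	return pod.Status.Phase == corev1.PodRunning
}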

Just thought I'd drop a quick breadcrumb here to link to how prometheus-operator handled this:

Issue: https://github.com/prometheus-operator/prometheus-operator/issues/4816
PR: https://github.com/prometheus-operator/prometheus-operator/pull/5049

in case the googleapis CRs wanted to mirror the coreos CRs

tomasgareau avatar Dec 15 '22 16:12 tomasgareau
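For reference, the prometheus-operator change linked above amounts to a relabel rule that drops discovered pod targets in the Succeeded or Failed phase. Expressed with the upstream Prometheus relabel package in Go, a roughly equivalent rule could look like the following sketch (not the actual prometheus-operator or prometheus-engine code):

package scrapefilter

import (
	"github.com/prometheus/common/model"
	"github.com/prometheus/prometheus/model/relabel"
)

// dropTerminatedPods removes discovered pod targets whose phase is Succeeded or
// Failed before they are ever scraped. The __meta_kubernetes_pod_phase label is
// provided by Prometheus' Kubernetes service discovery.
var dropTerminatedPods = &relabel.Config{
	SourceLabels: model.LabelNames{"__meta_kubernetes_pod_phase"},
	Regex:        relabel.MustNewRegexp("(Succeeded|Failed)"),
	Action:       relabel.Drop,
}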

Bump?

Would be great to have this; we just got a load of corrupted metrics because of this issue.

f0o avatar Jul 12 '23 12:07 f0o

Hi @f0o,

We can take a look at supporting this. I am curious - how were metrics ingested with pods not Running? Presumably the containers wouldn't be available to scrape metrics from.

But I may be misunderstanding.

pintohutch avatar Jul 13 '23 16:07 pintohutch

@pintohutch The issue arises when a pod reuses the same IP as one that is no longer running. The scraper then just applies the same labels as the previously discovered target (since the target still exists), but the data it gets comes from an entirely different pod.

This happens relatively often if you use Spot Pods and they shift around a lot.

I see this about every two months on GKE Autopilot with Spot Pods and only 5 workloads in total, so not even a lot of pods, really.

f0o avatar Jul 13 '23 16:07 f0o

Gotcha - thanks for the context @f0o. We'll take a look at fixing.

pintohutch avatar Jul 17 '23 17:07 pintohutch

Worth noting that there is a financial incentive here: I added a batch CronJob that runs kubectl commands to clean up all succeeded pods, and my Monitoring bill was reduced by 53%...

Even if scraping succeeded pods didn't cause metric issues, it definitely burns money for nothing.

This bug should be re-prioritized, as it is burning customer funds, possibly without them even knowing.

f0o avatar Aug 29 '23 11:08 f0o
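For anyone needing the same stopgap until a fix lands, a rough client-go equivalent of the kubectl-based cleanup described above could look like the sketch below. It is meant to run in-cluster (for example from a CronJob) and simply deletes every Succeeded pod it can list; it is an illustrative example, not part of prometheus-engine, and assumes RBAC permission to list and delete pods:

package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster configuration, since the idea is to run this as a CronJob
	// inside the same cluster.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("loading in-cluster config: %v", err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("creating clientset: %v", err)
	}
	ctx := context.Background()

	// List Succeeded pods across all namespaces, mirroring
	// `kubectl get pods -A --field-selector=status.phase=Succeeded`.
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: "status.phase==Succeeded",
	})
	if err != nil {
		log.Fatalf("listing succeeded pods: %v", err)
	}

	// Delete each pod individually so a single failure does not abort the run.
	for _, pod := range pods.Items {
		if err := client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
			log.Printf("deleting %s/%s: %v", pod.Namespace, pod.Name, err)
			continue
		}
		log.Printf("deleted %s/%s", pod.Namespace, pod.Name)
	}
}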

Whoop - sorry for the delayed response here @f0o. Putting something together now.

pintohutch avatar Nov 17 '23 22:11 pintohutch

This is now fixed and available in GKE.

pintohutch avatar Mar 22 '24 18:03 pintohutch