Add support for explicitly sanitizing jobs to popeye
Is your feature request related to a problem? Please describe. When trying to run a popeye check on my cluster, I get frustrated that pods belonging to failed iterations of jobs that eventually succeed are flagged as failures by popeye.
Describe the solution you'd like Optionally check for the final success of the job rather than if all iterations of the job succeeded.
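To make the request concrete, here is a rough sketch of the check I have in mind. It is not Popeye's actual code: the `PodRun` and `Job` types, the `failed_pods_to_report` function and the `final_status_only` option are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PodRun:
    """One pod created by a job (hypothetical type, for illustration only)."""
    name: str
    succeeded: bool


@dataclass
class Job:
    """A job and the pods from all of its attempts (hypothetical type)."""
    name: str
    completed: bool  # True once the job itself reports success
    pods: List[PodRun] = field(default_factory=list)


def failed_pods_to_report(job: Job, final_status_only: bool) -> List[str]:
    """Return the names of pods that should be flagged as failures."""
    failed = [p.name for p in job.pods if not p.succeeded]
    if final_status_only and job.completed:
        # The job eventually succeeded, so earlier failed attempts are ignored.
        return []
    # Either the option is off or the job never succeeded: report every failure.
    return failed
```

With the option off, the behaviour would stay as it is today. With it on, a job that never completes would still have its failed pods reported, so a genuine failure would not be hidden.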
Describe alternatives you've considered We currently white-list these failures, but that'll cause a genuine job failure to be missed.
Additional context If this is a feature Popeye wants, we could develop it and then contribute it back to Popeye. I searched your GitHub issues, including closed ones, and I don't think you've explicitly rejected a proposal like this before.
Thanks very much for Popeye, it's a very useful tool.
@ndavidson-pulse Thank you for this issue! I'll need to take a peek and see if we can devise a different approach to job failures. Alternatively, if you don't care about the failure history on your CronJob, you can set spec.failedJobsHistoryLimit=0. It defaults to 1.
@derailed It's not that we don't care: we do want them to succeed, just within a specified window. We wrap popeye in a script that deploys our cluster from scratch, waits for the cluster to stabilize within a defined time limit, and then runs the check. By this point all the jobs have succeeded, but a couple of them may have failed first. It's fairly random and depends on exactly how quickly core services come up.
Fixed in v0.20.0