enhancements KEP-4443: Configurable Job failure reason for PodFailurePolicyRule

One-line PR description: Configurable Job failure reason for PodFailurePolicy

Issue link: #4443

Other comments:

Feb 05 '24 21:02 danielvegamyhre

@soltysh for sig-apps lead review tagging @ahg-g @alculquicondor @kannon92 for review as well

Feb 05 '24 21:02 danielvegamyhre

@deads2k would you be able to do an API review for this? Here we are proposing adding an optional Reason field to the PodFailurePolicyRule for the Job API, similar to the Reason field I proposed here in the Job success policy KEP.

Feb 05 '24 21:02 danielvegamyhre

@wojtek-t would you mind doing a PRR review for this KEP?

Feb 06 '24 18:02 danielvegamyhre

@alculquicondor @ahg-g wdyt about going to beta for this feature? It doesn't seem too risky (as we aren't changing the behavior of PodFailurePolicy). It may require the PRR to be more detailed so we could alpha in 1.30 and then take it to beta shortly afterwards.

Feb 06 '24 19:02 kannon92

wdyt about going to beta for this feature?

We can't. A new API field needs to go through alpha.

Feb 06 '24 19:02 alculquicondor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: danielvegamyhre Once this PR has been reviewed and has the lgtm label, please ask for approval from wojtek-t and additionally assign soltysh for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

keps/prod-readiness/OWNERS
keps/sig-apps/OWNERS

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

Feb 06 '24 19:02 k8s-ci-robot

One thing I recommend is removing the comments as it is noise for the reviewer.

You should leave the PRR ones around but once you finish the section I'd drop the comments.

Are you referring to comments on the PR, or commented out lines in the README itself?

Feb 06 '24 21:02 danielvegamyhre

One thing I recommend is removing the comments as it is noise for the reviewer. You should leave the PRR ones around but once you finish the section I'd drop the comments.

Are you referring to comments on the PR, or commented out lines in the README itself?

Just the comments that come up with the template.

Feb 06 '24 21:02 kannon92

It would be nice to have a partial struct and some godoc to inspect the API change. On first blush I'm expecting something like

type PodFailurePolicyRule struct {
	Action PodFailurePolicyAction `json:"action" protobuf:"bytes,1,req,name=action"`

	OnExitCodes *PodFailurePolicyOnExitCodesRequirement `json:"onExitCodes" protobuf:"bytes,2,opt,name=onExitCodes"`

	OnPodConditions []PodFailurePolicyOnPodConditionsPattern `json:"onPodConditions" protobuf:"bytes,3,opt,name=onPodConditions"`

	// if unset will be defaulted
	FailJob *PodFailureFailJobOptions
}

type PodFailureFailJobOptions {
	// if unset, will be defaulted to "PodFailurePolicy"
	// if set, Value will be used to set the reason for the .status.condition[name="Failed"].  If unset, the reason will be "PodFailurePolicy"
	JobFailedReason *string
}

to roughly match with a discriminated union and as I understand it the reason is only valid it's for a FailJob.

If that's what you're expecting, then I think the API will be ok.

Feb 06 '24 23:02 deads2k

Similarly to my comment in job success/completion policy kep I'm reluctant to open up the Reason field in the conditions for broader usage. Those several concerns laid out in the Validations sections (and potentially others not yet identified), and the fact that we want users to rely on them, tells me we should prefer defining a closed set of those reasons. I read through the alternatives section and I'd be inclined to at least try that kind of approach, since based on my understanding of the provided use cases it should fulfill them. Only if during the first alpha testing phase the conclusion is that the simpler approach is not sufficient, I'd be willing to re-consider the proposed approach as a second alpha.

Overall, allowing the user to set the reason explicitly makes the yaml of the higher-order API clearer, instead of looking at the docs and try to figure out what the reason value looks like. What specific validation concerns do you have that are not covered in the validation section detailed in this KEP?

I am ok with the alternative too, but I prefer to have an explicit reason field as it gives more flexibility and offers a clearer API for APIs that build on top of Job.

Feb 07 '24 18:02 ahg-g

Overall, allowing the user to set the reason explicitly makes the yaml of the higher-order API clearer, instead of looking at the docs and try to figure out what the reason value looks like. What specific validation concerns do you have that are not covered in the validation section detailed in this KEP?

I am ok with the alternative too, but I prefer to have an explicit reason field as it gives more flexibility and offers a clearer API for APIs that build on top of Job.

I do recognize the value of having both the explicit job outcome reason and the further actions in the JobSet - it's undoubtedly much readable in that form. But my worry is around how other higher-level consumers can consume it? I think Kevin was asking about something similar in his comment about other use cases. My other main worry is that we will lose control over the Reason field, and based on my past experience we'll end up patching the validation around it, which after loosening validation is harder. That's why I'm leaning towards a closed set of values, instead.

Feb 07 '24 20:02 soltysh

But my worry is around how other higher-level consumers can consume it? I think Kevin was asking about something similar in his comment about other use cases.

As Aldo mentioned, the analogy here is labels, what one sets as value is what they have to use in the selector, and we are giving a concrete example of that in JobSet. The general case is that the controller creating the Job has control over the field (either by embedding the JobSep in the higher-level API or even if the Job is being created indirectly), would including the above in the motivation clarify the point?

My other main worry is that we will lose control over the Reason field, and based on my past experience we'll end up patching the validation around it, which after loosening validation is harder. That's why I'm leaning towards a closed set of values, instead.

How hard can it be to validate the field? if we have a defined set of rules, then we should be able to program those rules. So I ask the question again: are we missing to include something in the validation section? @deads2k as API reviewer, do you find the validation of the reason field as a major problem that would block this approach?

Feb 07 '24 20:02 ahg-g

My other main worry is that we will lose control over the Reason field

What about a new field in the Job status that is dedicated to the user-specified string, instead of putting it in the Condition?

Feb 07 '24 21:02 alculquicondor

My other main worry is that we will lose control over the Reason field

What about a new field in the Job status that is dedicated to the user-specified string, instead of putting it in the Condition?

I'm wondering if a new dedicated field will bring more complexity and confusion to the Job state. This may be similar to status.phase vs status.conditions in the pod status, but I understand that the new dedicated field doesn't bring complexity as much as the pod's status.phase.

Feb 07 '24 21:02 tenzen-y

My other main worry is that we will lose control over the Reason field

What about a new field in the Job status that is dedicated to the user-specified string, instead of putting it in the Condition?

And we default it Reason field if the reason field (currently named SetConditionReason but we would rename it to SetFailureReason) is not set in podFailurePolicy.

Feb 08 '24 01:02 ahg-g

My other main worry is that we will lose control over the Reason field

What about a new field in the Job status that is dedicated to the user-specified string, instead of putting it in the Condition?

This could work. Perhaps we could call the new JobStatus field JobFailureReason? It could also just be FailureReason, but since other fields like Failed,Succeeded etc are referring to pod counts, the more explicit name will make it clear we are talking about the Job failure, not anything related to its pod failures.

Feb 08 '24 01:02 danielvegamyhre

I ran into a similar problem while working on the Declarative Node Maintenance KEP. So far I am using a similar approach as described in this PR, but I would like to find an alternative as it is not really an elegant solution.

The real problem comes when you have multiple controllers setting the same condition and multiple clients observing it. Then there is no real consensus on what the API/value of reason really is and how it should be processed programmatically.

When you only have a user defined value as the reason, how should other actors/higher abstraction tools react to it? What is the contract?
If there is a prefix/separator, what should the prefix or separator be? And how does this get communicated to all the actors?
User defined values can also bring chaos into this. What if they encode some value in the reason for another party to process? This adds additional complexity to reasoning about the reason.
Although unlikely, we are opening a possibility of external controllers managing jobs with https://github.com/kubernetes/enhancements/issues/4368 which could also fragment the reason format.

We are constrained by the Condition API and I think the best solution would be to introduce an additional field (string/slice/map?). This would allow to encode additional metadata about the condition and ensure that the controllers and clients can easily communicate their intentions.

In my scenario I would like to have the following condition Type="EvacuationRequest", Reason="NodeMaintenance", Message="Upgrade to 1.29" and an actuator (or a list of actuators) that have triggered this condition.

What about a new field in the Job status that is dedicated to the user-specified string, instead of putting it in the Condition?

I think the metadata ultimately should be in the condition. A new field would add another layer of indirection. And it is not really scalable (different conditions, multiple reasons). A safer way would be to put that information into annotations.

I understand that enhancing the Condition API might be a big change and problematic/cause issues. But I would like to hear sig-api-machinery opinion on this (@deads2k).

Feb 08 '24 01:02 atiratree

Thanks @atiratree, Job has its own Condition API, it is not shared with Pod.

When you only have a user defined value as the reason, how should other actors/higher abstraction tools react to it? What is the contract?

The contract is defined in the PodFailurePolicy spec.

User defined values can also bring chaos into this. What if they encode some value in the reason for another party to process? This adds additional complexity to reasoning about the reason.

Can you be more concrete? what would that complexity be in the case of Job and failure reasons?

We are constrained by the Condition API and I think the best solution would be to introduce an additional field (string/slice/map?). This would allow to encode additional metadata about the condition and ensure that the controllers and clients can easily communicate their intentions.

A string is what is being proposed already, but the other option is to have a "DetailedReason" map that in our case we use to set an entry with the key "UserDefinedFailureReason".

Feb 08 '24 02:02 ahg-g

Job has its own Condition API, it is not shared with Pod.

It is not, but we try to adhere to the same standard/conventions: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties

We are constrained by the Condition API and I think the best solution would be to introduce an additional field (string/slice/map?). This would allow to encode additional metadata about the condition and ensure that the controllers and clients can easily communicate their intentions.

A string is what is being proposed already, but the other option is to have a "DetailedReason" map that in our case we use to set an entry with the key "UserDefinedFailureReason".

I meant it would be best to add it to the JobCondition, but this ties into the point above ^

The contract is defined in the PodFailurePolicy spec.

Yes, it is still a contract, but a looser one. As an entity observing just the conditions, you cannot predict what values will be there and how to react to them.

Can you be more concrete? what would that complexity be in the case of Job and failure reasons?

I cannot predict what kind of values users will encode there. A client has to know how to react to PodFailurePolicy, PodFailurePolicyBecauseXFailed, XFailed, XFailedButDoYInstead, XFailedButDoZInstead. One can imagine even longer examples..

^ What if there is a mistake. You can't easily validate this because it's looser. So far we have offered it as more tighter programmatic API.

Feb 08 '24 13:02 atiratree

I think there's a a lot of ideas floating around and disagreements, which I'd like to clarify first before pushing this topic forward. I understand that the sooner it gets out the better, but I'd prefer we discuss this more in depth maybe next Thursday (2/15) during wg-batch call, and work out the solution. Just to summarize, currently we have four potential solutions:

human provided Reason
machine provided Reason
new status field
additional condition with human provided information

Feb 08 '24 13:02 soltysh

I cannot predict what kind of values users will encode there. A client has to know how to react to PodFailurePolicy, PodFailurePolicyBecauseXFailed, XFailed, XFailedButDoYInstead, XFailedButDoZInstead. One can imagine even longer examples..

^ What if there is a mistake. You can't easily validate this because it's looser. So far we have offered it as more tighter programmatic API.

Mistakes are also possible with labels and selectors, but the general point is that in this case the user sets the failure policy (via pod failure policy) and so by definition the failure reason will not be predictable.

But if the argument is that reason has to be assigned from a previously-defined and finite set of values (because it must be predictable without needing to look at the podFailurePolicy spec), then I agree we should just focus our discussion on adding a new field because we can't satisfy the predictability requirement.

, but I'd prefer we discuss this more in depth maybe next Thursday (2/15) during wg-batch call

I am not sure we can get to a conclusion at wg-batch since this is also an API question and so we need an api approver to be engaged in the discussion.

human provided Reason

What this option is really about is giving a name or identifier to the podFailurePolicy rule, but it wouldn't satisfy the predictability "requirement".

machine provided Reason

To me the main issue with the machine provided reason is how to encode a set or range exit codes. In the above option we are practically giving the rule an identifier, so if we want the machine to identify the rule in the reason automatically, the only option I see is providing the rule number in the list (like PodFailurePolicy-Rule0 or PodFailurePolicy-Rule1 to indicate the first and the second rule respectively).

@soltysh since you are preferring to this option, how can we address @atiratree concern about the reason being predictable?

new status field

This seems the closest to address all concerns I heard above since we are not changing the Reason semantics. So I am supportive of this option.

additional condition with human provided information

I assume you are referring to adding new field to JobCondition type, if so then this needs wider consensus across the project since it needs to apply to all APIs as per [1], so we can't resolve this at wg-batch meeting. I am wondering how we want to approach this option?

[1] https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties

Feb 08 '24 14:02 ahg-g

@soltysh I'm getting back to work on this shortly, what is the deadline for this? We didn't make the 1.30 deadline but I don't see any information on the 1.31 deadlines

Apr 02 '24 17:04 danielvegamyhre

@soltysh I'm getting back to work on this shortly, what is the deadline for this? We didn't make the 1.30 deadline but I don't see any information on the 1.31 deadlines

IIRC, the current is still within the v1.30 cycle. So, after the v1.30 cycle, we should find the next timeline.

Apr 02 '24 17:04 tenzen-y

Lets work on this and get it merged early, we don't need to wait for the 1.31 timelines to be posted.

Apr 02 '24 19:04 ahg-g

@soltysh @ahg-g I revised the KEP based on our discussion in the WG Batch meeting.

Summary:

This KEP proposes to extend the Job API by adding an optional Name field to PodFailurePolicyRule. If unset, it would default to the index of the rule in the podFailurePolicy.rules slice.

When a pod failure policy rule triggers a Job failure, the rule name would be appended as a suffix to the JobFailed condition reason, in the format: PodFailurePolicy_{ruleName}.

Apr 09 '24 00:04 danielvegamyhre

/wg batch

Apr 10 '24 11:04 kannon92

/label tide/merge-method-squash

Jun 04 '24 16:06 soltysh

Clarifying the default is the last bit to get this approved.

Updated the defaulting section to specify there will be no defaulting, and if Name is unset the Job controller will use the rule index in the condition reason suffix instead of the name.

Jun 07 '24 16:06 danielvegamyhre

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danielvegamyhre, jpbetz, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~keps/prod-readiness/OWNERS~~ [jpbetz,soltysh]
~~keps/sig-apps/OWNERS~~ [soltysh]

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

Jun 11 '24 18:06 k8s-ci-robot

/hold cancel

Thanks a lot all for the great feedback, I think we got to a very good conclusion here! kudos to the community!

Jun 11 '24 18:06 ahg-g