KEP-4443: Configurable Job failure reason for PodFailurePolicyRule
- One-line PR description: Configurable Job failure reason for PodFailurePolicy
- Issue link: #4443
- Other comments:
@soltysh for sig-apps lead review; tagging @ahg-g @alculquicondor @kannon92 for review as well.
@deads2k would you be able to do an API review for this? Here we are proposing to add an optional Reason field to PodFailurePolicyRule in the Job API, similar to the Reason field I proposed in the Job success policy KEP.
@wojtek-t would you mind doing a PRR review for this KEP?
@alculquicondor @ahg-g wdyt about going to beta for this feature? It doesn't seem too risky (as we aren't changing the behavior of PodFailurePolicy). It may require the PRR to be more detailed, so we could alpha in 1.30 and then take it to beta shortly afterwards.
wdyt about going to beta for this feature?
We can't. A new API field needs to go through alpha.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: danielvegamyhre Once this PR has been reviewed and has the lgtm label, please ask for approval from wojtek-t and additionally assign soltysh for approval. For more information see the Kubernetes Code Review Process.
One thing I recommend is removing the comments as it is noise for the reviewer.
You should leave the PRR ones around but once you finish the section I'd drop the comments.
Are you referring to comments on the PR, or commented out lines in the README itself?
Just the comments that come up with the template.
It would be nice to have a partial struct and some godoc to inspect the API change. At first blush I'm expecting something like:
```go
type PodFailurePolicyRule struct {
	Action          PodFailurePolicyAction                   `json:"action" protobuf:"bytes,1,req,name=action"`
	OnExitCodes     *PodFailurePolicyOnExitCodesRequirement  `json:"onExitCodes" protobuf:"bytes,2,opt,name=onExitCodes"`
	OnPodConditions []PodFailurePolicyOnPodConditionsPattern `json:"onPodConditions" protobuf:"bytes,3,opt,name=onPodConditions"`

	// if unset, will be defaulted
	FailJob *PodFailureFailJobOptions
}

type PodFailureFailJobOptions struct {
	// If set, the value will be used as the reason for the .status.conditions[type="Failed"] condition.
	// If unset, the reason will default to "PodFailurePolicy".
	JobFailedReason *string
}
```
to roughly match a discriminated union, and as I understand it the reason is only valid when the action is FailJob.
If that's what you're expecting, then I think the API will be ok.
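For illustration only, a minimal, self-contained sketch of the behavior implied by that shape (the stand-in types mirror the sketch above and the "OutOfMemory" value is hypothetical; this is not the real batch/v1 package):

```go
package main

import "fmt"

// Stand-in types mirroring the sketch above; not the real batch/v1 API.
type PodFailureFailJobOptions struct {
	JobFailedReason *string // if unset, defaults to "PodFailurePolicy"
}

type PodFailurePolicyRule struct {
	Action  string
	FailJob *PodFailureFailJobOptions
}

// conditionReason returns the reason the Job controller would set on the
// Failed condition when this rule matches a failed pod.
func conditionReason(rule PodFailurePolicyRule) string {
	if rule.FailJob != nil && rule.FailJob.JobFailedReason != nil {
		return *rule.FailJob.JobFailedReason
	}
	return "PodFailurePolicy" // today's fixed reason, kept as the default
}

func main() {
	custom := "OutOfMemory"
	fmt.Println(conditionReason(PodFailurePolicyRule{Action: "FailJob", FailJob: &PodFailureFailJobOptions{JobFailedReason: &custom}})) // OutOfMemory
	fmt.Println(conditionReason(PodFailurePolicyRule{Action: "FailJob"}))                                                               // PodFailurePolicy
}
```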
Similarly to my comment in the job success/completion policy KEP, I'm reluctant to open up the Reason field in the conditions for broader usage. The several concerns laid out in the Validations section (and potentially others not yet identified), and the fact that we want users to rely on these reasons, tell me we should prefer defining a closed set of them. I read through the alternatives section and I'd be inclined to at least try that kind of approach, since based on my understanding of the provided use cases it should fulfill them. Only if the first alpha testing phase concludes that the simpler approach is not sufficient would I be willing to re-consider the proposed approach for a second alpha.
Overall, allowing the user to set the reason explicitly makes the YAML of the higher-order API clearer, instead of having to look at the docs and try to figure out what the reason value looks like. What specific validation concerns do you have that are not covered by the validation section detailed in this KEP?
I am ok with the alternative too, but I prefer to have an explicit reason field as it gives more flexibility and offers a clearer API for APIs that build on top of Job.
I do recognize the value of having both the explicit job outcome reason and the further actions in the JobSet - it's undoubtedly much more readable in that form. But my worry is about how other higher-level consumers can consume it. I think Kevin was asking about something similar in his comment about other use cases. My other main worry is that we will lose control over the Reason field, and based on my past experience we'll end up patching the validation around it, which is harder once validation has been loosened. That's why I'm leaning towards a closed set of values instead.
But my worry is about how other higher-level consumers can consume it. I think Kevin was asking about something similar in his comment about other use cases.
As Aldo mentioned, the analogy here is labels: what one sets as the value is what they have to use in the selector, and we are giving a concrete example of that in JobSet (see the sketch below). The general case is that the controller creating the Job has control over the field (either by embedding the JobSpec in the higher-level API, or even if the Job is being created indirectly); would including the above in the motivation clarify the point?
My other main worry is that we will lose control over the Reason field, and based on my past experience we'll end up patching the validation around it, which is harder once validation has been loosened. That's why I'm leaning towards a closed set of values instead.
How hard can it be to validate the field? If we have a defined set of rules, then we should be able to program those rules. So I ask the question again: are we missing something in the validation section? @deads2k as API reviewer, do you find the validation of the reason field to be a major problem that would block this approach?
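To make the labels analogy above concrete, here is the sketch referenced in that comment: a hedged, self-contained example (the "RestartJobSet" value, the stand-in Job object, and the helper name are hypothetical) where the controller that creates the Job picks the reason value and later selects on that same value in the Job's Failed condition.

```go
package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// failedWithReason reports whether the Job has a Failed condition whose
// Reason matches the value the creating controller put into its pod
// failure policy rule. The reason acts like a label: whatever the creator
// sets is what it must select on later.
func failedWithReason(job *batchv1.Job, reason string) bool {
	for _, c := range job.Status.Conditions {
		if c.Type == batchv1.JobFailed && c.Status == corev1.ConditionTrue && c.Reason == reason {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical reason value the higher-level controller (e.g. JobSet)
	// configured in the pod failure policy rule when it created the Job.
	const configuredReason = "RestartJobSet"

	job := &batchv1.Job{Status: batchv1.JobStatus{Conditions: []batchv1.JobCondition{
		{Type: batchv1.JobFailed, Status: corev1.ConditionTrue, Reason: configuredReason},
	}}}

	fmt.Println(failedWithReason(job, configuredReason)) // true
}
```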
My other main worry is that we will lose control over the Reason field
What about a new field in the Job status that is dedicated to the user-specified string, instead of putting it in the Condition?
My other main worry is that we will lose control over the Reason field
What about a new field in the Job status that is dedicated to the user-specified string, instead of putting it in the Condition?
I'm wondering if a new dedicated field would bring more complexity and confusion to the Job state.
This may be similar to status.phase vs status.conditions in the pod status, but I understand that the new dedicated field wouldn't bring as much complexity as the pod's status.phase does.
My other main worry is that we will lose control over the Reason field
What about a new field in the Job status that is dedicated to the user-specified string, instead of putting it in the Condition?
And we would default the Reason field if the reason field (currently named SetConditionReason, but we would rename it to SetFailureReason) is not set in the podFailurePolicy.
My other main worry is that we will lose control over the Reason field
What about a new field in the Job status that is dedicated to the user-specified string, instead of putting it in the Condition?
This could work. Perhaps we could call the new JobStatus field JobFailureReason? It could also just be FailureReason, but since other fields like Failed, Succeeded, etc. refer to pod counts, the more explicit name will make it clear we are talking about the Job failure, not anything related to its pod failures.
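For illustration, a hedged sketch of what that dedicated status field could look like (the field name comes from the suggestion above; this option was only under discussion, not the final API):

```go
// Hypothetical addition to the Job API's JobStatus, sketching the
// "dedicated status field" option; field name and placement are illustrative.
type JobStatus struct {
	// ...existing fields such as Conditions, Succeeded, Failed...

	// JobFailureReason would carry the user-specified string from the matching
	// podFailurePolicy rule, while the Failed condition's Reason stays
	// "PodFailurePolicy", keeping the condition reason a closed set.
	// +optional
	JobFailureReason *string `json:"jobFailureReason,omitempty"`
}
```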
I ran into a similar problem while working on the Declarative Node Maintenance KEP. So far I am using a similar approach as described in this PR, but I would like to find an alternative as it is not really an elegant solution.
The real problem comes when you have multiple controllers setting the same condition and multiple clients observing it. Then there is no real consensus on what the API/value of reason really is and how it should be processed programmatically.
- When you only have a user defined value as the reason, how should other actors/higher abstraction tools react to it? What is the contract?
- If there is a prefix/separator, what should the prefix or separator be? And how does this get communicated to all the actors?
- User defined values can also bring chaos into this. What if they encode some value in the reason for another party to process? This adds additional complexity to reasoning about the reason.
- Although unlikely, we are opening up the possibility of external controllers managing Jobs with https://github.com/kubernetes/enhancements/issues/4368, which could also fragment the reason format.
We are constrained by the Condition API and I think the best solution would be to introduce an additional field (string/slice/map?). This would allow us to encode additional metadata about the condition and ensure that the controllers and clients can easily communicate their intentions.
In my scenario I would like to have the following condition: Type="EvacuationRequest", Reason="NodeMaintenance", Message="Upgrade to 1.29", plus an actuator (or a list of actuators) that triggered this condition.
What about a new field in the Job status that is dedicated to the user-specified string, instead of putting it in the Condition?
I think the metadata ultimately should be in the condition. A new field would add another layer of indirection. And it is not really scalable (different conditions, multiple reasons). A safer way would be to put that information into annotations.
I understand that enhancing the Condition API might be a big change and could be problematic/cause issues. But I would like to hear sig-api-machinery's opinion on this (@deads2k).
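Purely for illustration, a hedged sketch of the kind of condition extension being floated above (a hypothetical Details field; as noted, changing the shared condition conventions would need wider API review):

```go
// Hypothetical extension of JobCondition; not a proposal that was accepted.
type JobCondition struct {
	// ...existing fields: Type, Status, Reason, Message, timestamps...

	// Details would carry structured, machine-readable metadata about the
	// condition, e.g. a user-defined failure reason or the actuators that
	// triggered it, without loosening validation of the Reason field itself.
	// +optional
	Details map[string]string `json:"details,omitempty"`
}
```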
Thanks @atiratree, Job has its own Condition API, it is not shared with Pod.
When you only have a user defined value as the reason, how should other actors/higher abstraction tools react to it? What is the contract?
The contract is defined in the PodFailurePolicy spec.
User defined values can also bring chaos into this. What if they encode some value in the reason for another party to process? This adds additional complexity to reasoning about the reason.
Can you be more concrete? What would that complexity be in the case of Job and failure reasons?
We are constrained by the Condition API and I think the best solution would be to introduce an additional field (string/slice/map?). This would allow us to encode additional metadata about the condition and ensure that the controllers and clients can easily communicate their intentions.
A string is what is being proposed already, but the other option is to have a "DetailedReason" map that in our case we use to set an entry with the key "UserDefinedFailureReason".
Job has its own Condition API, it is not shared with Pod.
It is not, but we try to adhere to the same standard/conventions: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties
We are constrained by the Condition API and I think the best solution would be to introduce an additional field (string/slice/map?). This would allow us to encode additional metadata about the condition and ensure that the controllers and clients can easily communicate their intentions.
A string is what is being proposed already, but the other option is to have a "DetailedReason" map that in our case we use to set an entry with the key "UserDefinedFailureReason".
I meant it would be best to add it to the JobCondition, but this ties into the point above ^
The contract is defined in the PodFailurePolicy spec.
Yes, it is still a contract, but a looser one. As an entity observing just the conditions, you cannot predict what values will be there and how to react to them.
Can you be more concrete? What would that complexity be in the case of Job and failure reasons?
I cannot predict what kind of values users will encode there. A client has to know how to react to PodFailurePolicy, PodFailurePolicyBecauseXFailed, XFailed, XFailedButDoYInstead, XFailedButDoZInstead. One can imagine even longer examples...
^ What if there is a mistake? You can't easily validate this because it's looser. So far we have offered it as a tighter programmatic API.
I think there are a lot of ideas floating around and disagreements, which I'd like to clarify first before pushing this topic forward. I understand that the sooner it gets out the better, but I'd prefer we discuss this more in depth, maybe next Thursday (2/15) during the wg-batch call, and work out the solution. Just to summarize, currently we have four potential solutions:
- human provided Reason
- machine provided Reason
- new status field
- additional condition with human provided information
I cannot predict what kind of values users will encode there. A client has to know how to react to PodFailurePolicy, PodFailurePolicyBecauseXFailed, XFailed, XFailedButDoYInstead, XFailedButDoZInstead. One can imagine even longer examples...
^ What if there is a mistake? You can't easily validate this because it's looser. So far we have offered it as a tighter programmatic API.
Mistakes are also possible with labels and selectors, but the general point is that in this case the user sets the failure policy (via pod failure policy) and so by definition the failure reason will not be predictable.
But if the argument is that reason has to be assigned from a previously-defined and finite set of values (because it must be predictable without needing to look at the podFailurePolicy spec), then I agree we should just focus our discussion on adding a new field because we can't satisfy the predictability requirement.
but I'd prefer we discuss this more in depth, maybe next Thursday (2/15) during the wg-batch call
I am not sure we can get to a conclusion at wg-batch, since this is also an API question and so we need an API approver to be engaged in the discussion.
human provided Reason
What this option is really about is giving a name or identifier to the podFailurePolicy rule, but it wouldn't satisfy the predictability "requirement".
machine provided Reason
To me the main issue with the machine-provided reason is how to encode a set or range of exit codes. In the above option we are practically giving the rule an identifier, so if we want the machine to identify the rule in the reason automatically, the only option I see is providing the rule number in the list (like PodFailurePolicy-Rule0 or PodFailurePolicy-Rule1 to indicate the first and the second rule, respectively).
@soltysh since you prefer this option, how can we address @atiratree's concern about the reason being predictable?
new status field
This seems the closest to addressing all the concerns I heard above, since we are not changing the Reason semantics, so I am supportive of this option.
additional condition with human provided information
I assume you are referring to adding a new field to the JobCondition type; if so, this needs wider consensus across the project since it would need to apply to all APIs as per [1], so we can't resolve it at the wg-batch meeting. I am wondering how we want to approach this option?
[1] https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties
@soltysh I'm getting back to work on this shortly, what is the deadline for this? We didn't make the 1.30 deadline but I don't see any information on the 1.31 deadlines
IIRC, we are currently still within the v1.30 cycle, so we should find out the next timeline after the v1.30 cycle ends.
Let's work on this and get it merged early; we don't need to wait for the 1.31 timelines to be posted.
@soltysh @ahg-g I revised the KEP based on our discussion in the WG Batch meeting.
Summary:
This KEP proposes to extend the Job API by adding an optional Name field to PodFailurePolicyRule. If unset, it would default to the index of the rule in the podFailurePolicy.rules slice.
When a pod failure policy rule triggers a Job failure, the rule name would be appended as a suffix to the JobFailed condition reason, in the format: PodFailurePolicy_{ruleName}.
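A minimal sketch of how the Job controller could build that reason from the matched rule (the helper name and the "OOMKilled" rule name are hypothetical; the index fallback matches the defaulting note further below):

```go
package main

import "fmt"

// Hypothetical rule shape for this sketch; only the proposed Name field matters here.
type PodFailurePolicyRule struct {
	Name string // optional; identifies the rule in the Failed condition reason
}

// failedConditionReason builds the reason for the Job's Failed condition when
// the rule at index idx triggered FailJob, e.g. "PodFailurePolicy_OOMKilled",
// or "PodFailurePolicy_1" when the rule has no name.
func failedConditionReason(rule PodFailurePolicyRule, idx int) string {
	if rule.Name != "" {
		return fmt.Sprintf("PodFailurePolicy_%s", rule.Name)
	}
	return fmt.Sprintf("PodFailurePolicy_%d", idx)
}

func main() {
	fmt.Println(failedConditionReason(PodFailurePolicyRule{Name: "OOMKilled"}, 0)) // PodFailurePolicy_OOMKilled
	fmt.Println(failedConditionReason(PodFailurePolicyRule{}, 1))                  // PodFailurePolicy_1
}
```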
/wg batch
/label tide/merge-method-squash
Clarifying the default is the last bit to get this approved.
Updated the defaulting section to specify there will be no defaulting, and if Name is unset the Job controller will use the rule index in the condition reason suffix instead of the name.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: danielvegamyhre, jpbetz, soltysh
- ~~keps/prod-readiness/OWNERS~~ [jpbetz,soltysh]
- ~~keps/sig-apps/OWNERS~~ [soltysh]
/hold cancel
Thanks a lot, everyone, for the great feedback. I think we got to a very good conclusion here! Kudos to the community!