Revive prowjob when node is terminated (enabled by default)
Recreate pods when we detect that they failed because their node was shutting down. This saves us from having to rerun jobs that were terminated by a spot instance shutdown.
### Current behavior
Currently, when a node is terminated, the prowjob fails (the PJ enters the FailureState).
### New behavior
The prowjob is now deleted and recreated (revived) when the node is terminated.
If the pod is revived more times than the MaxRevivals config value allows, the prowjob errors (the PJ enters the ErrorState).
Additionally, the ErrorOnTermination option can be used to error a prowjob immediately when its pod is terminated (the PJ enters the ErrorState).
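For illustration, here is a rough sketch of how this could look in Prow's config.yaml. The exact key names and where they live (e.g. max_revivals under plank, error_on_termination on a job) are assumptions made for this example, not something this description pins down:

```yaml
# Sketch only: key names and placement are assumed, not confirmed here.
plank:
  # Assumed snake_case key for MaxRevivals: how many times a pod may be
  # revived before the prowjob is moved to ErrorState instead.
  max_revivals: 3

periodics:
- name: example-spot-job               # hypothetical job name
  spec:
    containers:
    - image: example.test/ci:latest
      command: ["make", "test"]
  # Assumed per-job key for ErrorOnTermination: fail immediately instead of reviving.
  error_on_termination: true
```

The intent of the split is that a global revival limit bounds retries for every job, while a per-job ErrorOnTermination lets costly jobs fail fast instead of being rerun.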
### Detecting node termination
The pod status that we see on pods terminated due to a spot instance shutdown looks like one of the following:
```yaml
status:
  message: Pod was terminated in response to imminent node shutdown.
  phase: Failed
  reason: Terminated
```

or

```yaml
status:
  conditions:
  - message: 'PodGC: node no longer exists'
    reason: DeletionByPodGC
    status: "True"
    type: DisruptionTarget
  phase: Failed
```

or

```yaml
status:
  conditions:
  - message: 'GCPControllerManager: node no longer exists'
    reason: DeletionByGCPControllerManager
    status: "True"
    type: DisruptionTarget
  phase: Failed
```
The TerminationConditionReasons option allows users to modify which pod condition reason values are used to detect that a node is being terminated (defaults to "DeletionByPodGC" and "DeletionByGCPControllerManager").
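Overriding the defaults could look roughly like this; the snake_case key name (termination_condition_reasons) and its placement under plank are assumptions for illustration:

```yaml
# Sketch only: the key name and its placement under plank are assumed.
plank:
  termination_condition_reasons:
  - DeletionByPodGC                  # default
  - DeletionByGCPControllerManager   # default
  - TerminationByKubelet             # example extra reason (graceful node shutdown)
```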
Hello @inteon, AFAIK Plank was deprecated. Since this PR is adding code under 'pkg/plank/', I wonder whether:
- adding code under 'pkg/plank/' is okay despite the deprecation of Plank; or
- there is another way to achieve the same goal by adding code to directories other than 'pkg/plank/'.
Please take a look! 😃
The code still seems to be used in prow-controller-manager: https://github.com/kubernetes-sigs/prow/blob/0a3518154cf7d730ddc5ad7c6b978c480a6e5dc9/cmd/prow-controller-manager/main.go#L202-L204
/label tide/merge-method-squash
Info: we use this change on our cert-manager prow cluster https://prow.infra.cert-manager.io/ (see https://github.com/cert-manager/testing/tree/0b5dfa456f691e849bb0b3c40f3e00bd8d607127/images/prow-controller-manager-spot). It works very well; this change has removed a lot of flaky failures due to spot instance shutdowns!
This change seems like a nice feature, but I am not sure you are covering the generic pod-termination cases that are not specific to GCP. Also, this issue can be resolved by adding a pod disruption budget to your cluster.
In my opinion, we shouldn't enable this feature by default. Instead, we can let users configure the specific reasons for recreating the prowjobs, which allows them to be more specific about their infrastructure.
I am not confident about enabling this feature by default. The implementation covers only a GCP case, and a pod's termination is deeply dependent on the infrastructure. This means that a prowjob can loop forever in many cases. Perhaps keep it disabled by default and allow the user to enable it in the global config? Also, to avoid infinite runs, perhaps it's better to keep track of the number of retries and allow the user to control the threshold.
@petr-muller @stevekuznetsov @cgwalters @BenTheElder Do you guys have any input?
I updated the PR and added a RetryCount, which is incremented every time the pod is re-created (it also counts the other retries that were already present in the code). There will be a hard failure once 3 retries have been reached (we might want to make this configurable in the future).
~~Blocked by https://github.com/kubernetes-sigs/prow/pull/196 since CRD is too large for my setup.~~
@droslean I don't think a PDB would help here, since you don't get a choice as to when spot instances get taken away from you.
Yep. My only concern is whether we should enable this feature by default. Since Prow doesn't directly know what a job will do, there can be cases where job costs will triple if we allow this by default. I would prefer to make this configurable and let the user decide based on their infrastructure.
I discovered there is already a lot of logic to restart pods (with no limit): eviction, NodeUnreachable, and the PodUnknown phase. So I added a global limit of 3 and applied it to all restarts, including the existing logic. I haven't (yet) made the restarts for spot instance shutdowns disabled by default, because I think the retryCount limit is a better solution.
I would prefer to let the user decide the list of reasons for restarting the prowjob and the retry count limit. For example, in my infrastructure we run costly jobs, and this feature can potentially increase the cost since it reruns them by default for those specific reasons. Your solution is good, but I would prefer to make it configurable so the user isn't limited to hardcoded termination reasons and retry limits. @stevekuznetsov WDYT?
Getting the config semantics right might be hard, but I'm for it.
Yeah, I share some concerns about the restart implications.
For example, with 5k-node scale tests we may prefer to simply take the failure and let boskos clean up rather than attempt another run, and yet with the many CI jobs we have it would be difficult to properly identify and opt out all of the relevant jobs.
cc @aojea @pohly
Also, even as a GCP employee, I think we should prefer to use portable Kubernetes, but I guess this is at least somewhat more portable now ... do any of the other major vendors with spot instances set a similar status that can be matched or do we need a different mechanism entirely?
What's the use case where the next periodic attempt and/or /retest is not sufficient?
Are you using the automatic /retest commenter? It's clunky but it has done the job ~OK.
I suspect for most jobs this is better, if bounded, but it's still a surprising behavior change and there's limited bandwidth to go comb through jobs and opt them in/out.
For Kubernetes we could probably opt out anything referencing the boskos scalability pools and be in OK shape.
I don't think we want yet another best-practice field that everyone has to opt in to, though (like decorate: true, which ideally shouldn't be needed eventually).
What if we had a global option to enable this, in addition to a per-job opt-out? We could wait to turn it on until we've opted out any jobs with cost concerns.
I'm also not sure that having a single retryCount is the best approach, but I haven't put much thought into it yet. E.g. failure to schedule is pretty distinct from the other modes (though I think we already handle that separately?).
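For the sake of discussion, a purely hypothetical sketch of the "global enable plus per-job opt-out" idea floated above; none of these keys exist today, they are invented names to make the proposal concrete:

```yaml
# Hypothetical only: these keys are invented names to make the proposal concrete.
plank:
  revive_on_node_termination: true     # invented global default
  max_revivals: 3

presubmits:
  example-org/example-repo:
  - name: pull-example-5k-node-scale   # hypothetical costly job
    revive_on_node_termination: false  # invented per-job opt-out
    spec:
      containers:
      - image: example.test/scale-tests:latest
```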