
Revive prowjob when node is terminated (enabled by default)

Open · inteon opened this pull request 1 year ago • 34 comments

Recreate pods when we detect they failed because their node was shut down. This saves us from having to rerun jobs that were terminated due to a spot instance shutdown.

Current behavior

Currently, when a node is terminated, the prowjob fails (the PJ enters the FailureState).

New behavior

The pod is now deleted and recreated (revived) when the node is terminated.

If the pod is revived more times than the MaxRevivals config value allows, the prowjob errors (the PJ enters the ErrorState).

Additionally, the ErrorOnTermination option can be used to error a prowjob immediately when the pod is terminated (the PJ enters the ErrorState).
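
To make the revive/error decision above concrete, here is a minimal, self-contained sketch. Only the MaxRevivals and ErrorOnTermination names come from this description; the ProwJobStatus/RevivalConfig types and the handleNodeTermination helper are simplified stand-ins, not the actual Prow API.

```go
package main

import "fmt"

// Stand-in types; the real ones live in Prow's ProwJob API and plank packages.
type ProwJobState string

const (
	PendingState ProwJobState = "pending"
	ErrorState   ProwJobState = "error"
)

type ProwJobStatus struct {
	State       ProwJobState
	Description string
	Revivals    int
}

type RevivalConfig struct {
	MaxRevivals        int
	ErrorOnTermination bool
}

// handleNodeTermination applies the behavior described above: error out
// immediately if ErrorOnTermination is set, otherwise revive the pod until
// the MaxRevivals budget is exhausted, then error out.
func handleNodeTermination(status *ProwJobStatus, cfg RevivalConfig) {
	switch {
	case cfg.ErrorOnTermination:
		status.State = ErrorState
		status.Description = "pod was terminated with its node and ErrorOnTermination is set"
	case status.Revivals >= cfg.MaxRevivals:
		status.State = ErrorState
		status.Description = fmt.Sprintf("pod was revived %d times, reaching MaxRevivals", status.Revivals)
	default:
		// The controller would delete the pod and recreate it on its next
		// sync; here we only record that another revival was spent.
		status.Revivals++
		status.State = PendingState
	}
}

func main() {
	s := &ProwJobStatus{State: PendingState}
	handleNodeTermination(s, RevivalConfig{MaxRevivals: 3})
	fmt.Println(s.State, s.Revivals) // pending 1
}
```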

Detecting node termination

The pod statuses that we see on pods terminated due to a spot instance shutdown look like one of the following:

```yaml
status:
  message: Pod was terminated in response to imminent node shutdown.
  phase: Failed
  reason: Terminated
```

or

```yaml
status:
  conditions:
  - message: 'PodGC: node no longer exists'
    reason: DeletionByPodGC
    status: "True"
    type: DisruptionTarget
  phase: Failed
```

or

```yaml
status:
  conditions:
  - message: 'GCPControllerManager: node no longer exists'
    reason: DeletionByGCPControllerManager
    status: "True"
    type: DisruptionTarget
  phase: Failed
```

The TerminationConditionReasons option allows users to configure which pod condition reason values are used to detect that the node is being terminated (defaults to "DeletionByPodGC" and "DeletionByGCPControllerManager").
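
As a rough illustration of the matching (not the code in this PR), the sketch below checks a corev1 pod status against the examples above: a Failed pod with reason Terminated and a node-shutdown message, or a DisruptionTarget condition whose reason is in the configured list. The isNodeTerminated helper and the "node shutdown" substring check are assumptions for this sketch.

```go
package main

import (
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// Default reasons from the description; TerminationConditionReasons would
// let users override this list.
var defaultTerminationReasons = []string{
	"DeletionByPodGC",
	"DeletionByGCPControllerManager",
}

// isNodeTerminated reports whether a failed pod looks like it was taken
// down together with its node, based on the status shapes shown above.
func isNodeTerminated(pod *corev1.Pod, reasons []string) bool {
	if pod.Status.Phase != corev1.PodFailed {
		return false
	}
	if pod.Status.Reason == "Terminated" &&
		strings.Contains(pod.Status.Message, "node shutdown") {
		return true
	}
	for _, cond := range pod.Status.Conditions {
		if cond.Type != "DisruptionTarget" || cond.Status != corev1.ConditionTrue {
			continue
		}
		for _, reason := range reasons {
			if cond.Reason == reason {
				return true
			}
		}
	}
	return false
}

func main() {
	pod := &corev1.Pod{Status: corev1.PodStatus{
		Phase: corev1.PodFailed,
		Conditions: []corev1.PodCondition{{
			Type:   "DisruptionTarget",
			Status: corev1.ConditionTrue,
			Reason: "DeletionByPodGC",
		}},
	}}
	fmt.Println(isNodeTerminated(pod, defaultTerminationReasons)) // true
}
```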

inteon · Apr 18 '24 13:04

Deploy Preview for k8s-prow ready!

Latest commit: 9c6dd679110924df9b404d52674f3fcc6491057d
Latest deploy log: https://app.netlify.com/sites/k8s-prow/deploys/66c84689385eeb0008126004
Deploy Preview: https://deploy-preview-117--k8s-prow.netlify.app

netlify[bot] · Apr 18 '24 13:04

Hello @inteon AFAIK Plank was deprecated. Since this PR is adding code under 'pkg/plank/', I wonder whether:

  1. Adding code under 'pkg/plank/' is okay regardless of the deprecation of Plank; or
  2. There exists another method to achieve the same goal, adding code to directories other than 'pkg/plank/'.

Please take a look! 😃

jihoon-seo · Apr 18 '24 14:04

> Hello @inteon AFAIK Plank was deprecated. Since this PR is adding code under 'pkg/plank/', I wonder whether:
>
>   1. Adding code under 'pkg/plank/' is okay regardless of the deprecation of Plank; or
>   2. There exists another method to achieve the same goal, adding code to directories other than 'pkg/plank/'.
>
> Please take a look! 😃

The code still seems to be used in prow-controller-manager: https://github.com/kubernetes-sigs/prow/blob/0a3518154cf7d730ddc5ad7c6b978c480a6e5dc9/cmd/prow-controller-manager/main.go#L202-L204

inteon · Apr 18 '24 14:04

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: inteon. Once this PR has been reviewed and has the lgtm label, please assign petr-muller for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot · Apr 20 '24 11:04

/label tide/merge-method-squash

jihoon-seo · Apr 22 '24 01:04

Info: we use this change on our cert-manager prow cluster https://prow.infra.cert-manager.io/ (see https://github.com/cert-manager/testing/tree/0b5dfa456f691e849bb0b3c40f3e00bd8d607127/images/prow-controller-manager-spot). It works very well, this change has removed a lot of flaky failures due to spot instance shutdowns!

inteon · May 07 '24 09:05

> Info: we use this change on our cert-manager prow cluster https://prow.infra.cert-manager.io/ (see https://github.com/cert-manager/testing/tree/0b5dfa456f691e849bb0b3c40f3e00bd8d607127/images/prow-controller-manager-spot). It works very well, this change has removed a lot of flaky failures due to spot instance shutdowns!

This change seems like a nice feature, but I am not sure whether you are covering all of the generic pod-termination cases, not just the ones related to GCP. Also, this issue can be resolved by adding a pod disruption budget to your cluster.

In my opinion, we shouldn't enable this feature by default. Instead, we can configure specific reasons for recreating the prowjobs, which will allow users to be more specific about their infrastructure.

droslean · May 07 '24 09:05

I am not confident about enabling this feature by default. The implementation covers only a GCP case, and a pod's termination is deeply dependent on the infrastructure. This means that, in many cases, a prowjob could run in a loop forever. Perhaps keep it disabled by default and allow the user to enable it in the global config? Also, to avoid infinite runs, it's perhaps better to keep track of the number of retries and allow the user to control the threshold.

@petr-muller @stevekuznetsov @cgwalters @BenTheElder Do you guys have any input?

droslean · Jun 14 '24 13:06

> I am not confident about enabling this feature by default. The implementation covers only a GCP case, and a pod's termination is deeply dependent on the infrastructure. This means that, in many cases, a prowjob could run in a loop forever. Perhaps keep it disabled by default and allow the user to enable it in the global config? Also, to avoid infinite runs, it's perhaps better to keep track of the number of retries and allow the user to control the threshold.
>
> @petr-muller @stevekuznetsov @cgwalters @BenTheElder Do you guys have any input?

I updated the PR and added a RetryCount, which is now incremented every time the pod is re-created (it also counts other retries that were already present in the code). There will be a hard failure after 3 retries have been reached (we might want to make this configurable in the future).

inteon · Jun 14 '24 15:06

~~Blocked by https://github.com/kubernetes-sigs/prow/pull/196 since the CRD is too large for my setup.~~

inteon · Jul 01 '24 13:07

@droslean I don't think PDB would help here since you don't get a choice as to when spot instances get taken away from you.

stevekuznetsov · Jul 01 '24 14:07

> @droslean I don't think PDB would help here since you don't get a choice as to when spot instances get taken away from you.

Yep. My only concern is whether we should allow this feature to be enabled by default. Since Prow doesn't directly know what the job will do, there can be cases where the job costs will triple if we allow this by default. I would prefer to make this configurable and let the user decide based on the infrastructure.

droslean · Jul 01 '24 14:07

> @droslean I don't think PDB would help here since you don't get a choice as to when spot instances get taken away from you.
>
> Yep. My only concern is whether we should allow this feature to be enabled by default. Since Prow doesn't directly know what the job will do, there can be cases where the job costs will triple if we allow this by default. I would prefer to make this configurable and let the user decide based on the infrastructure.

I discovered there is already a lot of logic to restart pods (with no limit): Eviction, NodeUnreachable, and a PodUnknown phase. So, I added a global limit of 3 and applied it to all restarts, including the existing logic. I haven't (yet) disabled the restarts for spot instance shutdowns by default, because I think the retryCount limit is a better solution.
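
For illustration only, a tiny sketch of what one shared retry budget across all of those restart paths could look like; the restartReason values and the shouldRestart helper are made-up names for this sketch, not the PR's actual identifiers.

```go
package main

import "fmt"

// restartReason enumerates the restart-triggering cases mentioned above.
type restartReason string

const (
	reasonEvicted         restartReason = "Evicted"
	reasonNodeUnreachable restartReason = "NodeUnreachable"
	reasonPodUnknown      restartReason = "PodUnknown"
	reasonNodeTerminated  restartReason = "NodeTerminated"
)

// maxRetries is the hard limit mentioned above; it could later be made configurable.
const maxRetries = 3

// shouldRestart applies one shared retry budget to every restart path
// instead of letting each case loop without a limit.
func shouldRestart(reason restartReason, retryCount int) (bool, string) {
	if retryCount >= maxRetries {
		return false, fmt.Sprintf("pod restarted %d times (last reason: %s); giving up", retryCount, reason)
	}
	return true, ""
}

func main() {
	ok, msg := shouldRestart(reasonEvicted, 3)
	fmt.Println(ok, msg) // false pod restarted 3 times (last reason: Evicted); giving up
}
```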

inteon · Jul 01 '24 14:07

> @droslean I don't think PDB would help here since you don't get a choice as to when spot instances get taken away from you.
>
> Yep. My only concern is whether we should allow this feature to be enabled by default. Since Prow doesn't directly know what the job will do, there can be cases where the job costs will triple if we allow this by default. I would prefer to make this configurable and let the user decide based on the infrastructure.
>
> I discovered there is already a lot of logic to restart pods (with no limit): Eviction, NodeUnreachable, and a PodUnknown phase. So, I added a global limit of 3 and applied it to all restarts, including the existing logic. I haven't (yet) disabled the restarts for spot instance shutdowns by default, because I think the retryCount limit is a better solution.

I would prefer to let the user decide the list of reasons for restarting the prowjob and the retry count limit. For example, in my infrastructure we run costly jobs, and this feature can potentially increase the cost since it reruns them by default for those specific reasons. Your solution is good, but I would prefer to make it configurable so the user won't be limited to hardcoded termination reasons and retry limits. @stevekuznetsov WDYT?

droslean · Jul 01 '24 15:07

Getting the config semantics right might be hard, but I'm for it.

stevekuznetsov · Jul 01 '24 16:07

Yeah, I share some concerns about the restart implications.

For example, with 5k node scale tests, we may prefer to simply take the failure and leave boskos to clean up rather than attempt another run, and yet with the many, many CI jobs we have it would be difficult to properly identify and opt out all of the relevant jobs.

cc @aojea @pohly

Also, even as a GCP employee, I think we should prefer to use portable Kubernetes, but I guess this is at least somewhat more portable now ... do any of the other major vendors with spot instances set a similar status that can be matched or do we need a different mechanism entirely?

What's the use case where the next periodic attempt and/or /retest is not sufficient? Are you using the automatic /retest commenter? It's clunky but it has done the job ~OK.

BenTheElder · Jul 01 '24 21:07

I suspect for most jobs this is better, if bounded, but it's still a surprising behavior change and there's limited bandwidth to go comb through jobs and opt them in/out.

For Kubernetes we could probably opt out anything referencing the boskos scalability pools and be in OK shape.

I don't think we want another best-practice field that everyone has to opt into either, though ... (like decorate: true, which ideally shouldn't be needed eventually).

What if we have a global option to enable this, in addition to a per-job opt-out? We could wait to turn this on until we've opted out any jobs with cost concerns.

I'm also not sure that having a single retryCount is the best approach, but I haven't put much thought into it yet. E.g., failure to schedule is pretty distinct from the other modes (though I think we already handle that separately?).

BenTheElder · Jul 01 '24 21:07