kwok icon indicating copy to clipboard operation
kwok copied to clipboard

Add a pod stage file for failing an initContainer (WIP)

Open yuanchen8911 opened this issue 1 year ago • 6 comments

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds a pod stage that injects an error to initContainer specified by a label and annotations, including the initContainer's name and optional failure message, reason, exit code and delay parameters.

labels:
   pod-init-container-running-failed.stage.kwok.x-k8s.io: 'true'
annotations:
   pod-init-container-running-failed.stage.kwok.x-k8s.io/container-name: <failed initContainer name>
   // optional
   pod-init-container-running-failed.stage.kwok.x-k8s.io/reason: <failure reason, the default is "initContainerError">
   pod-init-container-running-failed.stage.kwok.x-k8s.io/message: <failure message, the default is "initContainer reported errors" > 
   pod-init-container-running-failed.stage.kwok.x-k8s.io/exit-code: <exit code, the default is 1>
   pod-init-container-running-failed.stage.kwok.x-k8s.io/delay: <delay in milliseconds> 
   pod-init-container-running-failed.stage.kwok.x-k8s.io/jitter-delay: <jitter delay in milliseconds> 

Which issue(s) this PR fixes:

Fixes # None

Special notes for your reviewer:

Questions:

  1. How to set the default values for durationFrom and jitterDurationFrom
  2. Do we need to set the statues of the other initContainers after the failed initContainer?

Does this PR introduce a user-facing change?

Support the failure of an initContainer in a pod.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


yuanchen8911 avatar Apr 03 '24 18:04 yuanchen8911

Deploy Preview for k8s-kwok canceled.

Name Link
Latest commit d7d3929564ce72cf7b89ce10516d37b502d5bb48
Latest deploy log https://app.netlify.com/sites/k8s-kwok/deploys/661d4a7404994a000853919b

netlify[bot] avatar Apr 03 '24 18:04 netlify[bot]

/cc @wzshiming

yuanchen8911 avatar Apr 03 '24 18:04 yuanchen8911

Thanks for the fixes, @wzshiming! All the changes were committed. I still have two questions.

  1. Would it be possible to use some default values if the duration/jitterDuration annotations are not specified?
  2. The statuses of those initContainers defined after the failed initContainers should NOT be terminated with Completed, right?

yuanchen8911 avatar Apr 08 '24 15:04 yuanchen8911

Would it be possible to use some default values if the duration/jitterDuration annotations are not specified?

Sure, set both durationMilliseconds and durationFrom.expressionFrom, the durationMilliseconds will like a default for it.

The statues of those initContainers after the failed initContainers should NOT be terminated with Completed, right?

Yes, the Completed reason is only for the 0 exit code

https://github.com/containerd/containerd/blob/406e9e84b4fb0a6bf2912f56ff1dfe78a56ed5ee/internal/cri/server/container_status.go#L89-L95

wzshiming avatar Apr 09 '24 11:04 wzshiming

Would it be possible to use some default values if the duration/jitterDuration annotations are not specified?

Sure, set both durationMilliseconds and durationFrom.expressionFrom, the durationMilliseconds will like a default for it.

Set both durationMilliseconds and jitterDurationMillisecons to 1000 by default (no jitter).

The statues of those initContainers after the failed initContainers should NOT be terminated with Completed, right?

Yes, the Completed reason is only for the 0 exit code

https://github.com/containerd/containerd/blob/406e9e84b4fb0a6bf2912f56ff1dfe78a56ed5ee/internal/cri/server/container_status.go#L89-L95

My question was whether we should distinguish an initContainers defined before the failed initContainers and one after the failed initContainer. The former's state is terminated while the latter's is waiting.

yuanchen8911 avatar Apr 09 '24 15:04 yuanchen8911

/lgtm

Thank you, I think it's a good start for chaos testing

Removed the WIP label. Can you approve it again? Thanks!

yuanchen8911 avatar Apr 15 '24 14:04 yuanchen8911

/retest

yuanchen8911 avatar Apr 15 '24 14:04 yuanchen8911

New changes are detected. LGTM label has been removed.

Fixed the linkyaml issues (trailing spaces). PTAL, @wzshiming. Thanks!

yuanchen8911 avatar Apr 15 '24 15:04 yuanchen8911

/approve /lgtm /label tide/merge-method-squash

wzshiming avatar Apr 16 '24 01:04 wzshiming

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wzshiming, yuanchen8911

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Apr 16 '24 01:04 k8s-ci-robot