Litmus Chaos Tests not running on K8s v1.27
What happened: LitmusChaos tests not running properly on Kubernetes v1.27
What you expected to happen: LitmusChaos tests should run properly on Kubernetes v1.27
Where can this issue be corrected? (optional)
The issue is probably in the source code of litmuschaos/go-runner:2.14.0
How to reproduce it (as minimally and precisely as possible): Note: Followed the instructions as per https://litmuschaos.github.io/litmus/experiments/categories/pods/pod-cpu-hog/.
Deploy the Litmus operator v2.14.0:
kubectl create -f https://litmuschaos.github.io/litmus/litmus-operator-v2.14.0.yaml
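To rule out installation problems, it may help to confirm that the operator pod is running and the Litmus CRDs are registered before proceeding (these verification commands are an addition, not part of the original report):
kubectl get pods -n litmus
kubectl get crds | grep chaos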
Deploy the ChaosExperiment below:
apiVersion: litmuschaos.io/v1alpha1
description:
  message: |
    Injects cpu consumption on pods belonging to an app deployment
kind: ChaosExperiment
metadata:
  labels:
    app.kubernetes.io/component: chaosexperiment
    app.kubernetes.io/part-of: litmus
    app.kubernetes.io/version: 2.14.0
    name: pod-cpu-hog
  name: pod-cpu-hog
  namespace: default
spec:
  definition:
    args:
    - -c
    - ./experiments -name pod-cpu-hog
    command:
    - /bin/bash
    env:
    - name: TOTAL_CHAOS_DURATION
      value: "60"
    - name: CHAOS_INTERVAL
      value: "10"
    - name: CPU_CORES
      value: "1"
    - name: CPU_LOAD
      value: "100"
    - name: PODS_AFFECTED_PERC
      value: ""
    - name: RAMP_TIME
      value: ""
    - name: LIB
      value: litmus
    - name: LIB_IMAGE
      value: litmuschaos/go-runner:2.14.0
    - name: SOCKET_PATH
      value: /var/run/docker.sock
    - name: LIB_IMAGE_PULL_POLICY
      value: IfNotPresent
    - name: TARGET_PODS
      value: ""
    - name: NODE_LABEL
      value: ""
    - name: SEQUENCE
      value: parallel
    image: litmuschaos/go-runner:2.14.0
    imagePullPolicy: IfNotPresent
    labels:
      app.kubernetes.io/component: experiment-job
      app.kubernetes.io/part-of: litmus
      app.kubernetes.io/version: 2.14.0
      name: pod-cpu-hog
    permissions:
    - apiGroups:
      - ""
      resources:
      - pods
      verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - deletecollection
    - apiGroups:
      - ""
      resources:
      - events
      verbs:
      - create
      - get
      - list
      - patch
      - update
    - apiGroups:
      - ""
      resources:
      - configmaps
      verbs:
      - get
      - list
    - apiGroups:
      - ""
      resources:
      - pods/log
      verbs:
      - get
      - list
      - watch
    - apiGroups:
      - ""
      resources:
      - pods/exec
      verbs:
      - get
      - list
      - create
    - apiGroups:
      - apps
      resources:
      - deployments
      - statefulsets
      - replicasets
      - daemonsets
      verbs:
      - list
      - get
    - apiGroups:
      - apps.openshift.io
      resources:
      - deploymentconfigs
      verbs:
      - list
      - get
    - apiGroups:
      - ""
      resources:
      - replicationcontrollers
      verbs:
      - get
      - list
    - apiGroups:
      - argoproj.io
      resources:
      - rollouts
      verbs:
      - list
      - get
    - apiGroups:
      - batch
      resources:
      - jobs
      verbs:
      - create
      - list
      - get
      - delete
      - deletecollection
    - apiGroups:
      - litmuschaos.io
      resources:
      - chaosengines
      - chaosexperiments
      - chaosresults
      verbs:
      - create
      - list
      - get
      - patch
      - update
      - delete
    scope: Namespaced
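Assuming the manifest above is saved as pod-cpu-hog-experiment.yaml (the filename is illustrative), apply it and confirm the experiment is registered:
kubectl apply -f pod-cpu-hog-experiment.yaml
kubectl get chaosexperiments -n default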
Create the RBAC below:
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/part-of: litmus
    name: pod-cpu-hog-sa
  name: pod-cpu-hog-sa
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  labels:
    app.kubernetes.io/part-of: litmus
    name: pod-cpu-hog-sa
  name: pod-cpu-hog-sa
  namespace: default
rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - deletecollection
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
  - get
  - list
  - patch
  - update
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
  - list
- apiGroups:
  - ""
  resources:
  - pods/log
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - pods/exec
  verbs:
  - get
  - list
  - create
- apiGroups:
  - apps
  resources:
  - deployments
  - statefulsets
  - replicasets
  - daemonsets
  verbs:
  - list
  - get
- apiGroups:
  - apps.openshift.io
  resources:
  - deploymentconfigs
  verbs:
  - list
  - get
- apiGroups:
  - ""
  resources:
  - replicationcontrollers
  verbs:
  - get
  - list
- apiGroups:
  - argoproj.io
  resources:
  - rollouts
  verbs:
  - list
  - get
- apiGroups:
  - batch
  resources:
  - jobs
  verbs:
  - create
  - list
  - get
  - delete
  - deletecollection
- apiGroups:
  - litmuschaos.io
  resources:
  - chaosengines
  - chaosexperiments
  - chaosresults
  verbs:
  - create
  - list
  - get
  - patch
  - update
  - delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  labels:
    app.kubernetes.io/part-of: litmus
    name: pod-cpu-hog-sa
  name: pod-cpu-hog-sa
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-cpu-hog-sa
subjects:
- kind: ServiceAccount
  name: pod-cpu-hog-sa
  namespace: default
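Assuming the RBAC manifests above are saved as pod-cpu-hog-rbac.yaml (the filename is illustrative), apply them and optionally spot-check that the service account can create pods:
kubectl apply -f pod-cpu-hog-rbac.yaml
kubectl auth can-i create pods -n default --as=system:serviceaccount:default:pod-cpu-hog-sa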
Deploy the ChaosEngine below:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: chaosengine-pod-cpu-hog
  namespace: default
spec:
  annotationCheck: "true"
  appinfo:
    appkind: deployment
    applabel: app=nginx
    appns: default
  chaosServiceAccount: pod-cpu-hog-sa
  components:
    runner:
      image: litmuschaos/chaos-runner:2.14.0
      imagePullPolicy: IfNotPresent
  engineState: active
  experiments:
  - name: pod-cpu-hog
    spec:
      components:
        env:
        - name: CONTAINER_RUNTIME
          value: containerd
        - name: SOCKET_PATH
          value: /run/containerd/containerd.sock
        - name: TOTAL_CHAOS_DURATION
          value: "30"
        - name: CPU_CORES
          value: "1"
        - name: TARGET_CONTAINER
          value: nginx
  jobCleanUpPolicy: retain
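Assuming the engine manifest is saved as chaosengine-pod-cpu-hog.yaml (the filename is illustrative), apply it, watch the runner/experiment/helper pods come and go, and then check the verdict:
kubectl apply -f chaosengine-pod-cpu-hog.yaml
kubectl get pods -n default -w
kubectl get chaosresults -n default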
Anything else we need to know?:
Log of the pod-cpu-hog-vczplk-d5fsw pod created during the experiment:
time="2023-08-14T09:45:53Z" level=info msg="Experiment Name: pod-cpu-hog"
time="2023-08-14T09:45:53Z" level=info msg="[PreReq]: Getting the ENV for the pod-cpu-hog experiment"
time="2023-08-14T09:45:55Z" level=info msg="[PreReq]: Updating the chaos result of pod-cpu-hog experiment (SOT)"
time="2023-08-14T09:45:57Z" level=info msg="The application information is as follows" Namespace=default Label="app=nginx" App Kind=deployment
time="2023-08-14T09:45:57Z" level=info msg="[Status]: Verify that the AUT (Application Under Test) is running (pre-chaos)"
time="2023-08-14T09:45:57Z" level=info msg="[Status]: The Container status are as follows" container=nginx Pod=nginx-deployment-54bcfc567b-pjddz Readiness=true
time="2023-08-14T09:45:57Z" level=info msg="[Status]: The status of Pods are as follows" Pod=nginx-deployment-54bcfc567b-pjddz Status=Running
time="2023-08-14T09:45:57Z" level=info msg="[Status]: The Container status are as follows" container=nginx Pod=nginx-deployment-54bcfc567b-sm4ql Readiness=true
time="2023-08-14T09:45:57Z" level=info msg="[Status]: The status of Pods are as follows" Pod=nginx-deployment-54bcfc567b-sm4ql Status=Running
time="2023-08-14T09:45:59Z" level=info msg="[Info]: The chaos tunables are:" Sequence=parallel PodsAffectedPerc=0 CPU Core=1 CPU Load Percentage=100
time="2023-08-14T09:45:59Z" level=info msg="[Chaos]:Number of pods targeted: 1"
time="2023-08-14T09:45:59Z" level=info msg="[Info]: Target pods list for chaos, [nginx-deployment-54bcfc567b-pjddz]"
time="2023-08-14T09:45:59Z" level=info msg="[Info]: Details of application under chaos injection" PodName=nginx-deployment-54bcfc567b-pjddz NodeName=amit-vm-2 ContainerName=nginx
time="2023-08-14T09:45:59Z" level=info msg="[Status]: Checking the status of the helper pods"
time="2023-08-14T09:46:04Z" level=info msg="[Wait]: waiting till the completion of the helper pod"
time="2023-08-14T09:49:37Z" level=error msg="[Error]: CPU hog failed, err: helper pod failed, err: Unable to find the pods with matching labels"
Events from the Job that creates the pod-cpu-hog-vczplk-d5fsw pod:
Events:
  Type    Reason            Age  From            Message
  ----    ------            ---  ----            -------
  Normal  SuccessfulCreate  10s  job-controller  Created pod: pod-cpu-hog-vczplk-d5fsw
  Normal  SuccessfulDelete  2s   job-controller  Deleted pod: pod-cpu-hog-helper-xrpvbv
It seems like the helper pod is getting deleted immediately after it is created.
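A diagnostic sketch (not from the original report) to catch the helper pod's labels in the short window before the job-controller deletes it:
kubectl get pods -n default --show-labels -w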
I am also facing the same issue on an Amazon EKS cluster (v1.27); it works correctly on v1.24.
Same here! With Litmus 3.0.0-beta8 (and reproduced on 3.0.0-beta7 too) on EKS 1.27; it was working fine on 1.26.
Might this be related to the --container-runtime kubelet flag, deprecated since 1.24 and removed in 1.27? See the Kubernetes release notes.
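For what it's worth, one way to confirm which runtime the nodes actually report (a check of my own, not from this thread):
kubectl get nodes -o custom-columns=NAME:.metadata.name,RUNTIME:.status.nodeInfo.containerRuntimeVersion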
Able to reproduce on Minikube + containerd + Litmus 3.0.0-beta8:
- Case 1, control group on Kubernetes 1.26.8: all chaos experiments requiring the container runtime work fine.
- Case 2, error group on Kubernetes 1.27: the helper pod is instantly killed.
We'll keep our clusters on Kubernetes 1.26.x (below 1.27) for now, but please, Harness/Litmus team, have a look at https://kubernetes.io/blog/2023/03/17/upcoming-changes-in-kubernetes-v1-27/#removal-of-container-runtime-command-line-argument
This is fixed in 3.0.0-beta10 via https://github.com/litmuschaos/litmus-go/pull/665 and in 2.14.1 via https://github.com/litmuschaos/litmus-go/pull/669.
Based on the PRs, how does deleting labels fix the issue? The release notes mention a kubelet flag, but I don't see how that would impact starting the helper pods via the Kubernetes API.
EDIT: Or is it related to the standard labels that are added to Job pods since 1.27?
https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.27.md#api-change-4
Pods owned by a Job now uses the labels batch.kubernetes.io/job-name and batch.kubernetes.io/controller-uid. The legacy labels job-name and controller-uid are still added for compatibility. (#114930, @kannon92)
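For reference, these extra labels are easy to observe on any v1.27 cluster with a throwaway Job (the job name "demo" is purely illustrative):
kubectl create job demo --image=busybox -- sleep 5
kubectl get pods -l batch.kubernetes.io/job-name=demo --show-labels
The pod should carry both the new batch.kubernetes.io/job-name and batch.kubernetes.io/controller-uid labels and the legacy job-name and controller-uid ones.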
@ksatchit can 2.14.1 be pushed to Docker Hub? The only other solution is using 3.x, which is a big change (and I have yet to get it fully working...).