The pod-cpu-hog experiment does not work (helper pod timeout error)

Open hmi2622 opened this issue 1 year ago • 1 comments

What happened:

I want to run the pod-cpu-hog experiment. When I apply it, the pods are deployed in the order runner -> experiment pod -> helper pod, and they terminate in the same order.

However, the helper pod log shows that a timeout occurred. The details are as follows.

time="2024-01-23T06:48:12Z" level=info msg="Helper Name: stress-chaos"
time="2024-01-23T06:48:12Z" level=info msg="[PreReq]: Getting the ENV variables"
time="2024-01-23T06:48:12Z" level=info msg="container ID of amcs container, containerID: d199bd63e5d39b1e02710250cc8abb04ee59c35a17aa1ee7706ddcaadf2a98e9"
time="2024-01-23T06:48:12Z" level=info msg="[Info]: Container ID=d199bd63e5d39b1e02710250cc8abb04ee59c35a17aa1ee7706ddcaadf2a98e9 has process PID=1572728"
time="2024-01-23T06:48:12Z" level=info msg="[Info]: Details of Stressor:" CPU Core=1 Timeout=30
time="2024-01-23T06:48:12Z" level=info msg="[Info]: starting process: pause nsutil -t 1572728 -p -- stress-ng --timeout 30s --cpu 1"
time="2024-01-23T06:48:12Z" level=info msg="[Info]: Sending signal to resume the stress process"
time="2024-01-23T06:48:15Z" level=info msg="[Wait]: Waiting for chaos completion"
time="2024-01-23T06:49:15Z" level=info msg="[Timeout] Stress output: "
time="2024-01-23T06:49:15Z" level=info msg="[Cleanup]: Killing the stress process"
time="2024-01-23T06:49:15Z" level=info msg="[Info]: Stress process removed sucessfully"
time="2024-01-23T06:49:15Z" level=fatal msg="helper pod failed, err: the stress process is timeout after 60s"

Please advise me on how to analyze and resolve this timeout.
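As a starting point for the analysis, the timestamps in the helper log can be compared against the configured stress duration. The following is a minimal sketch in Python, not part of Litmus: the function names `parse` and `seconds_between` are made up for illustration, and the embedded log is copied from the helper pod output above. It only measures how long the helper waited; it does not explain why stress-ng failed to finish.

```python
import re
from datetime import datetime

# Four entries copied from the helper pod log in this issue.
LOG = '''time="2024-01-23T06:48:12Z" level=info msg="[Info]: starting process: pause nsutil -t 1572728 -p -- stress-ng --timeout 30s --cpu 1"
time="2024-01-23T06:48:15Z" level=info msg="[Wait]: Waiting for chaos completion"
time="2024-01-23T06:49:15Z" level=info msg="[Timeout] Stress output: "
time="2024-01-23T06:49:15Z" level=fatal msg="helper pod failed, err: the stress process is timeout after 60s"'''

# Matches the time, level, and msg fields of one log entry.
# Assumes the message itself contains no double quotes.
ENTRY = re.compile(r'time="([^"]+)" level=(\w+) msg="([^"]*)"')


def parse(log):
    """Return a list of (timestamp, level, message) tuples (hypothetical helper)."""
    return [
        (datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ"), level, msg)
        for ts, level, msg in ENTRY.findall(log)
    ]


def seconds_between(entries, start_marker, end_marker):
    """Seconds between the first entries whose messages contain each marker."""
    start = next(t for t, _, m in entries if start_marker in m)
    end = next(t for t, _, m in entries if end_marker in m)
    return (end - start).total_seconds()


entries = parse(LOG)
waited = seconds_between(entries, "[Wait]", "[Timeout]")
configured = int(re.search(r"--timeout (\d+)s", LOG).group(1))

print(f"stress-ng was started with --timeout {configured}s")
print(f"the helper waited {waited:.0f}s before reporting a timeout")
```

For this log, the helper waited twice as long as the stress-ng timeout, and the "Stress output" message is empty, so the log shows no sign that the stress process ever finished or wrote anything.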

Anything else we need to know?:

Also, the description of the ChaosResult reports a Pass verdict, as shown below. However, the experiment did not actually cause any disruption.

Spec:
  Engine: litmus-chaos
  Experiment: pod-cpu-hog
Status:
  Experiment Status:
    Fail Step: N/A
    Phase: Completed
    Probe Success Percentage: 100
    Verdict: Pass
  History:
    Failed Runs: 0
    Passed Runs: 3
    Stopped Runs: 0
    Targets:
      Chaos Status: reverted
      Kind: pod
      Name: test-9wqgp
Events:
  Type    Reason   Age    From                      Message
  Normal  Awaited  8m15s  pod-cpu-hog-mzavhp-zlcjr  experiment: pod-cpu-hog, Result: Awaited
  Normal  Pass     6m58s  pod-cpu-hog-mzavhp-zlcjr  experiment: pod-cpu-hog, Result: Pass
  Normal  Awaited  5m38s  pod-cpu-hog-vc7c3q-bhzpw  experiment: pod-cpu-hog, Result: Awaited
  Normal  Pass     4m19s  pod-cpu-hog-vc7c3q-bhzpw  experiment: pod-cpu-hog, Result: Pass

By contrast, pod-cpu-hog-exec injects chaos as expected. I want both experiments to produce the same result.

time="2024-01-23T06:59:50Z" level=info msg="Experiment Name: pod-cpu-hog-exec"
time="2024-01-23T06:59:50Z" level=info msg="[PreReq]: Getting the ENV for the experiment"
time="2024-01-23T06:59:50Z" level=info msg="[PreReq]: Updating the chaos result of pod-cpu-hog-exec experiment (SOT)"
time="2024-01-23T06:59:51Z" level=info msg="The application information is as follows" Namespace=test Label="app=test" Chaos Duration=120 Ramp Time=0
time="2024-01-23T06:59:51Z" level=info msg="[Status]: Verify that the AUT (Application Under Test) is running (pre-chaos)"
time="2024-01-23T06:59:51Z" level=info msg="[Status]: Checking whether application containers are in ready state"
time="2024-01-23T06:59:51Z" level=info msg="[Status]: The Container status are as follows" Readiness=true container=test Pod=test-f85d79bc5-xldzl
time="2024-01-23T06:59:53Z" level=info msg="[Status]: Checking whether application pods are in running state"
time="2024-01-23T06:59:53Z" level=info msg="[Status]: The status of Pods are as follows" Pod=test-f85d79bc5-xldzl Status=Running
time="2024-01-23T06:59:55Z" level=info msg="[Info]: chaos candidate of kind: deployment, name: test, namespace: test"
time="2024-01-23T06:59:55Z" level=info msg="[Chaos]:Number of pods targeted: 1"
time="2024-01-23T06:59:55Z" level=info msg="Target pods list for chaos, [test-f85d79bc5-xldzl]"
time="2024-01-23T06:59:55Z" level=info msg="[Chaos]: The Target application details" Target Container=test Target Pod=test-f85d79bc5-xldzl CPU CORE=10
time="2024-01-23T06:59:55Z" level=info msg="[Chaos]:Waiting for: 120s"
time="2024-01-23T07:01:55Z" level=info msg="[Chaos]: Time is up for experiment: pod-cpu-hog-exec"
time="2024-01-23T07:02:13Z" level=info msg="[Confirmation]: pod-cpu-hog-exec chaos has been injected successfully"
time="2024-01-23T07:02:13Z" level=info msg="[Status]: Verify that the AUT (Application Under Test) is running (post-chaos)"
time="2024-01-23T07:02:13Z" level=info msg="[Status]: Checking whether application containers are in ready state"

hmi2622, Jan 23 '24 07:01

My Litmus version is 1.13.8.

hmi2622, Jan 23 '24 12:01

Hi @hmi, can you try the latest version of this experiment? We have handled such issues in one of the later versions. You should be able to run it even with 2.x or 3.x.

Closing the issue. Feel free to reopen it if you still face any issues with the latest version.

uditgaurav, Mar 07 '24 07:03