litmus
Disk fill for ephemeral storage doesn't work properly
What happened:
This time I'm trying to run the disk-fill experiment. No matter which fill percentage I choose (20% or 80%), I get the same error message:
time="2023-11-30T14:56:10Z" level=info msg="[Fill]: Filling ephemeral storage, size: 214748KB"
time="2023-11-30T14:56:10Z" level=info msg="dd: {sudo dd if=/dev/urandom of=/proc/1249333/root/home/diskfill bs=256K count=838}"
time="2023-11-30T14:56:13Z" level=fatal msg="helper pod failed, err: could not fill ephemeral storage\n --- at /litmus-go/chaoslib/litmus/disk-fill/helper/disk-fill.go:137 (diskFill) ---\nCaused by: {"source":"disk-fill-helper-nsnw2","errorCode":"CHAOS_INJECT_ERROR","reason":"838+0 records in\n838+0 records out\n","target":"{podName: testing-pod-86b47d547d-vnzfb, namespace: test, container: test"}"
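The numbers in this log are consistent with an integer division of the fill size by the dd block size. Here is a minimal Go sketch of that arithmetic, assuming the helper derives count this way; blockCount is a hypothetical name and is not taken from litmus-go:

```go
package main

import "fmt"

// blockCount is a hypothetical helper: it returns how many whole blocks of
// blockSizeKB fit into sizeKB, which is the value passed to dd as count=.
func blockCount(sizeKB, blockSizeKB int) int {
	if blockSizeKB <= 0 {
		return 0
	}
	return sizeKB / blockSizeKB // integer division drops the remainder
}

func main() {
	// Values from the log above: size 214748KB, bs=256K.
	count := blockCount(214748, 256)
	fmt.Printf("bs=256K count=%d\n", count)
}
```

With the logged values this gives count=838, which matches the dd command in the log. The reported reason, "838+0 records in / 838+0 records out", also says that all 838 full blocks were read and written, so the dd run itself looks complete.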
What you expected to happen:
Disk fill should end with success.
Where can this issue be corrected? (optional)
This part of the code should be fixed:
https://github.com/litmuschaos/litmus-go/blob/v3.1.x/chaoslib/litmus/disk-fill/helper/disk-fill.go#L342
https://github.com/litmuschaos/litmus-go/blob/v3.1.x/chaoslib/litmus/disk-fill/helper/disk-fill.go#L178
How to reproduce it (as minimally and precisely as possible):
It can be easily reproduced by running the experiment from the Litmus Portal. I also ran the steps manually to find where the problem might be:
- I was able to create the helper pod manually and run it on my GKE cluster to experiment with disk-fill.
- I was able to find the container ID and container PID of the pod whose ephemeral storage I want to fill.
- First, the size of the USED ephemeral storage is calculated incorrectly (at least in Litmus 3.1) because the code builds this command: du := fmt.Sprintf("sudo du /proc/%v/root", t.TargetPID). If /proc/%v/root is a symlink (and it is), du returns 0 every time. When the path is given as /proc/%v/root/ (with a slash at the end), du returns the proper value.
- I ran the "dd" command manually from the helper pod:
  bash-5.1# crictl inspect --output yaml ac136572dd3cf| egrep pid
  pid: 1
  pid: 1825267
  - type: pid
  bash-5.1# dd if=/dev/urandom of=/proc/1825267/root/home/diskfill bs=256K count=10485
  10485+0 records in
  10485+0 records out
  bash-5.1# echo $?
  0
- The file exists:
  ls -latrh /proc/1825267/root/home/diskfill
  -rw-r--r-- 1 root root 2.6G Dec 8 12:35 /proc/1825267/root/home/diskfill
- When I create a bigger file (bigger than the ephemeral storage limit), the pod is evicted, which works perfectly.
- But when we run it from the Litmus toolkit, it fails with no further messages. I checked the code, and the failure seems to come from this part:
  if t.SizeToFill > 0 {
      if err := fillDisk(t, experimentsDetails.DataBlockSize); err != nil {
          return stacktrace.Propagate(err, "could not fill ephemeral storage")
      }
  }
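- I think the helper pod treats the normal dd output (for example "10485+0 records in" and "10485+0 records out") as an error, so the entire injection is marked as failed. dd prints these record statistics to stderr even when it succeeds, which would explain why they end up in the "reason" field.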
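The two suspicions above can be sketched in Go. This is only an illustration under assumptions, not the litmus-go implementation: duCommand and runShell are hypothetical names, and the shell snippet in main only imitates dd by printing record statistics to stderr and exiting with status 0. The sketch decides success from the exit status alone and keeps stderr as plain diagnostic text.

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
)

// duCommand is a hypothetical helper. It appends the trailing slash that,
// as reported above, makes du measure the directory behind the
// /proc/<pid>/root symlink instead of the link itself.
func duCommand(pid int) string {
	return fmt.Sprintf("sudo du /proc/%d/root/", pid)
}

// runShell is a hypothetical helper. It runs script with sh and returns
// stdout and stderr separately. err is non-nil only when the command could
// not be started or exited with a non-zero status.
func runShell(script string) (stdout, stderr string, err error) {
	var outBuf, errBuf bytes.Buffer
	cmd := exec.Command("sh", "-c", script)
	cmd.Stdout = &outBuf
	cmd.Stderr = &errBuf
	err = cmd.Run()
	return outBuf.String(), errBuf.String(), err
}

func main() {
	fmt.Println(duCommand(1825267))

	// Stand-in for dd: statistics go to stderr, exit status is 0.
	_, stderr, err := runShell(`printf '838+0 records in\n838+0 records out\n' >&2`)
	if err != nil {
		fmt.Println("fill failed:", err, stderr)
		return
	}
	fmt.Printf("fill succeeded, dd statistics: %q\n", stderr)
}
```

If the helper currently fails whenever stderr is non-empty, a check along these lines would let a successful dd run pass while still reporting real failures such as a non-zero exit.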
Anything else we need to know?:
Any updates on this?
Can someone from the dev team comment on it?
Hi @ash-man, thanks for raising this issue. We're looking into it.