Bottlerocket under-reports Ephemeral Storage Capacity
Image I'm using:
AMI Name: bottlerocket-aws-k8s-1.22-aarch64-v1.11.1-104f8e0f
What I expected to happen:
I expected the ephemeral-storage capacity reported for my worker node to be approximately the size of the EBS volume attached at `/dev/xvdb`, which backs the node filesystem.
bash-5.1# df -h
Filesystem       Size  Used  Avail Use% Mounted on
/dev/root        904M  557M   285M  67% /
devtmpfs          16G     0    16G   0% /dev
tmpfs             16G     0    16G   0% /dev/shm
tmpfs            6.1G  1.2M   6.1G   1% /run
tmpfs            4.0M     0   4.0M   0% /sys/fs/cgroup
tmpfs             16G  476K    16G   1% /etc
tmpfs             16G  4.0K    16G   1% /etc/cni
tmpfs             16G     0    16G   0% /tmp
tmpfs             16G  4.0K    16G   1% /etc/containerd
tmpfs             16G   12K    16G   1% /etc/host-containers
tmpfs             16G  4.0K    16G   1% /etc/kubernetes/pki
tmpfs             16G     0    16G   0% /root/.aws
/dev/nvme1n1p1   4.3T  1.6G   4.1T   1% /local
/dev/nvme0n1p12   36M  944K    32M   3% /var/lib/bottlerocket
overlay          4.3T  1.6G   4.1T   1% /aarch64-bottlerocket-linux-gnu/sys-root/usr/lib/modules
overlay          4.3T  1.6G   4.1T   1% /opt/cni/bin
/dev/loop1       384K  384K      0 100% /aarch64-bottlerocket-linux-gnu/sys-root/usr/share/licenses
/dev/loop0        12M   12M      0 100% /var/lib/kernel-devel/.overlay/lower
overlay          4.3T  1.6G   4.1T   1% /aarch64-bottlerocket-linux-gnu/sys-root/usr/src/kernels
bash-5.1# lsblk
NAME          MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
loop0           7:0    0 11.6M  1 loop /var/lib/kernel-devel/.overlay/lower
loop1           7:1    0  292K  1 loop /aarch64-bottlerocket-linux-gnu/sys-root/usr/share/licenses
nvme0n1       259:0    0    2G  0 disk
|-nvme0n1p1   259:2    0    4M  0 part
|-nvme0n1p2   259:3    0    5M  0 part
|-nvme0n1p3   259:4    0   40M  0 part /boot
|-nvme0n1p4   259:5    0  920M  0 part
|-nvme0n1p5   259:6    0   10M  0 part
|-nvme0n1p6   259:7    0   25M  0 part
|-nvme0n1p7   259:8    0    5M  0 part
|-nvme0n1p8   259:9    0   40M  0 part
|-nvme0n1p9   259:10   0  920M  0 part
|-nvme0n1p10  259:11   0   10M  0 part
|-nvme0n1p11  259:12   0   25M  0 part
`-nvme0n1p12  259:13   0   42M  0 part /var/lib/bottlerocket
nvme1n1       259:1    0  4.3T  0 disk
`-nvme1n1p1   259:14   0  4.3T  0 part /var
                                       /opt
                                       /mnt
                                       /local

What actually happened:
status:
  addresses:
  - address: 192.168.124.121
    type: InternalIP
  - address: ip-192-168-124-121.us-west-2.compute.internal
    type: Hostname
  - address: ip-192-168-124-121.us-west-2.compute.internal
    type: InternalDNS
  allocatable:
    attachable-volumes-aws-ebs: "39"
    cpu: 15890m
    ephemeral-storage: "1342050565150"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    hugepages-32Mi: "0"
    hugepages-64Ki: "0"
    memory: 28738288Ki
    pods: "234"
    vpc.amazonaws.com/pod-eni: "54"
  capacity:
    attachable-volumes-aws-ebs: "39"
    cpu: "16"
    ephemeral-storage: 1457383148Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    hugepages-32Mi: "0"
    hugepages-64Ki: "0"
    memory: 31737584Ki
    pods: "234"
    vpc.amazonaws.com/pod-eni: "54"
cAdvisor, or something else in the BR image, appears to be under-reporting the capacity available on this worker node.
The reported ephemeral-storage capacity of 1457383148Ki is approximately 1.35 TiB, which is nowhere near the ~4.3T that lsblk reports.
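For reference, a quick back-of-the-envelope conversion of the reported figure (plain awk, nothing Bottlerocket-specific):

```sh
# Convert the reported capacity from Ki to TiB
awk 'BEGIN { printf "%.3f TiB\n", 1457383148 / 1024^3 }'
```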
How to reproduce the problem:
- Launch an instance with the BR AMI that has an EBS volume larger than 4 TB attached at the `/dev/xvdb` mount
- View the node's capacity once it connects and joins the cluster, using `kubectl get node` or `kubectl describe node` (see the example command after this list)
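One way to pull just the reported field (the node name below is the one from this report; adjust for your own node):

```sh
# Prints the ephemeral-storage capacity currently recorded on the node object
kubectl get node ip-192-168-124-121.us-west-2.compute.internal \
  -o jsonpath="{.status.capacity['ephemeral-storage']}"
```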
@jonathan-innis, thanks for reaching out! We're taking a deeper look into this.
@jpculp Are there any updates or progress on this issue?
Unfortunately not yet. We have to take a deeper look at the interaction between the host, containerd, and cAdvisor. Out of curiosity, do you see the same behavior with bottlerocket-aws-k8s-1.24?
I haven't looked at the newer version on K8s 1.24 yet. Let me try it on a newer version of K8s and get back to you on that.
Hi @jonathan-innis, although I haven't fully root-caused the issue yet, I wanted to share an update with some information.
I took a deeper look into this, and the issue seems to stem from kubelet either not refreshing the node status or publishing the wrong filesystem stats to the K8s API.
When kubelet first starts up, cAdvisor hasn't fully initialized by the time kubelet queries for filesystem stats, so kubelet logs `invalid capacity 0 on image filesystem`. Apparently this is expected to happen sometimes, and eventually cAdvisor settles and starts reporting stats. When this happens, kubelet falls back to whatever stats it was able to scrounge up; the issue linked in that code block mentions that kubelet goes to the CRI for filesystem information. The problem is that those initial, partial filesystem stats under-report the available capacity, as you've noticed, and kubelet then never updates the K8s API with the correct filesystem stats even after cAdvisor is up and running.
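If you want to confirm you hit that startup race on a given node, the log line should show up in the kubelet journal; this is roughly how I checked, assuming a root shell on the host:

```sh
# From a root shell on the host (admin container, then `sudo sheltie`);
# assumes kubelet logs go to the journal, which is the case on Bottlerocket
journalctl -u kubelet.service | grep -i "invalid capacity 0 on image filesystem"
```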
After the node becomes ready, if I query the metrics endpoints, both the cAdvisor stats and the node summary stats report the capacity correctly:
cAdvisor:
...
container_fs_limit_bytes{container="",device="/dev/nvme1n1p1",id="/",image="",name="",namespace="",pod=""} 4.756159012864e+12 1675367414546
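For anyone following along, this is roughly how I scraped that metric, going through the API server's node proxy rather than hitting the kubelet directly (node name is the one from this test):

```sh
kubectl get --raw "/api/v1/nodes/ip-192-168-92-51.us-west-2.compute.internal/proxy/metrics/cadvisor" \
  | grep container_fs_limit_bytes
```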
Node summary:
...
"fs": {
  "time": "2023-02-02T19:50:04Z",
  "availableBytes": 4561598828544,
  "capacityBytes": 4756159012864,
  "usedBytes": 1264123904,
  "inodesFree": 294302950,
  "inodes": 294336000,
  "inodesUsed": 33050
},
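The node summary above came from the summary endpoint on the same node proxy (jq here is just a convenience for pulling out the node fs section, not required):

```sh
kubectl get --raw "/api/v1/nodes/ip-192-168-92-51.us-west-2.compute.internal/proxy/stats/summary" \
  | jq '.node.fs'
```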
But for some reason, the node object in the cluster does not reflect that in the K8s API:
Output of `kubectl describe node`:
Hostname:    ip-192-168-92-51.us-west-2.compute.internal
Capacity:
  attachable-volumes-aws-ebs:  25
  cpu:                         4
  ephemeral-storage:           1073420188Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      16073648Ki
  pods:                        58
Allocatable:
  attachable-volumes-aws-ebs:  25
  cpu:                         3920m
  ephemeral-storage:           988190301799
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      15056816Ki
  pods:                        58
Only reports ~988 GB
What's interesting is that once you either reboot the worker node or restart the kubelet service, the stats sync up correctly:
After rebooting, `kubectl describe node` shows:
Hostname:    ip-192-168-92-51.us-west-2.compute.internal
Capacity:
  attachable-volumes-aws-ebs:  25
  cpu:                         4
  ephemeral-storage:           4644686536Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      16073648Ki
  pods:                        58
Allocatable:
  attachable-volumes-aws-ebs:  25
  cpu:                         3920m
  ephemeral-storage:           4279469362667
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      15056816Ki
  pods:                        58
ephemeral-storage correctly reports the 4.2 TB available.
So it seems like kubelet is not updating the node filesystem stats in the K8s API as frequently as it should. I currently can't explain why this happens on Bottlerocket but doesn't reproduce on AL2. I suspect kubelet's fallback to querying the CRI has something to do with it, i.e. cri/containerd vs dockershim (https://github.com/kubernetes/kubernetes/pull/51152).
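If it's useful when comparing against AL2, the runtime in play is visible straight from the node object (node name here is from my repro):

```sh
# Shows which CRI the node is using, e.g. containerd://1.x.y
kubectl get node ip-192-168-92-51.us-west-2.compute.internal \
  -o jsonpath='{.status.nodeInfo.containerRuntimeVersion}'
```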
If you want to work around this issue, you can reboot the node or restart kubelet, which gets kubelet to start reporting the correct ephemeral storage capacity. In the meantime, I'll spend more time digging into kubelet/containerd.
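Concretely, the workaround looks something like this on Bottlerocket (assuming an apiclient version that has the `reboot` subcommand for the first option):

```sh
# Option 1: reboot the node through the Bottlerocket API (from the host or admin container)
apiclient reboot

# Option 2: restart just kubelet from a root shell on the host (admin container -> sudo sheltie)
systemctl restart kubelet.service
```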
Wondering if you are still seeing this behavior. If so, do the stats eventually correct themselves, or does the node keep reporting the wrong size indefinitely once it's in this state?
There's a 10-second cache timeout for stats, so I wonder if we are hitting a case where the cached data needs to be invalidated before kubelet actually checks again and picks up the full storage space.
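If someone who can still reproduce this wants to check, a loop like the one below would show whether the reported value ever converges on its own (`<node-name>` is a placeholder; adjust as needed):

```sh
# Poll the reported capacity every 30s to see whether it ever corrects itself
while true; do
  kubectl get node <node-name> -o jsonpath="{.status.capacity['ephemeral-storage']}"
  echo
  sleep 30
done
```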
~~We ran into a similar issue as well on 1.28, resulting in pods being unschedulable due to insufficient storage. I tried rebooting, but that didn't seem to work.~~
~~In our case we have a second EBS volume (1 TB) we are using, and it seems like it's not being picked up at all.~~
I didn't realize I needed to specify the device as `/dev/xvdb` (`/dev/xvda` works on the Amazon Linux AMI); it works fine once updated to that.
Still seeing this behavior on EKS 1.25. Entering the admin container, then `sudo sheltie`, then `systemctl restart kubelet.service` causes the node to start reporting the correct value for ephemeral storage.
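For anyone else hitting this, the sequence spelled out (via SSM into the control container; commands as I understand the Bottlerocket docs):

```sh
# From the SSM session into the control container:
enter-admin-container
# Inside the admin container, get a root shell on the host:
sudo sheltie
# On the host, restart kubelet:
systemctl restart kubelet.service
```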
FWIW, we recently upgraded to 1.26, and the behavior is still there as well.
Hi @James-Quigley @jonathan-innis, I suspect this issue might be addressed by the change that has kubelet monitor the container runtime cgroup (https://github.com/bottlerocket-os/bottlerocket/pull/3804). Are you still seeing this issue on Bottlerocket versions >= 1.19.5?
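If it helps anyone checking whether their nodes are past that fix, the OS image (and therefore the Bottlerocket version) is visible on the node objects:

```sh
# Quick way to see which Bottlerocket version each node is running
kubectl get nodes -o custom-columns=NAME:.metadata.name,OS_IMAGE:.status.nodeInfo.osImage
```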
Resolving under the theory that this may have been fixed by #3804. If not, please re-open.