
[BUG]: aks-log-collector.sh creates a large ip_netns_commands.txt which leads to ephemeral-storage issues #4148

Open axelgMS opened this issue 1 year ago • 1 comments

What happened: cf https://github.com/Azure/AKS/issues/4148

Describe the bug: We have observed that ip_netns_commands.txt files (example location: /tmp/tmp.4FKbTfOrn4/collect) on our AKS cluster nodes sometimes grow to many GB. When the size reaches around 90 GB, the nodes start having ephemeral-storage issues ("The node was low on resource: ephemeral-storage."), pods are evicted, and multiple other issues appear.

root@aks-apps5--vmss000014:/tmp/tmp.4FKbTfOrn4/collect# ls -lh ip_netns_commands.txt
-rw-r--r-- 1 root root 38G Mar 7 13:19 ip_netns_commands.txt
root@aks-apps5--vmss000014:/tmp/tmp.4FKbTfOrn4/collect# fuser -v ip_netns_commands.txt
                     USER        PID ACCESS COMMAND
/tmp/tmp.4FKbTfOrn4/collect/ip_netns_commands.txt:
                     root      83233 F.... ip
                     root     109814 F.... ip
                     root     560481 F.... ip
                     root    1133085 F.... ip
                     root    1133086 F.... ip
                     root    1133087 F.... ip
                     root    1134797 F.... ip
                     root    1737066 F.... ip
                     root    1737172 F.... ip
                     root    1737210 F.... ip
                     root    1737451 F.... ip

root@aks-apps5--vmss000014:/tmp/tmp.4FKbTfOrn4/collect# pstree -aps 83233
systemd,1
  └─aks-log-collect,61814 /opt/azure/containers/aks-log-collector.sh
      └─ip,83233 -all netns exec /bin/bash -x -c...
          └─ip,109814 -all netns exec /bin/bash -x -c...
              └─ip,1737066 -all netns exec /bin/bash -x -c...
                  └─ip,1737172 -all netns exec /bin/bash -x -c...
                      └─ip,1737210 -all netns exec /bin/bash -x -c...
                          └─ip,1737451 -all netns exec /bin/bash -x -c...
                              └─ip,560481 -all netns exec /bin/bash -x -c...
                                  └─ip,1133085 -all netns exec /bin/bash -x -c...
                                      └─bash,1147264 -x -c...
                                          └─ss,1147282 -anoempiO --cgroup
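The manual `ls -lh` check above could be generalized to scan a node for oversized collector output. The following is only a minimal sketch, not part of aks-log-collector.sh: the function name `find_large_netns_logs`, the default search root `/tmp`, and the 1 GiB default threshold are all assumptions chosen for illustration. It relies on `find -size +Nc`, which matches files larger than N bytes.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from AgentBaker): list ip_netns_commands.txt files
# under a directory whose size exceeds a byte threshold.
#   $1: directory to search (assumed default: /tmp)
#   $2: minimum size in bytes (assumed default: 1 GiB, an arbitrary choice)
find_large_netns_logs() {
  local root="${1:-/tmp}"
  local min_bytes="${2:-1073741824}"
  # "-size +Nc" matches files strictly larger than N bytes.
  find "$root" -type f -name 'ip_netns_commands.txt' \
    -size "+${min_bytes}c" -print 2>/dev/null
}

# Example usage on a node (prints nothing if no file exceeds the threshold):
#   find_large_netns_logs /tmp 1073741824
```

Each path it prints can then be inspected with `fuser -v` as shown above to see which `ip` processes still hold the file open.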

root@aks-apps5--vmss000014:/tmp/tmp.4FKbTfOrn4/collect# head --lines 20 /opt/azure/containers/aks-log-collector.sh
#! /bin/bash

# AKS Log Collector

# This script collects information and logs that are useful to AKS engineering
# for support and uploads them to the Azure host via a private API. These log
# bundles are available to engineering when customers open a support case and
# are especially useful for troubleshooting failures of networking or
# kubernetes daemons.

# This script runs via a systemd unit and slice that limits it to low CPU
# priority and 128MB RAM, to avoid impacting other system functions.

# Log bundle upload max size is limited to 100MB
MAX_SIZE=104857600

# Shell options - remove non-matching globs, don't care about case, and use
# extended pattern matching
shopt -s nullglob nocaseglob extglob

AKS 1.28.5

axelgMS • Apr 22 '24 14:04

One way to stop the issue is to disable the log collector on the nodes. It is controlled by a systemd timer unit.

systemctl stop aks-log-collector.timer
systemctl disable aks-log-collector.timer

This would have to be run on every node. We are looking into a "long term" fix for this issue.
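Since the workaround has to be repeated per node, it may help to wrap the two commands in a small function that also reports the resulting unit state. This is a sketch under assumptions: the function name `disable_log_collector_timer` and the `SYSTEMCTL` override variable are hypothetical (the override exists only so the steps can be dry-run with `echo`), and how the function is delivered to each node is left open.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the workaround above. Run as root on a node.
# SYSTEMCTL is an assumed override for dry runs, e.g. SYSTEMCTL=echo.
disable_log_collector_timer() {
  local sc="${SYSTEMCTL:-systemctl}"
  local unit="aks-log-collector.timer"
  # Stop the timer so no further collector runs are scheduled.
  "$sc" stop "$unit" || return 1
  # Disable it so it does not come back after a reboot.
  "$sc" disable "$unit" || return 1
  # Print the enablement state; is-enabled exits non-zero when the unit is
  # disabled, so the exit status is deliberately ignored here.
  "$sc" is-enabled "$unit" || true
}

# Dry run that only prints the systemctl arguments:
#   SYSTEMCTL=echo disable_log_collector_timer
```

Note that this only prevents future runs; `ip` processes from an earlier collector run that are already stuck, as in the `pstree` output in the report, are not affected by stopping the timer.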

UtheMan • Apr 25 '24 23:04

fixed in https://github.com/Azure/AgentBaker/pull/4357

cameronmeissner • Jun 28 '24 19:06