[BUG]: aks-log-collector.sh creates a large ip_netns_commands.txt which leads to ephemeral-storage issues #4148
What happened: see https://github.com/Azure/AKS/issues/4148
Describe the bug: We have observed that ip_netns_commands.txt files (example location: /tmp/tmp.4FKbTfOrn4/collect) on our AKS cluster nodes sometimes grow to many GB. When a file reaches around 90 GB, the node starts having ephemeral-storage issues ("The node was low on resource: ephemeral-storage."), pods are then evicted, and multiple other issues appear.
root@aks-apps5-
root@aks-apps5-
root@aks-apps5-
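To check whether a node is affected, a search along the following lines may help. This is a minimal sketch: the function name `find_large_netns_files` and the 1 GiB default threshold are illustrative assumptions, not part of aks-log-collector.sh, and it assumes the collector's working directories live under /tmp as in the example path above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of aks-log-collector.sh): list
# ip_netns_commands.txt files under a directory that exceed a size threshold.
#   $1  directory to search (defaults to /tmp, as in the report above)
#   $2  threshold in bytes  (defaults to 1 GiB, an arbitrary choice)
find_large_netns_files() {
    local dir="${1:-/tmp}"
    local min_bytes="${2:-1073741824}"
    # "-size +Nc" matches files strictly larger than N bytes.
    find "$dir" -type f -name 'ip_netns_commands.txt' -size "+${min_bytes}c" 2>/dev/null
}
```

Run as root on a node, `find_large_netns_files /tmp` prints the path of each matching file, or nothing if no file exceeds the threshold.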
# AKS Log Collector
#
# This script collects information and logs that are useful to AKS engineering
# for support and uploads them to the Azure host via a private API. These log
# bundles are available to engineering when customers open a support case and
# are especially useful for troubleshooting failures of networking or
# kubernetes daemons.
#
# This script runs via a systemd unit and slice that limits it to low CPU
# priority and 128MB RAM, to avoid impacting other system functions.

# Log bundle upload max size is limited to 100MB
MAX_SIZE=104857600

# Shell options - remove non-matching globs, don't care about case, and use
# extended pattern matching
shopt -s nullglob nocaseglob extglob
AKS 1.28.5
One way to stop the issue is to disable the log collector on the nodes. It is controlled by a systemd timer unit.
systemctl stop aks-log-collector.timer
systemctl disable aks-log-collector.timer
This would have to be run on every node. We are looking into a long-term fix for this issue.
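Because the workaround has to be repeated per node, it can be wrapped in a loop. The following is only a sketch: `disable_log_collector_on_nodes`, `dry_run`, and the `RUN_ON_NODE` hook are hypothetical names, and how commands actually reach a node (SSH, a node debug session, or something else) is left as an assumption to be filled in per cluster. By default it only prints what it would run.

```shell
#!/usr/bin/env bash
# Default runner: print the command instead of executing it (dry run).
dry_run() {
    local node="$1"; shift
    printf '[%s] %s\n' "$node" "$*"
}

# Hypothetical helper: apply the timer workaround to each node name given.
# RUN_ON_NODE (an assumed hook) names a command or function that takes a node
# name followed by the command to execute on that node.
disable_log_collector_on_nodes() {
    local runner="${RUN_ON_NODE:-dry_run}"
    local node
    for node in "$@"; do
        # Stop the timer first, then disable it; abort on the first failure.
        "$runner" "$node" systemctl stop aks-log-collector.timer &&
        "$runner" "$node" systemctl disable aks-log-collector.timer || return 1
    done
}
```

Calling `disable_log_collector_on_nodes node-a node-b` without setting `RUN_ON_NODE` lists the two systemctl commands for each node, which is a cheap way to review the plan before pointing the hook at a real transport.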
Fixed in https://github.com/Azure/AgentBaker/pull/4357