autoscaling Debugging tools more easily available in compute pods and VMs

Problem description / Motivation

Debugging tools are missing in compute VMs and are not easily installable when the VM is at its memory limit. And because the pod uses Alpine and the VM uses Debian, we can't simply copy tools and libs from the pod into the VM.

Feature idea(s) / DoD

During INC-415 it would have helped a lot to have network debugging tools readily available in at least the pod and possibly in the VM.

Implementation ideas

I understand we don't want to have too many things exposed in the VM by default, so it would be good if one could easily install debug tooling ad hoc. This could be a tarball in the pod that is unpacked inside the VM when a script in the pod is run.

Ideally, we would have a pre-built image used with kubectl debug containing everything we need.
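
Roughly what the `kubectl debug` route could look like, as a sketch only. The image name below is a placeholder (no such image exists yet as far as this issue states), and `debug_cmd` is a made-up helper that prints the command instead of running it.

```shell
#!/bin/sh
# Placeholder image name; assumed for illustration, not an existing image.
DEBUG_IMAGE="${DEBUG_IMAGE:-example.registry/compute-debug-tools:latest}"

# debug_cmd POD CONTAINER
# Prints the kubectl command that would attach an ephemeral debug container
# to POD, sharing the process namespace of CONTAINER.
debug_cmd() {
    printf 'kubectl debug -it %s --image=%s --target=%s\n' "$1" "$DEBUG_IMAGE" "$2"
}
```

This only covers the pod side; tools inside the VM would still need something like the tarball approach.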

Mar 06 '25 13:03 cloneable

From @kelvich: fixing #1304 might provide another route for this issue

Mar 10 '25 16:03 sharnoff

that was helpful:

apt update
apt install -y tcpdump screen iproute2 dnsutils iputils-ping lsof strace

Mar 10 '25 16:03 kelvich

Right, I should have mentioned that sometimes it's not possible to install tools when the VM is too loaded:

$ apt install iproute2 tcpdump
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  dbus libapparmor1 libatm1 libbpf0 libbsd0 libcap2 libcap2-bin libdbus-1-3 libmd0 libmnl0 libpam-cap libpcap0.8
  libxtables12
Suggested packages:
  default-dbus-session-bus | dbus-session-bus iproute2-doc apparmor
The following NEW packages will be installed:
  dbus iproute2 libapparmor1 libatm1 libbpf0 libbsd0 libcap2 libcap2-bin libdbus-1-3 libmd0 libmnl0 libpam-cap libpcap0.8
  libxtables12 tcpdump
0 upgraded, 15 newly installed, 0 to remove and 10 not upgraded.
Need to get 0 B/2556 kB of archives.
After this operation, 7033 kB of additional disk space will be used.
Do you want to continue? [Y/n]
FATAL -> Failed to fork.

Mar 10 '25 17:03 cloneable

By the way, do we know what exactly causes Failed to fork? I am looking at the VM and it doesn't look "too loaded":

root@compute-billowing-waterfall-w2v36sgm-nhzfg:~# uptime
 10:41:58 up  2:44,  0 users,  load average: 0.48, 0.57, 0.54
root@compute-billowing-waterfall-w2v36sgm-nhzfg:~# free -h
               total        used        free      shared  buff/cache   available
Mem:           914Mi       467Mi       154Mi        32Mi       292Mi       400Mi
Swap:          1.0Gi          0B       1.0Gi
root@compute-billowing-waterfall-w2v36sgm-nhzfg:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        20G  1.4G   18G   8% /
devtmpfs        456M     0  456M   0% /dev
shm-tmpfs        40G  1.1M   40G   1% /dev/shm
/dev/vdb         50K   50K     0 100% /neonvm/runtime
/dev/vdc         40K   40K     0 100% /mnt/ssh
/dev/vde         35G  6.7M   33G   1% /neonvm/cache
/dev/vdf        196G   19M  186G   1% /var/db/postgres/compute
root@compute-billowing-waterfall-w2v36sgm-nhzfg:~# apt install dnsutils
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  bind9-dnsutils bind9-host bind9-libs libbsd0 libedit2 libfstrm0 liblmdb0 libmaxminddb0 libmd0 libuv1
Suggested packages:
  mmdb-bin
FATAL -> Failed to fork.

Mar 19 '25 10:03 olegbbtr

Sorry, by "too loaded" I meant memory use that is too high. I didn't check exactly how high it was at the time. I don't have prod access right now; maybe check cgroup limits too.

[65164.881136] __vm_enough_memory: pid: 32509, comm: apt-get, not enough memory for the allocation
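
That `__vm_enough_memory` line comes from the kernel's overcommit accounting, so it may be worth comparing `CommitLimit` with `Committed_AS` and checking the overcommit mode. A rough sketch, not verified on a compute VM; `commit_headroom_kb` is a made-up name.

```shell
#!/bin/sh
# commit_headroom_kb [MEMINFO]
# Prints CommitLimit - Committed_AS in kB from a /proc/meminfo-style file.
# CommitLimit is only enforced when vm.overcommit_memory is 2 (strict mode);
# in that mode a small or negative value means new allocations, including
# the ones fork needs, can be refused.
commit_headroom_kb() {
    awk '/^CommitLimit:/  { limit = $2 }
         /^Committed_AS:/ { used = $2 }
         END { print limit - used }' "${1:-/proc/meminfo}"
}
```

On the VM this would be run as `cat /proc/sys/vm/overcommit_memory; commit_headroom_kb`. If the mode is 0 or 1, the headroom number is informational only and the cause is likely elsewhere.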

Mar 19 '25 11:03 cloneable

This issue was moved to Jira: LKB-1095

Jul 21 '25 09:07 zenithdb-bot-dev[bot]
