
Add cgroupv2 support

Open ehashman opened this issue 1 year ago • 2 comments

Currently, UMC only supports collection of metrics with cgroupv1. cgroupv2 has been GA in Kubernetes since 1.25, so it would be nice to add support.

/kind feature

ehashman commented Jul 31 '24 18:07

/triage accepted

ehashman commented Aug 07 '24 20:08

Here is a design doc I've written for how these changes should work (implementation in #140):

In order to add support in usage-metrics-collector for cgroupv2 systems, we must address the following:

  • Determine equivalent metric counterparts of cgroupv1 metrics on cgroupv2
  • Update any types to ensure compatibility with cgroupv2
  • Add cgroupv2 support to all affected components in usage-metrics-collector (e.g. the node sampler, ctrstats); a shared detection/path helper is sketched below
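
Since every component reads files out of /sys/fs/cgroup, a shared helper for detecting which hierarchy is mounted and for building file paths is a natural first step. Here is a minimal sketch, assuming detection via the cgroup.controllers file (which only exists at the root of a cgroupv2 mount); the package, function, and parameter names are illustrative, not existing usage-metrics-collector code.

package cgroups

import (
    "os"
    "path/filepath"
)

// isCgroupV2 reports whether root (normally /sys/fs/cgroup) is a unified
// cgroupv2 mount. The cgroup.controllers file only exists at the root of
// the v2 hierarchy, so its presence distinguishes the two layouts.
func isCgroupV2(root string) bool {
    _, err := os.Stat(filepath.Join(root, "cgroup.controllers"))
    return err == nil
}

// statFilePath builds the path to a stat file for a cgroup. On v1, files
// live under a per-subsystem tree (e.g. /sys/fs/cgroup/cpuacct/kubepods/cpuacct.usage);
// on v2, everything lives in one unified tree (e.g. /sys/fs/cgroup/kubepods/cpu.stat).
func statFilePath(root, subsystem, cgroup, file string, v2 bool) string {
    if v2 {
        return filepath.Join(root, cgroup, file)
    }
    return filepath.Join(root, subsystem, cgroup, file)
}

With something like this in place, each sampler can branch once on the detected version and keep the rest of its collection loop unchanged.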

Note: usage-metrics-collector documents cgroupv1 metrics here: https://github.com/kubernetes-sigs/usage-metrics-collector/blob/d82be733bf050986d2285f2731a350a1aaa12cf0/pkg/api/samplerserverv1alpha1/doc.go#L38-L48

CPU usage samples

cgroupv1 example:

cat /sys/fs/cgroup/cpuacct/system.slice/cpuacct.usage
5771997191964 # usage

/sys/fs/cgroup/cpuacct/system.slice/kubelet.service/cpuacct.usage
/sys/fs/cgroup/cpuacct/kubepods/cpuacct.usage
/sys/fs/cgroup/cpuacct/kubepods/guaranteed/cpuacct.usage
/sys/fs/cgroup/cpuacct/kubepods/burstable/podpod12345/abcdef/cpuacct.usage

cat /sys/fs/cgroup/cpu/kubepods/burstable/podpod12345/abcdef/cpu.stat
nr_periods 123258
nr_throttled 698
throttled_time 18136472262

cgroupv2 example:

NOTE: The cpuacct subsystem was removed in cgroupv2; its information is now rolled up under cpu.stat (see e.g.: https://groups.google.com/g/linux.kernel/c/p9sBmjWmgxk)

cat /sys/fs/cgroup/system.slice/cpu.stat
usage_usec 173966996704  # replaces cpuacct.usage
user_usec 86543134849
system_usec 87423861854
core_sched.force_idle_usec 0
nr_periods 0
nr_throttled 0
throttled_usec 0  # replaces throttled_time
nr_bursts 0
burst_usec 0

/sys/fs/cgroup/system.slice/kubelet.service/cpu.stat
/sys/fs/cgroup/kubepods/cpu.stat
/sys/fs/cgroup/kubepods/burstable/cpu.stat
/sys/fs/cgroup/kubepods/burstable/podpod12345/cpu.stat

Note that CPU times are now in microseconds, not in nanoseconds. In order to match units, cgroupv2 values need to be multiplied by 1000 (as 1 microsecond = 1000 nanoseconds).

Summary of required changes

  • Path change from subsystems to unified tree
  • cpuacct is gone and merged into cpu.stat (single file to read)
  • usage_usec and throttled_usec must be parsed out of the new file to meet our metrics format (see the sketch after this list)
  • Accounting is updated to microseconds (usec) instead of nanoseconds
  • Other metrics can be used as is
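
As a concrete illustration of the last three points, here is a minimal sketch of reading cpu.stat on the unified hierarchy and converting the two microsecond counters back to nanoseconds; the type, field, and function names are illustrative rather than the existing node-sampler code.

package cgroups

import (
    "bufio"
    "os"
    "path/filepath"
    "strconv"
    "strings"
)

// CPUSampleV2 mirrors the cgroupv1 values the collector already reports.
type CPUSampleV2 struct {
    UsageNanosec     uint64 // usage_usec * 1000, replaces cpuacct.usage
    ThrottledNanosec uint64 // throttled_usec * 1000, replaces throttled_time
    NrPeriods        uint64
    NrThrottled      uint64
}

// readCPUSampleV2 parses cpu.stat for one cgroup directory on the unified
// hierarchy, e.g. /sys/fs/cgroup/kubepods/burstable/pod<uid>.
func readCPUSampleV2(dir string) (CPUSampleV2, error) {
    var c CPUSampleV2
    f, err := os.Open(filepath.Join(dir, "cpu.stat"))
    if err != nil {
        return c, err
    }
    defer f.Close()

    s := bufio.NewScanner(f)
    for s.Scan() {
        fields := strings.Fields(s.Text())
        if len(fields) != 2 {
            continue
        }
        v, err := strconv.ParseUint(fields[1], 10, 64)
        if err != nil {
            continue
        }
        switch fields[0] {
        case "usage_usec":
            c.UsageNanosec = v * 1000 // cgroupv2 reports usec; convert to nsec
        case "throttled_usec":
            c.ThrottledNanosec = v * 1000
        case "nr_periods":
            c.NrPeriods = v
        case "nr_throttled":
            c.NrThrottled = v
        }
    }
    return c, s.Err()
}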

Documentation

cgroupv1: https://www.kernel.org/doc/Documentation/cgroup-v1/cpuacct.txt
cgroupv2: https://docs.kernel.org/admin-guide/cgroup-v2.html#cpu-interface-files

Memory usage samples

cgroupv1 example:

cat /sys/fs/cgroup/memory/system.slice/kubelet.service/memory.stat
cache 17301504
rss 64081920
rss_huge 0
shmem 0
mapped_file 0
dirty 135168
writeback 1216512
swap 0
pgpgin 162855
pgpgout 142863
pgfault 364089
pgmajfault 0
inactive_anon 0
active_anon 64069632
inactive_file 7839744
active_file 9191424
unevictable 0
hierarchical_memory_limit 9223372036854771712
hierarchical_memsw_limit 9223372036854771712
total_cache 17301504
total_rss 64081920
total_rss_huge 0
total_shmem 0
total_mapped_file 0
total_dirty 135168
total_writeback 1216512
total_swap 0
total_pgpgin 162855
total_pgpgout 142863
total_pgfault 364089
total_pgmajfault 0
total_inactive_anon 0
total_active_anon 64069632
total_inactive_file 7839744
total_active_file 9191424
total_unevictable 0

/sys/fs/cgroup/memory/kubelet/kubepods/memory.stat
/sys/fs/cgroup/memory/kubelet/kubepods/guaranteed/memory.stat
/sys/fs/cgroup/memory/kubelet/kubepods/burstable/podpod12345/abcdef/memory.stat

cat /sys/fs/cgroup/memory/kubelet/kubepods/burstable/podpod12345/abcdef/memory.oom_control
oom_kill_disable 0
under_oom 0
oom_kill 0

cat /sys/fs/cgroup/memory/kubelet/kubepods/burstable/podpod12345/abcdef/memory.failcnt
0

cgroupv2 example:

There are many more changes in the cgroupv2 memory model and controllers than in cpu, which is mostly unchanged but reorganized.

cat /sys/fs/cgroup/system.slice/kubelet.service/memory.stat
anon 86716416  # equivalent to total_rss in v1
file 130658304
kernel 1736704
kernel_stack 245760
pagetables 589824
percpu 0
sock 0
vmalloc 0
shmem 114688
zswap 0
zswapped 0
file_mapped 0
file_dirty 0
file_writeback 0
swapcached 0
anon_thp 56623104
file_thp 0
shmem_thp 0
inactive_anon 98705408
active_anon 118784
inactive_file 43196416
active_file 87347200
unevictable 0
slab_reclaimable 551568
slab_unreclaimable 318888
slab 870456
workingset_refault_anon 0
workingset_refault_file 0
workingset_activate_anon 0
workingset_activate_file 0
workingset_restore_anon 0
workingset_restore_file 0
workingset_nodereclaim 0
pgscan 0
pgsteal 0
pgscan_kswapd 0
pgscan_direct 0
pgsteal_kswapd 0
pgsteal_direct 0
pgfault 1117892
pgmajfault 0
pgrefill 0
pgactivate 22366
pgdeactivate 0
pglazyfree 0
pglazyfreed 0
zswpin 0
zswpout 0
thp_fault_alloc 32
thp_collapse_alloc 10

cat /sys/fs/cgroup/system.slice/kubelet.service/memory.events
low 0
high 0
max 0
oom 0  # Replaces memory.failcnt
oom_kill 0  # same as before
oom_group_kill 0

cat /sys/fs/cgroup/system.slice/kubelet.service/memory.current 
241487872

oom_kill is unchanged but has moved, and total_rss is renamed to “anon”; however, total memory can no longer be calculated as total_rss + total_cache + swap. We will likely need to migrate to memory.current in some way under cgroupv2.

Summary of required changes

  • Path change from subsystems to unified tree
  • Memory stats are split across three new files (memory.current, memory.stat, memory.events)
  • Two stats (anon, oom_kill) could be substituted directly
  • Decision required: under cgroupv2 we can no longer split memory utilization as total_rss + total_cache = current. We could backfill this equivalently as total_cache = current - anon?
    • My plan: add a new total memory field, and ignore cache for cgroupv2 (and ignore total on v1); a sketch of this approach follows the list
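
A minimal sketch of what that plan could look like for one cgroup directory on the unified hierarchy: read memory.current for the new total field, anon from memory.stat (the total_rss equivalent), and oom / oom_kill from memory.events. The struct, field, and function names are illustrative, not the existing sampler types, and the derived-cache comment just restates the open question above.

package cgroups

import (
    "bufio"
    "os"
    "path/filepath"
    "strconv"
    "strings"
)

// MemorySampleV2 holds the cgroupv2 memory values discussed above.
type MemorySampleV2 struct {
    CurrentBytes uint64 // memory.current: total usage, replaces rss + cache (+ swap)
    AnonBytes    uint64 // memory.stat "anon": equivalent of v1 total_rss
    OOMEvents    uint64 // memory.events "oom": replaces memory.failcnt
    OOMKills     uint64 // memory.events "oom_kill": same meaning as in v1
}

// readKeyedFile parses flat "key value" files such as memory.stat and
// memory.events, filling in only the requested keys.
func readKeyedFile(path string, out map[string]*uint64) error {
    f, err := os.Open(path)
    if err != nil {
        return err
    }
    defer f.Close()
    s := bufio.NewScanner(f)
    for s.Scan() {
        fields := strings.Fields(s.Text())
        if len(fields) != 2 {
            continue
        }
        if dst, ok := out[fields[0]]; ok {
            if v, err := strconv.ParseUint(fields[1], 10, 64); err == nil {
                *dst = v
            }
        }
    }
    return s.Err()
}

// readMemorySampleV2 collects the memory sample for one cgroup directory,
// e.g. /sys/fs/cgroup/kubepods/burstable/pod<uid>.
func readMemorySampleV2(dir string) (MemorySampleV2, error) {
    var m MemorySampleV2

    raw, err := os.ReadFile(filepath.Join(dir, "memory.current"))
    if err != nil {
        return m, err
    }
    m.CurrentBytes, err = strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
    if err != nil {
        return m, err
    }

    if err := readKeyedFile(filepath.Join(dir, "memory.stat"),
        map[string]*uint64{"anon": &m.AnonBytes}); err != nil {
        return m, err
    }
    if err := readKeyedFile(filepath.Join(dir, "memory.events"),
        map[string]*uint64{"oom": &m.OOMEvents, "oom_kill": &m.OOMKills}); err != nil {
        return m, err
    }

    // Cache is not reported directly; if it turns out to be needed, it could
    // be approximated as CurrentBytes - AnonBytes per the open question above.
    return m, nil
}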

Sidebar on the new memory metrics/stats

How can we calculate total_rss and total_cache from the new cgroupv2 memory stats? (Why did we previously collect total_cache and not e.g. memory.usage_in_bytes?)

Kernel Documentation (cgroupv1):

5.2 stat file

memory.stat file includes following statistics

# per-memory cgroup local status
cache        - # of bytes of page cache memory.
rss          - # of bytes of anonymous and swap cache memory (includes
             transparent hugepages).

...

Brief summary of control files.
memory.failcnt             # show the number of memory usage hits limits

Kernel Documentation (cgroupv2): https://docs.kernel.org/admin-guide/cgroup-v2.html#memory

StackOverflow: Cgroup v2 memory.current is an equivalent of cgroup v1 memory.usage_in_bytes. ... Cgroup v2 has field memory.stat:anon which is an exact equivalent of v1 memory.stat:total_rss (i.e. includes all nested cgroups).

(That is to say, memory.current is equivalent but not exactly the same as usage_in_bytes, whereas anon is an exact equivalent of total_rss.)

Accounting weirdness

cat /sys/fs/cgroup/memory/memory.usage_in_bytes
8329428992
cat /sys/fs/cgroup/memory/memory.stat
...
total_cache 4346445824
total_rss 3862048768

total_cache + total_rss = 8208494592 < 8329428992 (a gap of roughly 120 MB). Why do these not match? The explanation in e.g. https://github.com/google/cadvisor/issues/638#issuecomment-160123132 does not appear to hold here (as swap = 0).

From cgroupv1 kernel documentation,

5.5 usage_in_bytes

For efficiency, as other kernel components, memory cgroup uses some optimization
to avoid unnecessary cacheline false sharing. usage_in_bytes is affected by the
method and doesn't show 'exact' value of memory (and swap) usage, it's a fuzz
value for efficient access. (Of course, when necessary, it's synchronized.)
If you want to know more exact memory usage, you should use RSS+CACHE(+SWAP)
value in memory.stat(see 5.2).

🤷 So usage_in_bytes is a little off and the sum total_rss + total_cache (+ swap) is the correct value.

For cgroupv2,

memory.current A read-only single value file which exists on non-root cgroups. The total amount of memory currently being used by the cgroup and its descendants.

The kernel documentation doesn’t have any caveats on the calculation, so this seems like the right way forward.

Note: containerd exposes a single “Memory.Usage” field, which is backed by memory.current on v2 (https://github.com/containerd/containerd/blob/7a804489fdd528cc052071ce47d0217f3c6bcea9/core/metrics/cgroups/v2/memory.go#L39, https://github.com/containerd/cgroups/blob/0c03de4a3d82a5f02f455ccc8174cb0dc9c2a532/cgroup2/manager.go#L629) and by usage_in_bytes on v1 (https://github.com/containerd/cgroups/blob/0c03de4a3d82a5f02f455ccc8174cb0dc9c2a532/cgroup1/memory.go#L288).

ehashman commented Aug 13 '24 21:08