usage-metrics-collector
usage-metrics-collector copied to clipboard
[WIP] cgroupv2 support
Closes #138.
Goals:
- don't get rid of cgroupv1 support
- add support for cgroupv2 for all existing metrics pulled out of cgroups
- update both the direct cgroup walking logic and the ctrstats code paths
Design:
Details can be found at https://github.com/kubernetes-sigs/usage-metrics-collector/issues/138#issuecomment-2287165401
Summary of changes:
- cgroup tree is now unified, no need to specify separate paths for cpu, cpuacct, memory, etc.
- CPU:
- cpuacct is gone
- All cpu metrics can be gathered from
cpu.statfile - cgroupv2 cpu metrics are now in microseconds, not nanoseconds
- Memory:
- An accurate measure of current memory usage is available at
memory.current(unlikememory.usage_in_byteson v1 which was an estimate) - We can now get all OOM information out of
memory.events - If we need RSS, we can use
anonfield out ofmemory.statfile
- An accurate measure of current memory usage is available at
- For UMC, we need to update both the cgroup-walking code (this PR) and the ctrstats code that uses the CRI library (possibly a follow-up PR)
Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: ehashman
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [ehashman]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
Taking this out of draft so CI runs.
I was able to get the basic cgroupv2 integration test (copied from the "basic" test) to mostly pass. There was a little bit of weirdness in the output values that I am not sure if I should just update the test case for or if there's an actual problem.
I noticed min.slice on cgroupv1 had a -1 value in the input samples. This did not work with cgroupv2 so I changed those results to 0s, but the test data isn't quite as expected. I'm not sure if there is a special meaning to the -1 CPU usage on cgroupv1.
Otherwise, this should not break any cgroupv1 functionality. I will work on ctrstats next but that should be more of a trivial change, so if folks want to take a look at this now, it should be ready for initial review.
I will also edit my first comment with the dictionary for updating v1 to v2.
/retest-required
flake #126
/triage accepted
/retest-required
flake https://github.com/kubernetes-sigs/usage-metrics-collector/issues/126
/retest-required