usage-metrics-collector icon indicating copy to clipboard operation
usage-metrics-collector copied to clipboard

[WIP] cgroupv2 support

Open ehashman opened this issue 1 year ago • 5 comments

Closes #138.

Goals:

  • don't get rid of cgroupv1 support
  • add support for cgroupv2 for all existing metrics pulled out of cgroups
  • update both the direct cgroup walking logic and the ctrstats code paths

Design:

Details can be found at https://github.com/kubernetes-sigs/usage-metrics-collector/issues/138#issuecomment-2287165401

Summary of changes:

  • cgroup tree is now unified, no need to specify separate paths for cpu, cpuacct, memory, etc.
  • CPU:
    • cpuacct is gone
    • All cpu metrics can be gathered from cpu.stat file
    • cgroupv2 cpu metrics are now in microseconds, not nanoseconds
  • Memory:
    • An accurate measure of current memory usage is available at memory.current (unlike memory.usage_in_bytes on v1 which was an estimate)
    • We can now get all OOM information out of memory.events
    • If we need RSS, we can use anon field out of memory.stat file
  • For UMC, we need to update both the cgroup-walking code (this PR) and the ctrstats code that uses the CRI library (possibly a follow-up PR)

ehashman avatar Aug 06 '24 20:08 ehashman

Skipping CI for Draft Pull Request. If you want CI signal for your change, please convert it to an actual PR. You can still manually trigger a test run with /test all

k8s-ci-robot avatar Aug 06 '24 20:08 k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ehashman

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Aug 06 '24 20:08 k8s-ci-robot

Taking this out of draft so CI runs.

I was able to get the basic cgroupv2 integration test (copied from the "basic" test) to mostly pass. There was a little bit of weirdness in the output values that I am not sure if I should just update the test case for or if there's an actual problem.

I noticed min.slice on cgroupv1 had a -1 value in the input samples. This did not work with cgroupv2 so I changed those results to 0s, but the test data isn't quite as expected. I'm not sure if there is a special meaning to the -1 CPU usage on cgroupv1.

Otherwise, this should not break any cgroupv1 functionality. I will work on ctrstats next but that should be more of a trivial change, so if folks want to take a look at this now, it should be ready for initial review.

I will also edit my first comment with the dictionary for updating v1 to v2.

ehashman avatar Aug 13 '24 21:08 ehashman

/retest-required

flake #126

ehashman avatar Aug 26 '24 19:08 ehashman

/triage accepted

ehashman avatar Aug 26 '24 20:08 ehashman

/retest-required

flake https://github.com/kubernetes-sigs/usage-metrics-collector/issues/126

ehashman avatar Sep 23 '24 20:09 ehashman

/retest-required

ehashman avatar Sep 24 '24 16:09 ehashman