logstash PoC to add cgroup v2 support

Release notes

[rn:skip]

What does this PR do?

Very rudimentary implementation to parse cgroup v2 information. This PR code should just be a PoC. It shows, that - at least - the CPU information of a process running in cgroup v2 can be successfully retrieved and returned by logstash (and fix issue https://github.com/elastic/logstash/issues/14534)

Why is it important/What is the impact to the user?

When running logstash on a VM or container, that is already using the cgroup v2, no metrics will be exposed by logstash and are therefore missing within Stack Monitoring

Checklist

Nothing got checked on my side

[ ] My code follows the style guidelines of this project
[ ] I have commented my code, particularly in hard-to-understand areas
[ ] I have made corresponding changes to the documentation
[ ] I have made corresponding change to the default configuration files (and/or docker env variables)
[ ] I have added tests that prove my fix is effective or that my feature works

Author's Checklist

How to test this PR locally

create a VM that uses cgroup2fs. You can verify this by running stat -fc %T /sys/fs/cgroup. When running cgroup v2, the result should be cgroup2fs. See example

logstash@in-beats-1:/opt$ stat -fc %T /sys/fs/cgroup/
cgroup2fs
logstash@in-beats-1:/opt$

run logstash and check the API via curl localhost:9600/_node/stats?pretty . Results should contain an os object looking somehow like this

{
  "host" : "in-beats-1",
  "version" : "8.5.0",
  "http_address" : "0.0.0.0:9600",
  "id" : "02e938c4-9a6a-4e38-8023-ca54b86d685c",
  "name" : "in-beats-1",
  "ephemeral_id" : "5343f7f9-093c-4031-aaa3-e24a6312588a",
  "status" : "green",
  "snapshot" : true,
  "pipeline" : {
    "workers" : 4,
    "batch_size" : 125,
    "batch_delay" : 50
  },
  "monitoring" : {
    "cluster_uuid" : "some_cluster_id"
  },
  "jvm" : {
    "threads" : {
      "count" : 206,
      "peak_count" : 208
    },
...
...
  "os" : {
    "cgroup" : {
      "cpuacct" : {
        "control_group" : "/",
        "usage_nanos" : 895076842
      },
      "cpu" : {
        "control_group" : "/",
        "cfs_period_micros" : 100000,
        "cfs_quota_micros" : 400000,
        "stat" : {
          "number_of_times_throttled" : 142,
          "number_of_elapsed_periods" : 58122,
          "time_throttled_nanos" : 65119181
        }
      }
    }
  },
  "queue" : {
    "events_count" : 0
  }
}

Related issues

Relates #14534

Use cases

Screenshots

And here from the running pod

"os" : {
    "cgroup" : {
      "cpu" : {
        "stat" : {
          "time_throttled_nanos" : 70506861,
          "number_of_elapsed_periods" : 63733,
          "number_of_times_throttled" : 145
        },
        "cfs_quota_micros" : 400000,
        "control_group" : "/",
        "cfs_period_micros" : 100000
      },
      "cpuacct" : {
        "control_group" : "/",
        "usage_nanos" : 908588206
      }
    }
  }

Note: Still something is missing

Still, the Stack Monitoring does not fully work. It looks like the cfs values are exported by logstash, but they are not picked up by metricbeat (the node and node_stats code of logstash module do not contain a JSON structure to use - at least - the cfs_quota_micros values returned (see also https://github.com/elastic/beats/blob/main/metricbeat/module/logstash/node_stats/data.go#L79)

Logs

Sep 19 '22 10:09 rpasche

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

Sep 19 '22 10:09 logstashmachine

Sorry....some missing infos:

Stack Monitoring (elasticsearch, kibana, metricbeat): 8.4.1 version
logstash: latest code of main branch + PoC code

Sep 19 '22 10:09 rpasche

This screenshot was taken from the .monitoring-logstash-8-mb data-stream

which has dynmatic set to false and no mapping for cfs fields exist

Sep 19 '22 14:09 rpasche

logstash logstash copied to clipboard

PoC to add cgroup v2 support

Release notes

What does this PR do?

Why is it important/What is the impact to the user?

Checklist

Author's Checklist

How to test this PR locally

Related issues

Use cases

Screenshots

Note: Still something is missing

Logs

logstash
logstash copied to clipboard