
Normalize metrics reported based on cgroup in all APM agents

Open gregkalapos opened this issue 1 year ago • 0 comments

Description

Problem

Currently all our APM Agents report memory and CPU usage in some way. When the application runs directly on a physical or virtual machine, the reported metrics are typically correct.

However, our metrics story is much less clear when the monitored application runs within a container and we report memory/CPU usage based on cgroups.

Specific issues:

  1. Currently most agents send memory metrics based on cgroups (system.process.cgroup.memory.mem.limit.bytes and system.process.cgroup.memory.mem.usage.bytes), but not all of them send CPU-related metrics based on cgroups (e.g. Java does - it uses an API that takes container CPU limits into consideration). This causes two problems: 1) our charts can be inconsistent: we may chart memory usage from the container's point of view but CPU usage from the host's point of view, and 2) APM Agents may report different CPU usage than other components - e.g. we know that Metricbeat and some APM Agents report different CPU usage in Kubernetes. Here is a list of Kubernetes related CPU usage metrics.
  2. The UI team currently does a fairly complex calculation to derive total memory usage - the complexity is again caused by cgroup-based memory metrics. The key issue is in percentCgroupMemoryUsedScript: system.process.cgroup.memory.mem.limit.bytes may not be set, which means a pod can take up all the memory of the host. In this case agents typically send the "magic value" 9223372036854771712L, which is then "fixed" by the UI. While this works for our UI, it is almost impossible for a user to recreate a correct memory usage graph when an APM Agent sends cgroup-based memory metrics (see the sketch after this list for the kind of special-casing required).
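To make the last point concrete, below is a minimal sketch (in Go, not the actual Kibana percentCgroupMemoryUsedScript) of the special-casing a consumer of these metrics has to perform today; the function name and the fallback to host-level metrics are illustrative assumptions.

```go
package main

import "fmt"

// The cgroup v1 "no limit" sentinel that agents currently forward verbatim.
const cgroupUnlimited = int64(9223372036854771712)

// cgroupMemoryUsedPct illustrates the kind of logic hidden in
// percentCgroupMemoryUsedScript: if the cgroup limit is missing or is the
// "unlimited" sentinel, fall back to host memory; otherwise use usage/limit.
func cgroupMemoryUsedPct(cgroupUsage, cgroupLimit, systemTotal, systemFree int64) float64 {
	if cgroupLimit <= 0 || cgroupLimit >= cgroupUnlimited {
		// No real limit reported: derive the percentage from host-level metrics.
		return float64(systemTotal-systemFree) / float64(systemTotal)
	}
	return float64(cgroupUsage) / float64(cgroupLimit)
}

func main() {
	// Unlimited pod: falls back to host memory (2 GiB used of 8 GiB = 0.25).
	fmt.Printf("%.2f\n", cgroupMemoryUsedPct(512<<20, cgroupUnlimited, 8<<30, 6<<30))
	// Pod with a 1 GiB limit: 512 MiB / 1 GiB = 0.50.
	fmt.Printf("%.2f\n", cgroupMemoryUsedPct(512<<20, 1<<30, 8<<30, 6<<30))
}
```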

Solution

  1. For issue 1 above, each APM Agent must make sure that if the application runs within a container, the reported CPU usage is from the pod's point of view. So system.cpu.total.norm.pct and system.process.cpu.total.norm.pct should be aligned with the reported cgroup memory metrics (see the CPU sketch after this list). Some agents may already implement this, in which case no action is needed.
  2. For issue 2 above, decrease the complexity of the UI code by sending a "real" system.process.cgroup.memory.mem.limit.bytes. This means that if no memory limit is set for the pod, the agent won't send the "magic" 9223372036854771712L value; instead, it'll send system.memory.total in system.process.cgroup.memory.mem.limit.bytes. The maximum memory a pod can take is then the same as the overall system memory (see the memory sketch after this list). We also need to update the agent spec to describe this behaviour.
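For issue 1, here is a minimal sketch of how an agent could derive the CPU capacity actually available to the container, assuming cgroup v2 and its /sys/fs/cgroup/cpu.max file; the helper name and the fallback to the host CPU count are illustrative assumptions, not an existing agent API.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// allowedCPUs returns the number of CPUs the cgroup may use. With cgroup v2,
// /sys/fs/cgroup/cpu.max contains "<quota> <period>" (both in microseconds),
// or "max <period>" when no CPU limit is set; with no limit we fall back to
// the host CPU count. Normalizing CPU usage against this value instead of the
// host CPU count yields pod-level system.process.cpu.total.norm.pct.
func allowedCPUs() float64 {
	raw, err := os.ReadFile("/sys/fs/cgroup/cpu.max")
	if err != nil {
		return float64(runtime.NumCPU())
	}
	fields := strings.Fields(string(raw))
	if len(fields) != 2 || fields[0] == "max" {
		return float64(runtime.NumCPU())
	}
	quota, err1 := strconv.ParseFloat(fields[0], 64)
	period, err2 := strconv.ParseFloat(fields[1], 64)
	if err1 != nil || err2 != nil || period == 0 {
		return float64(runtime.NumCPU())
	}
	return quota / period
}

func main() {
	// e.g. a pod limited to 500m CPU prints 0.5; an unlimited pod prints the host count.
	fmt.Println("CPUs available to this cgroup:", allowedCPUs())
}
```

For issue 2, here is a minimal sketch of the proposed limit-reporting behaviour, assuming cgroup v2 (memory.max) with a cgroup v1 fallback (memory.limit_in_bytes); the file paths, the /proc/meminfo parsing, and the helper names are illustrative assumptions.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// The cgroup v1 "no limit" sentinel that agents should no longer forward.
const cgroupV1Unlimited = int64(9223372036854771712)

// systemMemoryTotal reads MemTotal from /proc/meminfo (reported in kB).
func systemMemoryTotal() (int64, error) {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		return 0, err
	}
	defer f.Close()
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) >= 2 && fields[0] == "MemTotal:" {
			kb, err := strconv.ParseInt(fields[1], 10, 64)
			if err != nil {
				return 0, err
			}
			return kb * 1024, nil
		}
	}
	return 0, fmt.Errorf("MemTotal not found")
}

// cgroupMemoryLimit returns the value to report in
// system.process.cgroup.memory.mem.limit.bytes: the real cgroup limit if one
// is set, otherwise total system memory instead of the "unlimited" sentinel.
func cgroupMemoryLimit() (int64, error) {
	// cgroup v2: the literal string "max" means no limit.
	if raw, err := os.ReadFile("/sys/fs/cgroup/memory.max"); err == nil {
		v := strings.TrimSpace(string(raw))
		if v != "max" {
			if limit, err := strconv.ParseInt(v, 10, 64); err == nil {
				return limit, nil
			}
		}
		return systemMemoryTotal()
	}
	// cgroup v1: the sentinel value means no limit.
	if raw, err := os.ReadFile("/sys/fs/cgroup/memory/memory.limit_in_bytes"); err == nil {
		if limit, err := strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64); err == nil && limit < cgroupV1Unlimited {
			return limit, nil
		}
		return systemMemoryTotal()
	}
	// cgroup files unavailable: fall back to system memory.
	return systemMemoryTotal()
}

func main() {
	limit, err := cgroupMemoryLimit()
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Println("effective memory limit (bytes):", limit)
}
```

With this behaviour, system.process.cgroup.memory.mem.limit.bytes always carries a usable number, so a memory usage percentage can be computed as a simple usage/limit ratio without special-casing the sentinel value.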

Spec Issue

  • [ ] https://github.com/elastic/apm/issues/815

Agent Issues

  • [ ] https://github.com/elastic/apm-agent-java/issues/3213
  • [ ] https://github.com/elastic/apm-agent-dotnet/issues/2125
  • [ ] https://github.com/elastic/apm-agent-nodejs/issues/3452
  • [ ] https://github.com/elastic/apm-agent-python/issues/1866
  • [ ] https://github.com/elastic/apm-agent-go/issues/1479
  • [ ] https://github.com/elastic/apm-agent-php/issues/1012

gregkalapos · Jun 29 '23 14:06