
Container check: ignore aberrant values for `container.memory.rss`

Open L3n41c opened this issue 2 years ago • 0 comments

What does this PR do?

Ignore aberrant values (close to 18 EiB) for container.memory.rss.

Motivation

Under some circumstances, the memory.stat cgroup file reports an extremely high value for total_rss, for example:

/host/sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1cbe9fde_6ee9_445f_8f59_7298e66618f5.slice/cri-containerd-38fe097c37a933d8d2f5d3fe50e751bb64d3001e07217529349a9446a2944946.scope/memory.stat
cache 1622016
rss 1409024
rss_huge 0
shmem 0
mapped_file 159744
dirty 0
writeback 0
swap 0
pgpgin 231327950
pgpgout 231327210
pgfault 347344261
pgmajfault 75
inactive_anon 1404928
active_anon 4096
inactive_file 204800
active_file 1417216
unevictable 0
hierarchical_memory_limit 268435456
hierarchical_memsw_limit 9223372036854771712
total_cache 1622016
total_rss 18446744073706844160
total_rss_huge 0
total_shmem 0
total_mapped_file 159744
total_dirty 0
total_writeback 0
total_swap 0
total_pgpgin 231326246
total_pgpgout 231326507
total_pgfault 347341748
total_pgmajfault 75
total_inactive_anon 18446744073706860544
total_active_anon 0
total_inactive_file 204800
total_active_file 1417216
total_unevictable 0

In this case, the RSS value must be ignored to avoid triggering some “high memory” monitors.
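The check described above can be sketched in Go as follows. This is only an illustration and not the code of this PR: the names `aberrantThreshold`, `isAberrant`, `parseMemoryStat` and `totalRSS` are hypothetical, and the cutoff of 2^63 - 1 is an assumption. It relies on the observation that the reported total_rss, 18446744073706844160, equals 2^64 - 2707456, which looks like a small negative number stored in an unsigned 64-bit counter.

```go
package main

import (
	"bufio"
	"fmt"
	"math"
	"strconv"
	"strings"
)

// aberrantThreshold is a hypothetical cutoff (an assumption, not the
// agent's actual value): anything above 2^63 - 1 is treated as a wrapped
// negative counter rather than a real memory size.
const aberrantThreshold uint64 = math.MaxInt64

// isAberrant reports whether v is implausibly large (close to 18 EiB).
func isAberrant(v uint64) bool {
	return v > aberrantThreshold
}

// parseMemoryStat parses "key value" lines as found in a cgroup v1
// memory.stat file. Lines that do not match that shape are skipped.
func parseMemoryStat(content string) map[string]uint64 {
	stats := make(map[string]uint64)
	scanner := bufio.NewScanner(strings.NewReader(content))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) != 2 {
			continue
		}
		v, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			continue
		}
		stats[fields[0]] = v
	}
	return stats
}

// totalRSS returns the total_rss value and true when it is usable, or
// 0 and false when it is missing or aberrant, so the caller can skip
// the metric for this collection instead of reporting a spike.
func totalRSS(stats map[string]uint64) (uint64, bool) {
	v, found := stats["total_rss"]
	if !found || isAberrant(v) {
		return 0, false
	}
	return v, true
}

func main() {
	// Two lines taken from the memory.stat excerpt above.
	sample := "rss 1409024\ntotal_rss 18446744073706844160\n"
	stats := parseMemoryStat(sample)
	if rss, ok := totalRSS(stats); ok {
		fmt.Println("report container.memory.rss =", rss)
	} else {
		fmt.Printf("skip total_rss=%d (as int64: %d)\n",
			stats["total_rss"], int64(stats["total_rss"]))
	}
}
```

With this cutoff, legitimate large values such as hierarchical_memsw_limit (9223372036854771712, just below 2^63) are not flagged. A tighter or looser threshold is a design choice that depends on what the check considers a plausible container size.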

Additional Notes

  • DataDog/integrations-core#13076

Possible Drawbacks / Trade-offs

Describe how to test/QA your changes

Load the agent on a big cluster and look for spikes in max:container.memory.rss{$datacenter,$kube_cluster_name,$host} by {datacenter,kube_cluster_name,host}.fill(null). The metric should still be reported, but the spikes should disappear once the agent includes this change.

Reviewer's Checklist

  • [ ] If known, an appropriate milestone has been selected; otherwise the Triage milestone is set.
  • [ ] Use the major_change label if your change has a major impact on the code base, impacts multiple teams, or changes important, well-established internals of the Agent. This label will be used during QA to make sure each team pays extra attention to the changed behavior. For any customer-facing change, use a releasenote.
  • [ ] A release note has been added or the changelog/no-changelog label has been applied.
  • [ ] Changed code has automated tests for its functionality.
  • [ ] Adequate QA/testing plan information is provided if the qa/skip-qa label is not applied.
  • [ ] At least one team/.. label has been applied, indicating the team(s) that should QA this change.
  • [ ] If applicable, docs team has been notified or an issue has been opened on the documentation repo.
  • [ ] If applicable, the need-change/operator and need-change/helm labels have been applied.
  • [ ] If applicable, the k8s/<min-version> label, indicating the lowest Kubernetes version compatible with this feature, has been applied.
  • [ ] If applicable, the config template has been updated.

L3n41c, Oct 11 '22 09:10