datadog-agent
Container check: ignore aberrant values for `container.memory.rss`
What does this PR do?
Ignore aberrant values (close to 2^64 bytes, i.e. 16 EiB or roughly 18.4 EB) for `container.memory.rss`.
Motivation
Under some circumstances, the `memory.stat` cgroup file reports an extremely high value for `total_rss`, for example:
/host/sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1cbe9fde_6ee9_445f_8f59_7298e66618f5.slice/cri-containerd-38fe097c37a933d8d2f5d3fe50e751bb64d3001e07217529349a9446a2944946.scope/memory.stat
cache 1622016
rss 1409024
rss_huge 0
shmem 0
mapped_file 159744
dirty 0
writeback 0
swap 0
pgpgin 231327950
pgpgout 231327210
pgfault 347344261
pgmajfault 75
inactive_anon 1404928
active_anon 4096
inactive_file 204800
active_file 1417216
unevictable 0
hierarchical_memory_limit 268435456
hierarchical_memsw_limit 9223372036854771712
total_cache 1622016
total_rss 18446744073706844160
total_rss_huge 0
total_shmem 0
total_mapped_file 159744
total_dirty 0
total_writeback 0
total_swap 0
total_pgpgin 231326246
total_pgpgout 231326507
total_pgfault 347341748
total_pgmajfault 75
total_inactive_anon 18446744073706860544
total_active_anon 0
total_inactive_file 204800
total_active_file 1417216
total_unevictable 0
In this case, the RSS value must be ignored to avoid triggering some “high memory” monitors.
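The `total_rss` value in the sample above is exactly 2^64 - 2707456, which is what a small negative counter looks like when printed as an unsigned 64-bit integer. That reading is an interpretation of the numbers, not something this PR states. The sketch below illustrates the general idea of dropping such values before they are reported. The names `parseMemoryStat`, `plausibleRSS` and the `maxPlausibleBytes` cutoff are hypothetical and chosen for illustration only; they are not the agent's actual functions or threshold.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// maxPlausibleBytes is a hypothetical cutoff (2^62 bytes = 4 EiB), far above
// any real container RSS. It is an assumption for this sketch, not the
// threshold used by the agent.
const maxPlausibleBytes uint64 = 1 << 62

// parseMemoryStat parses the "key value" lines of a cgroup v1 memory.stat file.
func parseMemoryStat(content string) (map[string]uint64, error) {
	stats := make(map[string]uint64)
	scanner := bufio.NewScanner(strings.NewReader(content))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) != 2 {
			continue // skip blank or malformed lines
		}
		v, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return nil, fmt.Errorf("parsing %q: %w", fields[0], err)
		}
		stats[fields[0]] = v
	}
	return stats, scanner.Err()
}

// plausibleRSS returns total_rss and true when the value is worth reporting,
// or 0 and false when it is missing or above the cutoff.
func plausibleRSS(stats map[string]uint64) (uint64, bool) {
	rss, ok := stats["total_rss"]
	if !ok || rss > maxPlausibleBytes {
		return 0, false
	}
	return rss, true
}

func main() {
	// Two lines taken from the memory.stat sample above.
	sample := "total_cache 1622016\ntotal_rss 18446744073706844160\n"
	stats, err := parseMemoryStat(sample)
	if err != nil {
		panic(err)
	}
	if rss, ok := plausibleRSS(stats); ok {
		fmt.Println("report container.memory.rss =", rss)
	} else {
		// Reinterpreting the bits as a signed integer exposes the wraparound.
		fmt.Println("skipping total_rss; as int64:", int64(stats["total_rss"]))
	}
}
```

With the sample input, the program takes the skip branch and prints the signed reinterpretation `-2707456`, which is 661 pages of 4096 bytes below zero. Skipping the sample entirely, rather than clamping it to zero, keeps the metric from showing either a false spike or a false drop.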
Additional Notes
- DataDog/integrations-core#13076
Possible Drawbacks / Trade-offs
Describe how to test/QA your changes
Deploy the agent on a large cluster and look for spikes in `max:container.memory.rss{$datacenter,$kube_cluster_name,$host} by {datacenter,kube_cluster_name,host}.fill(null)`.
The metric should still be reported, but the spikes should disappear on agents running this change.
Reviewer's Checklist
- [ ] If known, an appropriate milestone has been selected; otherwise the `Triage` milestone is set.
- [ ] Use the `major_change` label if your change either has a major impact on the code base, is impacting multiple teams or is changing important well-established internals of the Agent. This label will be used during QA to make sure each team pays extra attention to the changed behavior. For any customer-facing change use a releasenote.
- [ ] A release note has been added or the `changelog/no-changelog` label has been applied.
- [ ] Changed code has automated tests for its functionality.
- [ ] Adequate QA/testing plan information is provided if the `qa/skip-qa` label is not applied.
- [ ] At least one `team/..` label has been applied, indicating the team(s) that should QA this change.
- [ ] If applicable, docs team has been notified or an issue has been opened on the documentation repo.
- [ ] If applicable, the `need-change/operator` and `need-change/helm` labels have been applied.
- [ ] If applicable, the `k8s/<min-version>` label, indicating the lowest Kubernetes version compatible with this feature.
- [ ] If applicable, the config template has been updated.