amazon-cloudwatch-agent
cadvisor metrics collection breaks on hosts using cgroups v2
We are running Flatcar > 3033.x.x in our Kubernetes clusters and have undergone the cgroups v2 upgrade. After upgrading, Container Insights is no longer able to retrieve cadvisor metrics at cadvisor_linux.go#L95. This appears to be directly related to google/cadvisor issue #3026, "Missing process metrics in cgroup v2", which is fixed in v0.44.0 of cadvisor.
The cadvisor version used in Container Insights is quite old (v0.36.0), and there are several other cgroups v2-related PRs between these versions.
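In case it helps anyone reproducing this: a quick way to confirm a node is running the unified cgroup v2 hierarchy is to check for /sys/fs/cgroup/cgroup.controllers, which only exists under cgroup v2. This is just a minimal standalone Go sketch, not part of the agent:

package main

import (
	"fmt"
	"os"
)

// onCgroupV2 reports whether the host exposes the cgroup v2 unified hierarchy.
// cgroup v2 places cgroup.controllers at the root of /sys/fs/cgroup; cgroup v1 does not.
func onCgroupV2() bool {
	_, err := os.Stat("/sys/fs/cgroup/cgroup.controllers")
	return err == nil
}

func main() {
	if onCgroupV2() {
		fmt.Println("host is using cgroup v2")
	} else {
		fmt.Println("host is using cgroup v1 (or /sys/fs/cgroup is not mounted)")
	}
}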
v0.44.0 looks like it is still in pre-release (I interpret this as non-stable). I agree that we should fix this, but I'm not super confident that we should jump to v0.44.0 if they haven't deemed it stable enough to cut as a release yet.
@SaxyPandaBear, thanks for the quick reply. This is actively breaking our metrics collection, so I need to find a workaround. Building our own image is not out of the question. Has there been any work to upgrade cadvisor? I didn't see any branches that might be a good candidate to start with.
Upgrading it hasn't been on my radar, not sure about others. Are you suggesting that we cut a branch off of this repo with the experimental version of cadvisor? I think it'd be easier, if building an image on your own is viable, to fork the repo, and update cadvisor in your fork in the interim.
@SaxyPandaBear, I've forked the repository and upgraded cadvisor to 0.44.0. It was also necessary to make a small change in cadvisor/container_info_processor.go to remove the docker scope and allow lookup in the podKeys map.
After the upgrade, when running in debug mode, there is a new warning:
D! collect data from cadvisor...
W0411 22:57:44.134669 1 manager.go:159] Cannot detect current cgroup on cgroup v2
and there is also a substantial increase in metric merge conflict warnings:
D! metric being merged has conflict in fields
but other than those messages, it seems to be reporting accurate metrics.
BTW @kiyose we're looking to upgrade the underlying telegraf version since it's gotten stale, and are planning, AFAIK, to incorporate the bump to the cadvisor version in #437
As an update, we're prepping for a release of v1.247353, which bumps up our version of Golang used for building the agent as well as a bunch of overdue dependency updates, including this cadvisor version bump. See https://github.com/aws/amazon-cloudwatch-agent/blob/v1.247353.0/go.mod#L67
Targeting July 1 for release
@SaxyPandaBear It looks like there may still be an issue with the metrics being collected. I don't see the code that accounts for the pod info name (info.Name) lookup. See line 164 of my PR.
Thanks!
Ah didn't see this. Will take a look in the morning
@SaxyPandaBear, wanted to see if you have had a chance to take a look. Thanks!
I haven't looked into this, unfortunately. Is the pod info name the only part that needs to be addressed? I don't know if we have test coverage that accounts for the cgroups change, to be honest, so we might have to look into bolstering that too.
@SaxyPandaBear, I believe that's it. Without addressing this you aren't able to look up podKeys by info.Name:
func stripDockerScopeFromContainerInfoName(name string) string {
	// Hack to remove trailing extraneous information from InfoName and allow lookup by pod info.
	// Keys look like:
	//   /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1867bd7f_a98c_4a37_899f_857df482c984.slice/docker-d704a4d084df050c1f5f4f0763f55e8f75d85bf37525844b9234afcd86a1b252.scope
	// and everything from "/docker-" onward needs to be removed.
	return regexp.MustCompile(`/docker-.+$`).ReplaceAllString(name, "")
}

...

podKey, ok := podKeys[info.Name]
if !ok {
	// Try again after removing the trailing /docker-...scope suffix.
	podKey, ok = podKeys[stripDockerScopeFromContainerInfoName(info.Name)]
}
See these two sections of code in situ here: https://github.com/kiyose/amazon-cloudwatch-agent/blob/cadvisor-upgrade-0.44.0/plugins/inputs/cadvisor/container_info_processor.go#L164-L170
https://github.com/kiyose/amazon-cloudwatch-agent/blob/cadvisor-upgrade-0.44.0/plugins/inputs/cadvisor/container_info_processor.go#L183-L187
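On the test coverage point, something like the following table-driven test could cover the stripping logic. This is only a sketch against my fork's stripDockerScopeFromContainerInfoName; the package name is assumed from plugins/inputs/cadvisor:

package cadvisor

import "testing"

func TestStripDockerScopeFromContainerInfoName(t *testing.T) {
	cases := map[string]string{
		// cgroup v2 style key with a trailing docker scope: the suffix should be removed.
		"/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1867bd7f_a98c_4a37_899f_857df482c984.slice/docker-d704a4d084df050c1f5f4f0763f55e8f75d85bf37525844b9234afcd86a1b252.scope": "/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1867bd7f_a98c_4a37_899f_857df482c984.slice",
		// A key without a docker scope should pass through unchanged.
		"/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1867bd7f_a98c_4a37_899f_857df482c984.slice": "/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1867bd7f_a98c_4a37_899f_857df482c984.slice",
	}
	for in, want := range cases {
		if got := stripDockerScopeFromContainerInfoName(in); got != want {
			t.Errorf("stripDockerScopeFromContainerInfoName(%q) = %q, want %q", in, got, want)
		}
	}
}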
@SaxyPandaBear Bumping this to see if anyone has had a chance to review. This is blocking an upgrade for us.
Escalating internally to see how to prioritize this
@SaxyPandaBear We have updated to the latest version and we are no longer seeing the issue. I'm closing this issue.
Apologies that the fix you requested didn't get prioritized. I'm glad that the latest version resolves the issue though!