
cadvisor metrics collection breaks on hosts using cgroups v2


We are running Flatcar > 3033.x.x in our Kubernetes clusters and have undergone the cgroups v2 upgrade. After upgrading, Container Insights is no longer able to retrieve cadvisor metrics at cadvisor_linux.go#L95. This appears to be directly related to google/cadvisor issue #3026, "Missing process metrics in cgroup v2", which is fixed in v0.44.0 of cadvisor.

The cadvisor version used in Container Insights is quite old at v0.36.0, and there are several other PRs related to cgroups v2 between these two versions.
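
For anyone else hitting this, a quick way to confirm that a node is actually on the unified cgroup v2 hierarchy is to check for the root cgroup.controllers file. Here's a minimal standalone sketch (not part of the agent code, just a sanity check) that does that:

package main

import (
	"fmt"
	"os"
)

// On a cgroup v2 (unified) host, /sys/fs/cgroup/cgroup.controllers exists at
// the root of the mount; on cgroup v1 or hybrid hosts it does not.
func main() {
	if _, err := os.Stat("/sys/fs/cgroup/cgroup.controllers"); err == nil {
		fmt.Println("cgroup v2 (unified hierarchy)")
	} else {
		fmt.Println("cgroup v1 or hybrid hierarchy")
	}
}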

kiyose avatar Apr 04 '22 18:04 kiyose

v0.44.0 looks like it is still in pre-release (I interpret this as non-stable). I agree that we should fix this, but I'm not super confident that we should jump to v0.44.0 if they haven't deemed it stable enough to cut as a release yet.

SaxyPandaBear avatar Apr 04 '22 19:04 SaxyPandaBear

@SaxyPandaBear, thanks for the quick reply. This is actively breaking our metrics collection, so I need to find a workaround. Building our own image is not out of the question. Has there been any work to upgrade cadvisor? I didn't see any branches that might be a good candidate to start with.

kiyose avatar Apr 04 '22 20:04 kiyose

Upgrading it hasn't been on my radar; I'm not sure about others. Are you suggesting that we cut a branch off of this repo with the experimental version of cadvisor? I think it'd be easier, if building an image on your own is viable, to fork the repo and update cadvisor in your fork in the interim.

SaxyPandaBear avatar Apr 04 '22 23:04 SaxyPandaBear

@SaxyPandaBear, I've forked the repository and upgraded cadvisor to v0.44.0. It was also necessary to make a small change in cadvisor/container_info_processor.go to remove the docker scope and allow lookup in the podKeys map.

After the upgrade, when running in debug mode, there is a new warning:

D! collect data from cadvisor...
W0411 22:57:44.134669       1 manager.go:159] Cannot detect current cgroup on cgroup v2

and there is also a substantial increase in metric merge conflicts:

D! metric being merged has conflict in fields

but other than those messages, it seems to be reporting accurate metrics.

You can see the changes in my fork here

kiyose avatar Apr 12 '22 21:04 kiyose

BTW @kiyose, we're looking to upgrade the underlying Telegraf version since it's gotten stale, and AFAIK we're planning to incorporate the cadvisor version bump in #437.

SaxyPandaBear avatar May 23 '22 02:05 SaxyPandaBear

As an update, we're prepping for the v1.247353 release, which bumps the Go version used to build the agent and pulls in a number of overdue dependency updates, including this cadvisor version bump. See https://github.com/aws/amazon-cloudwatch-agent/blob/v1.247353.0/go.mod#L67

Targeting July 1 for release

SaxyPandaBear avatar Jun 23 '22 02:06 SaxyPandaBear

@SaxyPandaBear It looks like there may still be an issue with the metrics being collected. I don't see the code that accounts for the pod info.Name lookup. See line 164 of my PR.

Thanks!

kiyose avatar Jun 27 '22 13:06 kiyose

Ah didn't see this. Will take a look in the morning

SaxyPandaBear avatar Jul 14 '22 01:07 SaxyPandaBear

@SaxyPandaBear, wanted to see if you have had a chance to take a look. Thanks!

kiyose avatar Aug 22 '22 18:08 kiyose

I haven't looked into this, unfortunately. Is the pod info.Name lookup the only part that needs to be addressed? I don't know if we have test coverage that accounts for the cgroups change, to be honest, so we might have to look into bolstering that too.

SaxyPandaBear avatar Aug 22 '22 18:08 SaxyPandaBear

@SaxyPandaBear, I believe that's it. Without addressing this, you aren't able to look up podKeys by info.Name:

func stripDockerScopeFromContainerInfoName(name string) string {
	// Hack to remove trailing extraneous information from InfoName and allow lookup by pod info.
	// Keys look like:
	//   /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1867bd7f_a98c_4a37_899f_857df482c984.slice/docker-d704a4d084df050c1f5f4f0763f55e8f75d85bf37525844b9234afcd86a1b252.scope
	// everything following the /docker needs to be removed.
	return regexp.MustCompile(`/docker-.+$`).ReplaceAllString(name, "")
}
...
	podKey, ok := podKeys[info.Name]
	if !ok {
		// Try again after removing trailing /docker...scope$
		podKey, ok = podKeys[stripDockerScopeFromContainerInfoName(info.Name)]
	}

See these two sections of code in situ here: https://github.com/kiyose/amazon-cloudwatch-agent/blob/cadvisor-upgrade-0.44.0/plugins/inputs/cadvisor/container_info_processor.go#L164-L170

https://github.com/kiyose/amazon-cloudwatch-agent/blob/cadvisor-upgrade-0.44.0/plugins/inputs/cadvisor/container_info_processor.go#L183-L187
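
If it helps with the test coverage question above, a minimal table-driven test for the trimming helper could look something like this (assuming it sits in the same package as container_info_processor.go in my fork; the cases below are just illustrative):

package cadvisor

import "testing"

// Sketch of a table-driven test for stripDockerScopeFromContainerInfoName.
func TestStripDockerScopeFromContainerInfoName(t *testing.T) {
	cases := []struct {
		desc string
		in   string
		want string
	}{
		{
			desc: "container key with a trailing docker scope is trimmed to the pod key",
			in:   "/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1867bd7f_a98c_4a37_899f_857df482c984.slice/docker-d704a4d084df050c1f5f4f0763f55e8f75d85bf37525844b9234afcd86a1b252.scope",
			want: "/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1867bd7f_a98c_4a37_899f_857df482c984.slice",
		},
		{
			desc: "pod-level key without a docker scope is unchanged",
			in:   "/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1867bd7f_a98c_4a37_899f_857df482c984.slice",
			want: "/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1867bd7f_a98c_4a37_899f_857df482c984.slice",
		},
	}
	for _, tc := range cases {
		if got := stripDockerScopeFromContainerInfoName(tc.in); got != tc.want {
			t.Errorf("%s: got %q, want %q", tc.desc, got, tc.want)
		}
	}
}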

kiyose avatar Sep 27 '22 18:09 kiyose

@SaxyPandaBear Bumping this to see if anyone has had a chance to review. This is blocking an upgrade for us.

kiyose avatar Feb 14 '23 23:02 kiyose

Escalating internally to see how to prioritize this

SaxyPandaBear avatar Feb 15 '23 16:02 SaxyPandaBear

@SaxyPandaBear We have updated to the latest version and we are no longer seeing the issue. I'm closing this issue.

kiyose avatar Mar 23 '23 22:03 kiyose

Apologies that the fix you requested didn't get prioritized. I'm glad that the latest version resolves the issue though!

SaxyPandaBear avatar Mar 24 '23 03:03 SaxyPandaBear