telegraf icon indicating copy to clipboard operation
telegraf copied to clipboard

inputs.docker mem usage stats are computed wrong for cgroup v1

Open fmorillo opened this issue 3 years ago • 2 comments

Relevant telegraf.conf

[[inputs.docker]]                         
  endpoint = "unix:///var/run/docker.sock"

Logs from Telegraf

-

System info

Telegraf 1.23.3 , Ubuntu 20.04.4, Docker 20.10.17

Docker

No response

Steps to reproduce

Just get the mem usage stats from a docker host with cgroups v1.

Expected behavior

Compute mem usage like the docker cli. https://github.com/docker/cli/blob/5c511f4f856b11311557888bbb6a7ada218fef71/cli/command/container/stats_helpers.go#L239

Actual behavior

The mem cache is substracted from the mem usage, which is wrong. https://github.com/influxdata/telegraf/blob/272a0a335500070f764c733c281931f082f91f93/plugins/inputs/docker/stats_helpers.go#L54

Additional info

That mem chache is substracted from the mem usage was the default behaviour of docker cli 19.03 and before. So in the calculateMemUsageUnixNoCache function the version is determined by checking the existence of the cache field. https://github.com/influxdata/telegraf/blob/272a0a335500070f764c733c281931f082f91f93/plugins/inputs/docker/stats_helpers.go#L53 But a cache field is part of cgroups version 1 and does not say anything about the docker version. Also that compution was wrong and therefore fixed. So there is no reason to restore that behavior, even for older hosts.

fmorillo avatar Aug 02 '22 14:08 fmorillo

The change occurred in https://github.com/influxdata/telegraf/pull/10491

@101100 Thoughts on removing vs keeping the check for the cache field? It does look like upstream does not have this same check.

powersj avatar Aug 02 '22 15:08 powersj

@powersj The upstream doesn't have the check because it knows it is the latest Docker version. 19.03 is pretty old, though, so eliminating the code for that version would be OK and wouldn't affect me.

Another option would be to just prioritize the latest version and fallback on cache. It might be that total_inactive_file and inactive_file aren't in the old version, so you could just try the fields for the latest version and fallback on cache by inverting the order of the if statements:

func CalculateMemUsageUnixNoCache(mem types.MemoryStats) float64 {
	// Latest Docker and cgroup v1
	if v, isCgroup1 := mem.Stats["total_inactive_file"]; isCgroup1 && v < mem.Usage {
		return float64(mem.Usage - v)
	}
	// Latest Docker and cgroup v2
	if v := mem.Stats["inactive_file"]; v < mem.Usage {
		return float64(mem.Usage - v)
	}
	// Docker 19.03 and older fallback
	if v, isOldDocker := mem.Stats["cache"]; isOldDocker && v < mem.Usage {
		return float64(mem.Usage - v)
	}
	return float64(mem.Usage)
}

101100 avatar Aug 08 '22 17:08 101100

@fmorillo sorry for the delay on this one. Would you be willing to try the artifacts on PR #12616. There should be artifacts attached in 20-30mins that you can use.

Thanks!

powersj avatar Feb 03 '23 20:02 powersj

Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem, if not please try posting this question in our Community Slack or Community Page. Thank you!

telegraf-tiger[bot] avatar Feb 18 '23 18:02 telegraf-tiger[bot]