
Explanation of ecs_memory_limit_bytes

Open jseiser opened this issue 3 years ago • 4 comments

Can anyone explain what this metric is actually returning the bytes of?

# HELP ecs_memory_limit_bytes Memory limit in bytes.
# TYPE ecs_memory_limit_bytes gauge
ecs_memory_limit_bytes{container="heartbeat"} 9.223372036854772e+18
ecs_memory_limit_bytes{container="log_router"} 9.223372036854772e+18
ecs_memory_limit_bytes{container="prom_exporter"} 9.223372036854772e+18

Prometheus reads that as 9223372036854772000 bytes, which is about 9,223,372,036 gigabytes.

The task definition itself is defined with

memory                   = "1024"

And the containers inside the task definition are configured with

prom_exporter memoryReservation: 100
heartbeat memoryReservation: 256
log_router memoryReservation: 100

If I grab a performance-metric log from ECS for the heartbeat container:

ContainerName | heartbeat
CpuReserved | 0.0
CpuUtilized | 0.8063517761230469
MemoryReserved | 256
MemoryUtilized | 61

It would be great if anyone could help me understand what exactly I am looking at here. I really just want to be able to track the memory usage of my Fargate containers.

jseiser avatar Jul 22 '22 16:07 jseiser

OK, bumping up to the main container, I can get a total memory used that makes more sense.

sum by (ecs_task_id, container) (ecs_memory_bytes{} + ecs_memory_cache_usage{})

But I'm still not sure what ecs_memory_limit_bytes represents or how it is derived. Basically, I can tell the memory usage + cache for each container, but have no way of saying it's using X% of its reservation, or X% of the total allocated memory.

I would expect to be able to get either MemoryReserved or the total memory available to the container, so you could determine whether the container needs more or less memory allocated.
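(Editor's note: as a worked example of the "X% of its reservation" figure asked for above, using the performance-log values quoted for the heartbeat container:)

```python
# Values from the ECS performance-metric log quoted above for "heartbeat"
memory_utilized_mib = 61
memory_reserved_mib = 256

# Reservation utilization as a percentage
pct = 100 * memory_utilized_mib / memory_reserved_mib
print(f"heartbeat is using {pct:.1f}% of its reservation")
# heartbeat is using 23.8% of its reservation
```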

jseiser avatar Jul 22 '22 17:07 jseiser

So I assume it's coming from here:

https://github.com/moby/moby/blob/v20.10.17/api/types/stats.go#L59

	// number of times memory usage hits limits.
	Failcnt uint64 `json:"failcnt,omitempty"`
	Limit   uint64 `json:"limit,omitempty"`

This can't really be the memory limit though; the number is way too high, since we have already limited the entire task to 1 GB.

jseiser avatar Aug 03 '22 13:08 jseiser

OK, I hit the same issue. The metadata v4 API that ecs-exporter scrapes is documented here: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-metadata-endpoint-v4.html

This path returns Docker stats for the specific container. For more information about each of the returned stats, see ContainerStats in the Docker API documentation.

ContainerStats come from https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt

So the limit is the limit of a cgroup, and cgroups can be nested inside other cgroups. If your task has a memory limit but your container does not, the container (sub-)cgroup effectively has no limit (in my case it's set to 8 EiB) but is still constrained by the limit of its parent, the task cgroup. That limit isn't exposed in the tasks API though, since Docker itself doesn't expose it: Docker doesn't deal with nested cgroups. Fortunately there is a hierarchical_memory_limit field, which should give us what we want, and it should be easy to add it to the exporter.
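(Editor's note: as a sanity check on that 8 EiB value — cgroup v1 reports an unset memory.limit_in_bytes as LONG_MAX rounded down to a page boundary, which, assuming the usual 4 KiB page size, is exactly the figure the metric renders:)

```python
PAGE_SIZE = 4096  # assuming 4 KiB pages
LONG_MAX = 2**63 - 1

# cgroup v1's "no limit" sentinel: LONG_MAX rounded down to the page size
no_limit = LONG_MAX & ~(PAGE_SIZE - 1)
print(no_limit)         # 9223372036854771712
print(float(no_limit))  # 9.223372036854772e+18  <- the value in the metric above
```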

For now though, #53 added ecs_svc_memory_limit_bytes (which IMO should be called ecs_task_memory_limit_bytes, FWIW, but I think there is plenty of other renaming to be done).

discordianfish avatar Mar 17 '23 15:03 discordianfish

@discordianfish Feel free to send a renaming PR.

SuperQ avatar Mar 19 '23 16:03 SuperQ

Fortunately there is a hierarchical_memory_limit which should give us what we want and it should be easy to add this to the exporter.

I can confirm this is correct. When there is a task-level memory limit but not a container-level one (common on Fargate), limit is nonsense but stats.hierarchical_memory_limit is correct (in my case, 512 MiB, the smallest Fargate task limit).

Meanwhile, in my sample EC2 task (without task-level limits and with container-level limits, as is typical there), there is no stats.hierarchical_memory_limit at all, and limit is correct (256Mi). So we'd have to prefer the former but fall back to the latter to be correct for both use cases.
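(Editor's note: the prefer-then-fall-back logic described above could be sketched like this — a Python illustration, not the exporter's actual Go code; the field names follow the Docker stats JSON quoted in this thread:)

```python
# cgroup v1 reports "no limit" as LONG_MAX rounded down to the 4 KiB page size.
NO_LIMIT = (2**63 - 1) & ~(4096 - 1)  # 9223372036854771712

def effective_memory_limit(memory_stats):
    """Pick a usable memory limit from a Docker MemoryStats blob:
    prefer stats.hierarchical_memory_limit (Fargate: task-level limit),
    fall back to limit (EC2: container-level limit). Returns None when
    neither is a real limit."""
    hier = memory_stats.get("stats", {}).get("hierarchical_memory_limit")
    if hier is not None and hier < NO_LIMIT:
        return hier
    limit = memory_stats.get("limit")
    if limit is not None and limit < NO_LIMIT:
        return limit
    return None

# Fargate-style: container limit is the "unlimited" sentinel, task limit
# only visible via the cgroup hierarchy.
fargate = {"limit": NO_LIMIT, "stats": {"hierarchical_memory_limit": 512 * 2**20}}
# EC2-style: container limit set, no hierarchical_memory_limit reported.
ec2 = {"limit": 256 * 2**20, "stats": {}}

print(effective_memory_limit(fargate))  # 536870912  (512 MiB)
print(effective_memory_limit(ec2))      # 268435456  (256 MiB)
```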

However, I think there might be a more direct option, which is to use the data exposed by the task metadata API to set the container memory-limit gauge. Here are the Fargate and EC2 task metadata for the same tasks as above. The API simply tells you what the container-level and task-level limits are, which seems less janky than interfacing with cgroups.
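(Editor's note: a sketch of that approach, pulling the limits out of a parsed v4 /task response — the shape follows the task metadata docs linked above, the sample values are hypothetical, and this is not the exporter's code:)

```python
def memory_limits(task_metadata):
    """Extract task- and container-level memory limits (MiB) from a
    parsed ECS task metadata v4 /task response, where present."""
    task_limit = task_metadata.get("Limits", {}).get("Memory")
    container_limits = {
        c["Name"]: c.get("Limits", {}).get("Memory")
        for c in task_metadata.get("Containers", [])
    }
    return task_limit, container_limits

# Trimmed sample shaped like a Fargate /task response (hypothetical values);
# note the container without a memory limit comes back as None.
sample = {
    "Limits": {"CPU": 0.25, "Memory": 1024},
    "Containers": [
        {"Name": "heartbeat", "Limits": {"CPU": 0}},
        {"Name": "prom_exporter", "Limits": {"CPU": 0, "Memory": 100}},
    ],
}
print(memory_limits(sample))
# (1024, {'heartbeat': None, 'prom_exporter': 100})
```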

I have this and other improvements sitting in my fork's develop branch, queued up behind #75.

isker avatar Oct 07 '24 05:10 isker