
[ECS] [Add Container-level Metrics]: Add Container-Level CPU & Memory metrics

Open rehevkor5 opened this issue 4 years ago • 12 comments

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request CloudWatch metrics should include container-level metrics for CPU and Memory use (for each replica). Ideally this would be queryable by Service, and visualized with one graph line for each container+replica.

Which service(s) is this request for? ECS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? When tuning CPU & memory requests, or diagnosing issues with containers getting killed due to going over memory limits, it's necessary to determine which of the several possible containers in a Task is having issues by understanding the CPU & memory use of each container.

Currently, it's impossible to tell how much CPU & memory a specific Container is using, so it's impossible to tell which Container in a Task might be going over its memory limit, or which Container might benefit from more CPU. CloudWatch only shows statistics which are generated at the Task+Replica level and are only queryable as a summary by Service. The summary metric is misleading because it might show that only 50% of memory is being used (max per instant across replicas), when in actuality one Container might be using >100% of its memory while another container might be using <10%.

Are you currently working around this issue? Trial by fire and experimentation... Launch the service, observe if possible, including SSHing into a specific EC2 instance to look at the output of things like docker ps and docker stats (not a very reliable procedure, as often by the time I've logged in the container has already been killed), make a guessed adjustment to the ECS configuration, launch it again, and repeat until things happen to work.

Additional context None.

Attachments None.

rehevkor5 avatar May 11 '20 15:05 rehevkor5

@rehevkor5 ECS container-level metrics are available as part of the Container Insights feature (https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-metrics-ECS.html). Drilling further into task-level metrics shows individual containers' CPU and memory resources consumed. Is there anything else that you're looking for?

sharanyad avatar May 12 '20 00:05 sharanyad

@sharanyad none of those seem to have a dimension based on task like the EKS container insights has around pods (eg pod_cpu_utilization in https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-metrics-EKS.html).

I do notice that I can see a list of containers in the "Container performance" part of the performance monitoring part of Cloudwatch insights for ECS tasks. Unfortunately it's hard to link that back to a specific task (there's no task ID that I can see there) and it also only shows the average CPU and memory usage across the duration of that task instead of a graph showing resource usage over time.

Am I missing something there? It feels very close to showing what I need but not quite there.

We're considering using Metricbeat to ship the data from docker stats (https://www.elastic.co/guide/en/beats/metricbeat/current/exported-fields-docker.html) but would prefer a more out of the box experience.

tomelliff avatar Jun 30 '20 15:06 tomelliff

ECS metrics out of the box are not there yet. But detailed metrics are available via the ECS task metadata stats endpoint (the ECS_CONTAINER_METADATA_URI env var), which provides stats for Fargate tasks as well. We built a sidecar to export these metrics: https://github.com/Spaced-Out/ecs-container-exporter
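
For reference, here is a minimal sketch of reading per-container stats from that endpoint inside a task (assuming the v3 metadata endpoint and the Docker stats response shape; verify the field names against your task's actual response):

import json
import os
import urllib.request

# The ECS agent injects this env var into every container (v3 metadata endpoint).
metadata_uri = os.environ["ECS_CONTAINER_METADATA_URI"]

# /task/stats returns Docker stats for every container in the task,
# keyed by Docker container ID.
with urllib.request.urlopen(f"{metadata_uri}/task/stats") as resp:
    task_stats = json.load(resp)

for container_id, stats in task_stats.items():
    if not stats:
        continue  # stats can be empty right after a container starts
    mem_used = stats.get("memory_stats", {}).get("usage", 0)
    mem_limit = stats.get("memory_stats", {}).get("limit", 0)
    print(f"{container_id[:12]}: {mem_used / 1e6:.0f} MB used / {mem_limit / 1e6:.0f} MB limit")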

raags avatar Jul 07 '20 06:07 raags

A few additional issues with the current container-level metrics available from Container Insights:

  1. No CPU data is reported at the task or container level unless you specify a CPU reservation/limit for the task/container. This makes things a lot less useful for those who don't want/need to set CPU reservations/limits. It would be nice if a sensible default was used for calculating the CPU usage percentage when no reservation is specified, e.g. either 1024 (which would match docker stats output) or vCPUs x 1024 (which should match total usage), as illustrated below.

  2. ~~The display of the container level data is inconsistent for me. It doesn't automatically show up when I view the ECS Tasks dashboard, it seems like I have to wait 30-60s for it to show up. Also, there is no indication of what period of time the average memory and cpu usage is captured for.~~

Edit: Ignore the second issue, it seems to be environment specific, so it is probably just a software bug.
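
To illustrate the two normalization options above (a hypothetical helper, not existing Container Insights behavior):

def cpu_percent(cpu_units_used: float, vcpus: int, per_vcpu: bool = True) -> float:
    # per_vcpu=True normalizes against 1024 units, matching `docker stats`
    # (a multi-threaded container can exceed 100%); per_vcpu=False normalizes
    # against total instance capacity (vcpus x 1024 units).
    denominator = 1024 if per_vcpu else vcpus * 1024
    return 100.0 * cpu_units_used / denominator

# e.g. 512 CPU units used on a 2-vCPU instance:
print(cpu_percent(512, 2, per_vcpu=True))   # 50.0
print(cpu_percent(512, 2, per_vcpu=False))  # 25.0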

talawahtech avatar Mar 09 '21 16:03 talawahtech

I would like to clarify: Container Insights enables the metric dimensions listed at https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-metrics-ECS.html

* TaskDefinitionFamily, ClusterName

* ServiceName, ClusterName

* ClusterName

However, none of those include a dimension like ContainerName. Please note that a ServiceName here may cover multiple containers internally, while we do want to look into each inner container's performance. We currently haven't set CPU reservations/limits, but my understanding is that even with CPU reservations/limits set up, we won't get per-container CPU metrics (to be displayed on a timeseries graph in CloudWatch).

Please correct me if my understanding is wrong.

If my understanding is correct, then the existing CloudWatch Container Insights is not the solution for this issue. We need either a different solution provided by AWS or an enhanced Container Insights with a ContainerName dimension.

spoilgo avatar Dec 27 '21 20:12 spoilgo

We really need these container-level metrics for Fargate in order to effectively monitor our applications. There are instances where a container could be using 100% of its allocated CPU but the task CPU usage only shows 30% usage.

Currently we have no way of knowing/monitoring this.

The docs linked by @sharanyad suggest it's there for EC2-backed ECS, but there's no Container ID dimension on the Fargate-backed ECS metrics.

jameselderxe avatar Apr 13 '22 10:04 jameselderxe

I am joining this thread late. To overcome some of the limitations mentioned in this thread I built this custom dashboard: https://github.com/mreferre/container-insights-custom-dashboards/tree/master/fargate-right-sizing. It was meant for "right sizing" but it could obviously be useful for other use cases.

Note that it drills down to task-level granularity (to overcome the default Container Insights task-definition-level granularity). I did not go all the way to container-level granularity (and honestly I don't even remember whether that was because it wasn't available when I built this, 2+ years ago, or because I didn't deem it necessary at the time). However, container-level granularity should be possible today (and it seems it's possible to correlate containers to task IDs, contrary to what someone was alluding to? Or am I missing something?).

HTH.

mreferre avatar Apr 22 '22 10:04 mreferre

@mreferre The issue is that container-level metrics are only available for EC2-backed ECS and not Fargate; it's documented here: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-metrics-ECS.html

Container Level Metrics

The other metrics in the docs, in the table above that one, don't have the Container ID dimension on them.

jameselderxe avatar Apr 22 '22 16:04 jameselderxe

Yes, but those are metrics. The link I pasted above talks about the "performance log events", which seem to include container-level numbers:

[screenshot: a performance log event showing container-level fields]

So you won't be able to set alarms on those or do anything else you'd do with metrics, BUT you can generate dashboards and extract meaningful information using CloudWatch Logs Insights (see the GH repo I linked above as an example).

mreferre avatar Apr 22 '22 21:04 mreferre

+1. I want to be able to see metrics per task ID in Container Insights.

wsscc2021 avatar May 27 '22 04:05 wsscc2021

You can see this stuff, although you have to use CloudWatch Logs Insights to pull the data logged by Container Insights.

Here's an example query to summarize a cluster/container's CPU and memory usage:

fields @message
| filter Type="Container"
| filter @logStream like /FargateTelemetry/
| stats  latest(ClusterName) as Cluster, max(CpuReserved) as MaxCpuReserved, avg(CpuUtilized) as AvgCpuUtilized, max(CpuUtilized) as PeakCpuUtilized, ceil(avg(MemoryUtilized)) as AvgMemUtilized, max(MemoryUtilized) as PeakMemUtilized by ContainerName
| sort ContainerName asc

sblack4 avatar Jul 19 '22 21:07 sblack4

Unfortunately that query doesn’t appear to give the memory usage per container but instead gives the memory usage at the service/cluster level.

This is apparent if you are running multiple containers in a task and multiple tasks as you’ll see a memory utilised percentage above 100%.
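
A possible refinement, assuming the Container-type performance log events also carry a TaskId field (as in the Container Insights performance log event reference), is to group by both task and container so that replicas are not rolled up together:

fields @message
| filter Type="Container"
| filter @logStream like /FargateTelemetry/
| stats avg(CpuUtilized) as AvgCpuUtilized, max(CpuUtilized) as PeakCpuUtilized, ceil(avg(MemoryUtilized)) as AvgMemUtilized, max(MemoryUtilized) as PeakMemUtilized by TaskId, ContainerName
| sort ContainerName asc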

jameselderxe avatar Jul 27 '22 07:07 jameselderxe