How to use `ecs_cpu_seconds_total` metric

Open KristianAsp opened this issue 1 year ago • 1 comments

I'm trying to graph a basic CPU utilization overview for my ECS tasks using `ecs_cpu_seconds_total`, but I'm struggling to get something that matches what I can see in CloudWatch.

I've noticed that most, if not all, of the examples online relating to node_exporter and its CPU metric rely on also recording the idle time, which I don't think we do here.

Any suggestions on what PromQL query would accurately graph the CPU utilization as a percentage?

I tried a range of queries similar to the following, with no luck:

`avg(rate(ecs_cpu_seconds_total{ecs_service_name="prod-example", container!="ecs-exporter"}[1m]))`

`100 - (100 - (100 * (avg(rate(ecs_cpu_seconds_total{ecs_service_name="prod-example", container!="ecs-exporter"}[1m])))))`

The 2nd one was an attempt to reverse-engineer what the idle metric would be (100% - current usage should give the idle?) but it's not right. I can see the right trends, i.e. spikes in CPU are correctly reflected in my own graphs, but the values are incorrect.

Any ideas?

Dec 10 '24 13:12 KristianAsp

Hi @KristianAsp, sorry for the very delayed response.

We've just cut a release where many metrics, including this one, have been overhauled. As you can see in the snapshot output, there is a new metric called `ecs_container_cpu_usage_seconds_total`, a name which should make a bit more sense than `ecs_cpu_seconds_total`. If you update to 0.4.0, you will have to use the new metric in your queries. You can check out the README for more details on these metrics.

To your actual question:

> I've noticed most, if not all, of the examples online relating to node_exporter and its cpu metric relies on also recording the idle time, which I don't think we do here.

Yes, there is no idle time available to ecs_exporter as there is in node_exporter. You can see the data available to ecs_exporter in the docker stats response here. It simply does not include any idle time. (And unfortunately only some undocumented subset of this data is actually available and correct in the ECS task stats API, which is what ecs_exporter is using.)

All we have available is the cumulative number of CPU-seconds used by the task, in `ecs_container_cpu_usage_seconds_total`. So, `rate(ecs_container_cpu_usage_seconds_total[1m])` gives you CPU-seconds per second, i.e. how many vCPUs were in use on average over that one-minute window. As you observed, spikes in usage are represented in this metric alone. It can be useful enough to plot just this metric.

However, if you want a percentage utilization number like the one in CloudWatch, you need to divide `rate(ecs_container_cpu_usage_seconds_total[1m])` by the number of vCPUs available to the task. If you've configured a task-level CPU limit (which is required in Fargate but optional in EC2), that is available in `ecs_task_cpu_limit_vcpus`, so you could do something like `sum(rate(ecs_container_cpu_usage_seconds_total[1m])) / sum(ecs_task_cpu_limit_vcpus)`. That expression yields a fraction of the limit, so multiply it by 100 to get a percentage. (You will need to use the correct label selectors in this expression that are specific to your setup.)

If your task does not have a CPU limit, your task has access to all the vCPUs of the EC2 instance on which it's running. ecs_exporter does not currently have a metric that can tell you this, and I'm not sure it's possible at all: I'll have to look into whether `OnlineCPUs` can help here. If you simply know the number, and if it's not likely to change, you could hard-code it in the expression, like `sum(rate(ecs_container_cpu_usage_seconds_total[1m])) / 16`.
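To make the arithmetic concrete, here is a rough Python sketch of what those expressions compute. It is only an illustration: the sample values are made up, the function names `cpu_rate` and `cpu_utilization_percent` are hypothetical, and the rate calculation is a simplification of PromQL's `rate()` (it ignores counter resets and extrapolation to the window boundaries).

```python
def cpu_rate(samples):
    """Simplified stand-in for PromQL rate(): per-second increase of a counter.

    samples: list of (timestamp_seconds, cumulative_cpu_seconds), oldest first.
    Unlike the real rate(), this ignores counter resets and extrapolation.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need at least two samples with increasing timestamps")
    return (v1 - v0) / (t1 - t0)


def cpu_utilization_percent(samples, vcpus):
    """CPU-seconds per second divided by available vCPUs, as a percentage."""
    return 100.0 * cpu_rate(samples) / vcpus


# Made-up data: the counter grew from 120 to 150 CPU-seconds over 60 seconds.
samples = [(0.0, 120.0), (60.0, 150.0)]

vcpus_in_use = cpu_rate(samples)                       # 30 / 60 = 0.5 vCPUs
with_limit = cpu_utilization_percent(samples, 2.0)     # task limit of 2 vCPUs
hard_coded = cpu_utilization_percent(samples, 16.0)    # known 16-vCPU instance

print(vcpus_in_use, with_limit, hard_coded)
```

With a task limit of 2 vCPUs, half a vCPU of average usage works out to 25%. The same usage measured against a hard-coded 16 vCPUs is about 3%, which is why the choice of denominator matters when comparing against CloudWatch.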

Hope this helps.

Mar 20 '25 02:03 isker