aws-otel-collector
ECS container.cpu.utilized metric unit clarification
Hello!
What is the unit of the ECS `container.cpu.utilized` metric? Please help me understand, because it does not align with the `CPUUtilization` metric of the EC2 instance.
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 30 days.
Hello, team.
We've been trying to figure out how the `container.cpu.utilized` metric is calculated, but no luck.

Platform: AWS ECS (Fargate or EC2, observed on both capacity providers)
ADOT version: v0.22.0

We have a few observations, shared below.

Initially, we set the CPU value at the Task level only, leaving the container-level CPU at 0:
```json
{
  ...
  "containerDefinitions": [
    {
      "cpu": 0,
      "name": "app"
    }
  ],
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "cpu": "8192",
  ...
}
```
A container-level CPU value of 0 is converted to 2 in the background before being passed to `docker run --cpu-shares`. Explanations of this behavior are here and here.

This can be confirmed by graphing the `container.cpu.reserved` metric for a Task that has container-level CPU set to 0. In that case, `container.cpu.utilized` looks sane.
Next, we set the container-level CPU to a non-zero value, like:
```json
{
  ...
  "containerDefinitions": [
    {
      "cpu": 7168,
      "name": "app"
    }
  ],
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "cpu": "8192",
  ...
}
```
This is correctly reflected in the `container.cpu.reserved` metric, but `container.cpu.utilized` gets messed up.
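As a side note on how we read the numbers in the task definition above (based on ECS's documented convention that 1024 CPU units equal 1 vCPU):

```go
package main

import "fmt"

// UnitsToVCPU converts ECS CPU units to vCPU (1024 units = 1 vCPU).
func UnitsToVCPU(units float64) float64 {
	return units / 1024.0
}

func main() {
	fmt.Println(UnitsToVCPU(8192))        // task size: 8 vCPU
	fmt.Println(UnitsToVCPU(7168))        // app container reservation: 7 vCPU
	fmt.Println(UnitsToVCPU(8192 - 7168)) // left over for the sidecars: 1 vCPU
}
```

So with this task definition, 7 of the 8 vCPU are reserved for the app container, and the remaining 1024 units are shared among the sidecars.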
Questions:

- How is `container.cpu.utilized` related to `container.cpu.reserved`?
- Which CloudWatch metrics are used for `container.cpu.utilized`?
Thank you.
@imishchuk-carbon can you share the collector config you are using?
Hey @bryan-aguilar, thanks for looking into this. Config below:
```yaml
extensions:
  health_check:
  sigv4auth:
    region: "us-east-1"

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  awsxray:
    endpoint: 0.0.0.0:2000
    transport: udp
  awsecscontainermetrics:
  prometheus:
    config:
      scrape_configs:
        - job_name: otel-envoy-eg
          scrape_interval: 5s
          metrics_path: /stats/prometheus
          static_configs:
            - targets: ["localhost:9901"]
              labels:
                __ecs_container_metadata_uri: ${ECS_CONTAINER_METADATA_URI}
          relabel_configs:
            - source_labels: [__ecs_container_metadata_uri]
              target_label: ecs_task_id
              regex: '.*?/([a-z0-9]+)-\d+$'

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 20
    spike_limit_percentage: 15
  batch/traces:
    timeout: 5s
    send_batch_size: 50
  batch/metrics:
    timeout: 60s
  resourcedetection:
    detectors:
      - env
      - system
      - ecs
      - ec2

exporters:
  otlphttp:
    endpoint: "${OTEL_COLLECTOR_ENDPOINT}"
  awsemf:
    namespace: ECS/AWSOTel/Application
    log_group_name: '/aws/ecs/application/metrics'
    region: "${OTEL_EXPORT_AMP_REGION}"
  # AWS Managed Prometheus Collector configuration
  prometheusremotewrite:
    endpoint: "${OTEL_EXPORTER_AMP_ENDPOINT}"
    auth:
      authenticator: sigv4auth
    resource_to_telemetry_conversion:
      enabled: true
    remote_write_queue:
      enabled: true
      num_consumers: 1
      queue_size: 5000
  logging:
    loglevel: debug

service:
  pipelines:
    traces:
      receivers: [otlp, awsxray]
      processors: [memory_limiter, resourcedetection, batch/traces]
      exporters: [otlphttp]
    metrics/application:
      receivers: [otlp]
      processors: [memory_limiter, resourcedetection, batch/metrics]
      exporters: [prometheusremotewrite]
    metrics/envoy:
      receivers: [prometheus]
      processors: [memory_limiter, batch/metrics]
      exporters: [prometheusremotewrite]
    metrics:
      receivers: [awsecscontainermetrics]
      processors: [memory_limiter, batch/metrics]
      exporters: [otlphttp]
  extensions: [sigv4auth, health_check]
```
CPU utilization is calculated as Container CPU Usage / Container CPU Reserved. You can see that in the receiver code here.

```go
if containerMetrics.CPUReserved > 0 {
	containerMetrics.CPUUtilized = (containerMetrics.CPUUtilized / containerMetrics.CPUReserved)
}
```
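For illustration, a minimal sketch of that division (not the receiver's actual code; it assumes usage and reservation are expressed in the same unit, which is how we read the snippet above):

```go
package main

import "fmt"

// Normalize mirrors the receiver's calculation shown above, as a pure
// function: when a CPU reservation exists, raw usage is divided by it.
func Normalize(utilized, reserved float64) float64 {
	if reserved > 0 {
		return utilized / reserved
	}
	return utilized
}

func main() {
	// Explicit reservation: 0.5 vCPU used out of 7 vCPU reserved,
	// i.e. about 7% of the reservation.
	fmt.Println(Normalize(0.5, 7.0))

	// "cpu": 0 defaulted to 2 CPU shares (2/1024 vCPU): the tiny
	// denominator inflates the same 0.5 vCPU of usage to 256.
	fmt.Println(Normalize(0.5, 2.0/1024.0))
}
```

This is exactly why the defaulted reservation of 2 shares produces the "messed up" values reported above.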
Okay, can you help me understand the logic behind this calculation, please? Why is `containerMetrics.CPUUtilized` redefined as its own value divided by `containerMetrics.CPUReserved`?

`containerMetrics.CPUReserved` is never 0; the minimum it gets is 2. And when it's set to 2, it means the container does not have a guaranteed CPU share in the Task.
E.g.
```json
{
  ...
  "containerDefinitions": [
    {
      "cpu": 0,
      "name": "app"
    }
  ],
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "cpu": "2048",
  ...
}
```
Container metadata:

```shell
$ curl -s $ECS_CONTAINER_METADATA_URI_V4/task | jq -rc '.Containers[] | "\(.Name): \(.Limits)"'
firelens: {"CPU":2}
app: {"CPU":2}
otel-collector: {"CPU":2}
envoy: {"CPU":2}
```
![image](https://user-images.githubusercontent.com/100925121/198534680-f91e9901-07ba-454e-bf9a-ae73964201a3.png)
In light of the above, I think the comparison `if containerMetrics.CPUReserved > 0` should be changed to `if containerMetrics.CPUReserved > 2`, because 2 (and `null`, 0, 1) is a special case.
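A minimal sketch of the change being proposed (hypothetical — not code from the receiver):

```go
package main

import "fmt"

// NormalizeGuarded applies the proposed guard: a reservation of 2
// (the value ECS substitutes for an unset/0/1 container-level CPU)
// or less is treated as "no real reservation", so raw usage is
// passed through unchanged instead of being divided by a
// meaningless denominator.
func NormalizeGuarded(utilized, reserved float64) float64 {
	if reserved > 2 {
		return utilized / reserved
	}
	return utilized
}

func main() {
	fmt.Println(NormalizeGuarded(50, 100)) // genuine reservation: ratio 0.5
	fmt.Println(NormalizeGuarded(50, 2))   // defaulted reservation: raw 50
}
```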
To clarify this comment: there are two types of capacity providers available for ECS, and they have different requirements for CPU configuration:

- FARGATE - Task-level CPU is required, container-level CPU is optional
- Non-Fargate (EC2 or External) - both settings are optional

So the following combinations are possible.

FARGATE:

- Task-level CPU is set, container-level CPU is not set (defaults to 2)
- Task-level CPU is set, container-level CPU is set
EC2:

- Task-level CPU is not set (the Task can consume all CPU on the EC2 instance), container-level CPU is not set. What would be calculated instead in this case?

```json
{
  ...
  "Limits": {
    "Memory": 3584
  },
  ...
}
```

- Task-level CPU is set, container-level CPU is not set
- Task-level CPU is set, container-level CPU is set
Thank you.
We're going to take a deeper look at this and will get back to this issue when we have an update. Thanks for bringing our attention back to this!
Looking forward to updates. Thank you.
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 30 days.
This issue was closed because it has been marked as stale for 30 days with no activity.
What this metric represents is unclear and not explained in the docs. Could this issue be re-opened, please?
Can we please reopen this ticket? The docs are misleading. Here we can see that the unit is listed as None for both `ecs.task.cpu.utilized` and `container.cpu.utilized`: https://aws-otel.github.io/docs/components/ecs-metrics-receiver

However, this is false, because as pointed out in this comment and here, these are PERCENTAGES!

Just lost a couple of hours trying to make any sense of these values... Please update the AWS Distro page with the right units!
I have updated the AWS Distro page with the correct units for those metrics. Is there any more explanation needed here? If not, we can close the issue.
Closing this issue now that the metric unit has been clarified and there are no other questions.