aws-otel-collector
ECS container.cpu.utilized metric unit clarification
Hello!
What is the unit of the ECS `container.cpu.utilized` metric? Please help me understand, because it does not align with the `CPUUtilization` metric of the EC2 instance.
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 30 days.
Hello, team.
We've been trying to figure out how the `container.cpu.utilized` metric is calculated, but no luck.

Platform: AWS ECS (Fargate or EC2, observed on both capacity providers)
ADOT version: v0.22.0

We have a few observations, shared below.

Initially, we set the CPU value at the Task level only, leaving the container-level CPU at 0:
```json
{
  ...
  "containerDefinitions": [
    {
      "cpu": 0,
      "name": "app"
    }
  ],
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "cpu": "8192",
  ...
}
```
A container-level CPU value of 0 is converted to 2 in the background before being passed to `docker run --cpu-shares`. Explanations of this behavior are here and here.

This can be confirmed by graphing the `container.cpu.reserved` metric for a Task that has container-level CPU set to 0. In that case, `container.cpu.utilized` looks sane.
Next, we set the container-level CPU to a non-zero value, like:
```json
{
  ...
  "containerDefinitions": [
    {
      "cpu": 7168,
      "name": "app"
    }
  ],
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "cpu": "8192",
  ...
}
```
This is correctly reflected in the `container.cpu.reserved` metric, but `container.cpu.utilized` gets messed up.
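As a side note on how we read the numbers in the task definition above (based on ECS's documented convention that 1024 CPU units equal 1 vCPU):

```go
package main

import "fmt"

// UnitsToVCPU converts ECS CPU units to vCPU (1024 units = 1 vCPU).
func UnitsToVCPU(units float64) float64 {
	return units / 1024.0
}

func main() {
	fmt.Println(UnitsToVCPU(8192))        // task size: 8 vCPU
	fmt.Println(UnitsToVCPU(7168))        // app container reservation: 7 vCPU
	fmt.Println(UnitsToVCPU(8192 - 7168)) // left over for the sidecars: 1 vCPU
}
```

So with this task definition, 7 of the 8 vCPU are reserved for the app container, and the remaining 1024 units are shared among the sidecars.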
Questions:

- How is `container.cpu.utilized` related to `container.cpu.reserved`?
- Which CloudWatch metrics are used for `container.cpu.utilized`?
Thank you.
@imishchuk-carbon can you share the collector config you are using?
Hey @bryan-aguilar, thanks for looking into this. Config below:
```yaml
extensions:
  health_check:
  sigv4auth:
    region: "us-east-1"

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  awsxray:
    endpoint: 0.0.0.0:2000
    transport: udp
  awsecscontainermetrics:
  prometheus:
    config:
      scrape_configs:
        - job_name: otel-envoy-eg
          scrape_interval: 5s
          metrics_path: /stats/prometheus
          static_configs:
            - targets: ["localhost:9901"]
              labels:
                __ecs_container_metadata_uri: ${ECS_CONTAINER_METADATA_URI}
          relabel_configs:
            - source_labels: [__ecs_container_metadata_uri]
              target_label: ecs_task_id
              regex: '.*?/([a-z0-9]+)-\d+$'

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 20
    spike_limit_percentage: 15
  batch/traces:
    timeout: 5s
    send_batch_size: 50
  batch/metrics:
    timeout: 60s
  resourcedetection:
    detectors:
      - env
      - system
      - ecs
      - ec2

exporters:
  otlphttp:
    endpoint: "${OTEL_COLLECTOR_ENDPOINT}"
  awsemf:
    namespace: ECS/AWSOTel/Application
    log_group_name: '/aws/ecs/application/metrics'
    region: "${OTEL_EXPORT_AMP_REGION}"
  # AWS Managed Prometheus Collector configuration
  prometheusremotewrite:
    endpoint: "${OTEL_EXPORTER_AMP_ENDPOINT}"
    auth:
      authenticator: sigv4auth
    resource_to_telemetry_conversion:
      enabled: true
    remote_write_queue:
      enabled: true
      num_consumers: 1
      queue_size: 5000
  logging:
    loglevel: debug

service:
  pipelines:
    traces:
      receivers: [otlp, awsxray]
      processors: [memory_limiter, resourcedetection, batch/traces]
      exporters: [otlphttp]
    metrics/application:
      receivers: [otlp]
      processors: [memory_limiter, resourcedetection, batch/metrics]
      exporters: [prometheusremotewrite]
    metrics/envoy:
      receivers: [prometheus]
      processors: [memory_limiter, batch/metrics]
      exporters: [prometheusremotewrite]
    metrics:
      receivers: [awsecscontainermetrics]
      processors: [memory_limiter, batch/metrics]
      exporters: [otlphttp]
  extensions: [sigv4auth, health_check]
```
CPU utilization is calculated as Container CPU Usage / Container CPU Reserved. You can see that in the receiver code here.

```go
if containerMetrics.CPUReserved > 0 {
	containerMetrics.CPUUtilized = (containerMetrics.CPUUtilized / containerMetrics.CPUReserved)
}
```
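For illustration, a minimal sketch of that division (not the receiver's actual code; it assumes usage and reservation are expressed in the same unit, which is how we read the snippet above):

```go
package main

import "fmt"

// Normalize mirrors the receiver's calculation shown above, as a pure
// function: when a CPU reservation exists, raw usage is divided by it.
func Normalize(utilized, reserved float64) float64 {
	if reserved > 0 {
		return utilized / reserved
	}
	return utilized
}

func main() {
	// Explicit reservation: 0.5 vCPU used out of 7 vCPU reserved,
	// i.e. about 7% of the reservation.
	fmt.Println(Normalize(0.5, 7.0))

	// "cpu": 0 defaulted to 2 CPU shares (2/1024 vCPU): the tiny
	// denominator inflates the same 0.5 vCPU of usage to 256.
	fmt.Println(Normalize(0.5, 2.0/1024.0))
}
```

This is exactly why the defaulted reservation of 2 shares produces the "messed up" values reported above.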
Okay, can you help me understand the logic behind this calculation, please? Why is `containerMetrics.CPUUtilized` redefined as its own value divided by `containerMetrics.CPUReserved`?

`containerMetrics.CPUReserved` is never 0; the minimum it gets is 2. And when it's set to 2, it means the container does not have a guaranteed CPU share in the Task.
E.g.
```json
{
  ...
  "containerDefinitions": [
    {
      "cpu": 0,
      "name": "app"
    }
  ],
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "cpu": "2048",
  ...
}
```
Container metadata:

```shell
$ curl -s $ECS_CONTAINER_METADATA_URI_V4/task | jq -rc '.Containers[] | "\(.Name): \(.Limits)"'
firelens: {"CPU":2}
app: {"CPU":2}
otel-collector: {"CPU":2}
envoy: {"CPU":2}
```
![image](https://user-images.githubusercontent.com/100925121/198534680-f91e9901-07ba-454e-bf9a-ae73964201a3.png)
In light of the above, I think the comparison `if containerMetrics.CPUReserved > 0` should be changed to `if containerMetrics.CPUReserved > 2`, because 2 (and `null`, 0, 1) is a special case.
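A minimal sketch of the change being proposed (hypothetical — not code from the receiver):

```go
package main

import "fmt"

// NormalizeGuarded applies the proposed guard: a reservation of 2
// (the value ECS substitutes for an unset/0/1 container-level CPU)
// or less is treated as "no real reservation", so raw usage is
// passed through unchanged instead of being divided by a
// meaningless denominator.
func NormalizeGuarded(utilized, reserved float64) float64 {
	if reserved > 2 {
		return utilized / reserved
	}
	return utilized
}

func main() {
	fmt.Println(NormalizeGuarded(50, 100)) // genuine reservation: ratio 0.5
	fmt.Println(NormalizeGuarded(50, 2))   // defaulted reservation: raw 50
}
```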
To clarify this comment: there are two types of capacity providers available for ECS, and they have different requirements for CPU configuration:

- FARGATE - Task-level CPU is required, container-level CPU is optional
- Non-Fargate (EC2 or External) - both settings are optional

So the following combinations are possible.

FARGATE:

- Task-level CPU is set, container-level CPU is not set (defaults to 2)
- Task-level CPU is set, container-level CPU is set
EC2:

- Task-level CPU is not set (the Task can consume all CPU on the EC2 instance), container-level CPU is not set. What would be calculated instead in this case?

```json
{
  ...
  "Limits": {
    "Memory": 3584
  },
  ...
}
```

- Task-level CPU is set, container-level CPU is not set
- Task-level CPU is set, container-level CPU is set
Thank you.
We're going to take a deeper look at this and will get back to this issue when we have an update. Thanks for bringing our attention back to this!
Looking forward to updates. Thank you.
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 30 days.
This issue was closed because it has been marked as stale for 30 days with no activity.
What this metric represents is unclear and not explained in the docs. Could this issue be re-opened, please?
Can we please reopen this ticket? The docs are misleading. Here we can see that the unit is listed as None for both `ecs.task.cpu.utilized` and `container.cpu.utilized`: https://aws-otel.github.io/docs/components/ecs-metrics-receiver

However, this is false, because as pointed out in this comment and here, these are PERCENTAGES!

Just lost a couple of hours trying to make any sense of these values... Please update the AWS Distro page with the right units!
I have updated the AWS Distro page with the correct units for those metrics. Is there any more explanation needed here? If not, we can close the issue.
Closing this issue now that the metric unit has been clarified and there are no other questions.