aws-otel-community
Instrumenting Prometheus metrics on an ECS service with multiple instances of task definition
Hello,
I'm working on a prototype to instrument application metrics in Prometheus for an application hosted in ECS as an application-load-balanced Fargate service. The AWS OTel Collector runs as a sidecar container, with the full config included at the bottom of this post.
This works well with a single task; however, when multiple instances of the task definition are running, I'm unable to distinguish which task a given metric was scraped from, resulting in inaccurate data.
Each metric has an instance label, but it holds the same value across all instances: 0.0.0.0:8080, which is the scrape target.
As a result, every sidecar collector writes an identical series (same metric name and labels) to Amazon Managed Service for Prometheus, and I have found no way to tell them apart. One scrape-time idea is sketched below.
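Since the collector expands environment variables in its config, a per-task label could in principle be stamped onto every scraped series via `static_configs` labels. This assumes I inject an ECS_TASK_ID variable into the container myself (ECS doesn't provide one out of the box; an entrypoint script could fetch it from the task metadata endpoint), so treat it as a sketch rather than something I've confirmed works:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "aws-otel-app"
          honor_labels: true
          static_configs:
            - targets: ["0.0.0.0:8080"]
              # Extra label attached to every series scraped from this target.
              # ECS_TASK_ID is an env var I would have to set myself.
              labels:
                ecs_task_id: ${ECS_TASK_ID}
```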
I tried enabling `resource_to_telemetry_conversion` on the exporter (see the config below), which added the labels `service_name` and `service_instance_id`. However, the values are not unique either: every metric gets `aws-otel-app` (the job name) and `0.0.0.0:8080` (the scrape target).
I can see the target metric has the following ECS-related labels: aws_ecs_cluster_name, aws_ecs_launchtype, aws_ecs_service_name, aws_ecs_task_arn, aws_ecs_task_family, aws_ecs_task_id, aws_ecs_task_known_status, aws_ecs_task_launch_type, aws_ecs_task_pull_started_at, aws_ecs_task_pull_stopped_at, aws_ecs_task_revision.
Applying the aws_ecs_task_id label to all of the other metrics would solve this, but I have been unable to do it successfully; one direction I'm exploring is sketched below.
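The opentelemetry-collector-contrib resourcedetection processor has an ecs detector that queries the ECS task metadata endpoint and attaches aws.ecs.* resource attributes, which `resource_to_telemetry_conversion` should then surface as labels on every series. I haven't verified exactly which attributes the detector emits in the ADOT image I'm using, so this is an unconfirmed sketch:

```yaml
processors:
  resourcedetection:
    # Assumption: the ecs detector populates aws.ecs.* attributes
    # (task ARN, family, revision, ...) from the task metadata endpoint.
    detectors: [env, ecs]
    timeout: 2s
    override: false

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [resourcedetection]
      exporters: [logging, awsprometheusremotewrite]
```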
I appreciate any help or guidance.
Thanks,
Matt
adot-config:

```yaml
receivers:
  prometheus:
    config:
      global:
        scrape_interval: 30s
        scrape_timeout: 10s
      scrape_configs:
        - job_name: "aws-otel-app"
          honor_labels: true
          static_configs:
            - targets: ["0.0.0.0:8080"]
  awsecscontainermetrics:
    collection_interval: 10s
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:55681

processors:
  filter:
    metrics:
      include:
        match_type: strict
        metric_names:
          - ecs.task.memory.utilized
          - ecs.task.memory.reserved
          - ecs.task.cpu.utilized
          - ecs.task.cpu.reserved
          - ecs.task.network.rate.rx
          - ecs.task.network.rate.tx
          - ecs.task.storage.read_bytes
          - ecs.task.storage.write_bytes
  memory_limiter:
    limit_mib: 100
    check_interval: 5s

exporters:
  awsprometheusremotewrite:
    endpoint: https://aps-workspaces.eu-west-2.amazonaws.com/workspaces/ws-{workspace-id}/api/v1/remote_write
    aws_auth:
      region: {region}
      service: aps
    resource_to_telemetry_conversion:
      enabled: true
  logging:
    loglevel: info
  awsxray:
    region: {region}
    index_all_attributes: true

extensions:
  health_check:
  pprof:
    endpoint: :1888
  zpages:
    endpoint: :55679

service:
  extensions: [pprof, zpages, health_check]
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [logging, awsprometheusremotewrite]
    metrics/ecs:
      receivers: [awsecscontainermetrics]
      processors: [filter]
      exporters: [logging, awsprometheusremotewrite]
    traces:
      receivers: [otlp]
      processors: [memory_limiter]
      exporters: [awsxray]
```
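Another fallback I've considered: the upstream prometheusremotewrite exporter supports an `external_labels` map that is appended to every exported series, and I'm assuming (unverified) that awsprometheusremotewrite honors the same setting. As with the scrape-time sketch above, this relies on me injecting a unique ECS_TASK_ID env var into the container myself:

```yaml
exporters:
  awsprometheusremotewrite:
    endpoint: https://aps-workspaces.eu-west-2.amazonaws.com/workspaces/ws-{workspace-id}/api/v1/remote_write
    aws_auth:
      region: {region}
      service: aps
    resource_to_telemetry_conversion:
      enabled: true
    # Assumption: external_labels works here as in the upstream
    # prometheusremotewrite exporter; ECS_TASK_ID is injected by me.
    external_labels:
      ecs_task_id: ${ECS_TASK_ID}
```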