aws-otel-community
Instrumenting Prometheus metrics on an ECS service with multiple instances of task definition
Hello,
I'm working on a prototype to instrument application metrics in Prometheus for an application hosted in ECS as an application-load-balanced Fargate service. The AWS OTel Collector runs as a sidecar container, with the full config included at the bottom of this post.
This works well with a single task; however, when multiple instances of the task definition are running, I'm unable to distinguish which task a given metric was scraped from, resulting in inaccurate data.
Each metric has an instance label, but it holds the same value across all instances: 0.0.0.0:8080, which is the scrape target.
As a result, every sidecar collector writes an identical series (same metric name and labels) to Amazon Managed Service for Prometheus, and I have found no way to tell them apart. One scrape-time idea is sketched below.
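Since the collector expands environment variables in its config, a per-task label could in principle be stamped onto every scraped series via `static_configs` labels. This assumes I inject an ECS_TASK_ID variable into the container myself (ECS doesn't provide one out of the box; an entrypoint script could fetch it from the task metadata endpoint), so treat it as a sketch rather than something I've confirmed works:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "aws-otel-app"
          honor_labels: true
          static_configs:
            - targets: ["0.0.0.0:8080"]
              # Extra label attached to every series scraped from this target.
              # ECS_TASK_ID is an env var I would have to set myself.
              labels:
                ecs_task_id: ${ECS_TASK_ID}
```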
I tried enabling `resource_to_telemetry_conversion` on the exporter (see the config below), which added the labels `service_name` and `service_instance_id`. However, the values are not unique either: every metric gets `aws-otel-app` (the job name) and `0.0.0.0:8080` (the scrape target).
I can see the target metric has the following ECS-related labels: aws_ecs_cluster_name, aws_ecs_launchtype, aws_ecs_service_name, aws_ecs_task_arn, aws_ecs_task_family, aws_ecs_task_id, aws_ecs_task_known_status, aws_ecs_task_launch_type, aws_ecs_task_pull_started_at, aws_ecs_task_pull_stopped_at, aws_ecs_task_revision.
Applying the aws_ecs_task_id label to all of the other metrics would solve this, but I have been unable to do it successfully; one direction I'm exploring is sketched below.
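The opentelemetry-collector-contrib resourcedetection processor has an ecs detector that queries the ECS task metadata endpoint and attaches aws.ecs.* resource attributes, which `resource_to_telemetry_conversion` should then surface as labels on every series. I haven't verified exactly which attributes the detector emits in the ADOT image I'm using, so this is an unconfirmed sketch:

```yaml
processors:
  resourcedetection:
    # Assumption: the ecs detector populates aws.ecs.* attributes
    # (task ARN, family, revision, ...) from the task metadata endpoint.
    detectors: [env, ecs]
    timeout: 2s
    override: false

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [resourcedetection]
      exporters: [logging, awsprometheusremotewrite]
```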
I appreciate any help or guidance.
Thanks,
Matt
adot-config:

```yaml
receivers:
  prometheus:
    config:
      global:
        scrape_interval: 30s
        scrape_timeout: 10s
      scrape_configs:
        - job_name: "aws-otel-app"
          honor_labels: true
          static_configs:
            - targets: ["0.0.0.0:8080"]
  awsecscontainermetrics:
    collection_interval: 10s
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:55681

processors:
  filter:
    metrics:
      include:
        match_type: strict
        metric_names:
          - ecs.task.memory.utilized
          - ecs.task.memory.reserved
          - ecs.task.cpu.utilized
          - ecs.task.cpu.reserved
          - ecs.task.network.rate.rx
          - ecs.task.network.rate.tx
          - ecs.task.storage.read_bytes
          - ecs.task.storage.write_bytes
  memory_limiter:
    limit_mib: 100
    check_interval: 5s

exporters:
  awsprometheusremotewrite:
    endpoint: https://aps-workspaces.eu-west-2.amazonaws.com/workspaces/ws-{workspace-id}/api/v1/remote_write
    aws_auth:
      region: {region}
      service: aps
    resource_to_telemetry_conversion:
      enabled: true
  logging:
    loglevel: info
  awsxray:
    region: {region}
    index_all_attributes: true

extensions:
  health_check:
  pprof:
    endpoint: :1888
  zpages:
    endpoint: :55679

service:
  extensions: [pprof, zpages, health_check]
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [logging, awsprometheusremotewrite]
    metrics/ecs:
      receivers: [awsecscontainermetrics]
      processors: [filter]
      exporters: [logging, awsprometheusremotewrite]
    traces:
      receivers: [otlp]
      processors: [memory_limiter]
      exporters: [awsxray]
```
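Another fallback I've considered: the upstream prometheusremotewrite exporter supports an `external_labels` map that is appended to every exported series, and I'm assuming (unverified) that awsprometheusremotewrite honors the same setting. As with the scrape-time sketch above, this relies on me injecting a unique ECS_TASK_ID env var into the container myself:

```yaml
exporters:
  awsprometheusremotewrite:
    endpoint: https://aps-workspaces.eu-west-2.amazonaws.com/workspaces/ws-{workspace-id}/api/v1/remote_write
    aws_auth:
      region: {region}
      service: aps
    resource_to_telemetry_conversion:
      enabled: true
    # Assumption: external_labels works here as in the upstream
    # prometheusremotewrite exporter; ECS_TASK_ID is injected by me.
    external_labels:
      ecs_task_id: ${ECS_TASK_ID}
```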