aws-otel-collector

How to prevent duplicate springboot metrics from OTel collector using awsemfexporter

Open georges-git opened this issue 2 years ago • 6 comments

Hello, I am running the OTel collector as a sidecar for an app deployed in Fargate. The awsemfexporter is flooding the OTel log with messages indicating "Duplicate metric found". One example of the message is below.

Q1: How can I prevent these WARN messages from being written to the OTel log file? Q2: What is causing them?

2022-03-16T03:58:10.818Z WARN [email protected]/grouped_metric.go:66 Duplicate metric found { "kind": "exporter", "name": "awsemf/application", "Name": "jvm.gc.pause", "Labels": { "action": "end of minor GC", "cause": "G1 Evacuation Pause" } }

georges-git commented Mar 18 '22 15:03

Can you please provide a copy of your collector config?

bryan-aguilar commented Mar 18 '22 16:03

The OTel sidecar config value is /etc/ecs/container-insights/otel-task-metrics-config.yaml. It is the standard config from the AWS git repo. Here is the complete URL: https://github.com/aws-observability/aws-otel-collector/blob/main/config/ecs/container-insights/otel-task-metrics-config.yaml

It looks like all the Spring Boot metrics emitted via statsd are flooding the log files with the WARN "Duplicate metric found" message, as indicated previously.

georges-git commented Mar 18 '22 20:03

Here is a simple Spring Boot application which produces offending metrics: https://github.com/p120ph37/statsd-example

As George mentioned above, this requires an OTel collector configuration with the statsd receiver enabled, such as the container-insights/otel-task-metrics-config.yaml, or the base config.yaml with a statsd endpoint added like this:

receivers:
  statsd:
    endpoint: 0.0.0.0:8125
    aggregation_interval: 60s
... 
service:
  pipelines:
    traces:
      receivers: [otlp,awsxray]
      processors: [batch/traces]
      exporters: [awsxray]
    metrics:
      receivers: [statsd,otlp]
      processors: [batch/metrics]
      exporters: [awsemf]

In my example app, the specific offending metric is jvm.gc.pause. I forced this artificially by calling System.gc() for reproducibility's sake, but in a real-world scenario GC events are often triggered by actual application load and activity, so the issue may not be immediately obvious until the app comes under load.

Looking into how and where this metric is generated, we see that it is produced within micrometer-core, in the JvmGcMetrics class here. That code produces a Timer record each time GC runs. For stateless MeterRegistry implementations, such as micrometer-meterregistry-statsd, each record is emitted immediately by default, rather than accumulated into a histogram as would be the case for something stateful like micrometer-meterregistry-prometheus. This means that each time a GC event fires, a separate UDP packet is emitted to the statsd address.

In the case where the statsd endpoint is an aws-otel-collector instance, these Timer datapoints are handled separately by default and passed through the pipeline to the exporter, which is awsemf in this particular scenario. Although the default pipeline does do some batching, the batchprocessor module does not aggregate Timer metrics into histograms, so it ends up passing them as an array of individual items to awsemfexporter, which sends them as separate EMF documents to the CloudWatch metrics log stream. This can be very inefficient in some cases, particularly when dealing with very high-volume Timer metrics. The example GC metric I've discussed so far is seldom very high-frequency, but other metrics are, such as HikariCP's hikaricp.connections.* metrics (defined here), or especially Spring MVC's server.http.request, which emits once per request and can easily amount to hundreds or thousands of EMF documents per batch! This runs the risk of overloading the CloudWatch log stream throttle and of incurring unexpectedly large costs for datapoint ingestion.

The most direct workaround right now appears to be to add a timer_histogram_mapping directive to the statsd receiver configuration:

receivers:
  statsd:
    endpoint: 0.0.0.0:8125
    aggregation_interval: 60s
    timer_histogram_mapping:
      - statsd_type: histogram
        observer_type: summary
      - statsd_type: timing
        observer_type: summary

That causes the statsd receiver to aggregate Timer datapoints into a summary, and as long as the aggregation_interval on that receiver is less than or equal to any batchprocessor timeout value for the pipeline, you do not get duplicate EMF documents or warnings from frequent statsd Timer metrics.
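
For reference, here is a minimal, self-contained sketch of that alignment (the 60s values and the default namespace are illustrative only, not a recommendation):

receivers:
  statsd:
    endpoint: 0.0.0.0:8125
    aggregation_interval: 60s        # Timer datapoints are aggregated and flushed once per interval
    timer_histogram_mapping:
      - statsd_type: histogram
        observer_type: summary
      - statsd_type: timing
        observer_type: summary
processors:
  batch/metrics:
    timeout: 60s                     # keep this >= the statsd aggregation_interval
exporters:
  awsemf:
    namespace: default
service:
  pipelines:
    metrics:
      receivers: [statsd]
      processors: [batch/metrics]
      exporters: [awsemf]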

Another possible workaround is to avoid the micrometer-meterregistry-statsd module and the statsd receiver altogether for Spring Boot / Micrometer metrics, and instead use the micrometer-meterregistry-prometheus module with the Prometheus receiver, which tend to handle metric aggregation more efficiently. (A micrometer-meterregistry-otlp module might be good to have, but such a thing doesn't seem to exist quite yet, though the beginnings of it seem to be starting in the java-instrumentation project.)
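
A rough sketch of that Prometheus-based alternative, assuming the application exposes Micrometer's Prometheus registry through Spring Boot Actuator at /actuator/prometheus on port 8080 in the same task (the job name, port, and scrape interval below are hypothetical):

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: spring-app                 # hypothetical job name
          metrics_path: /actuator/prometheus   # assumes actuator + the Prometheus registry are on the classpath
          scrape_interval: 60s
          static_configs:
            - targets: ['localhost:8080']      # assumes the app listens on 8080 in the same task
processors:
  batch/metrics:
exporters:
  awsemf:
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch/metrics]
      exporters: [awsemf]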

Besides the scenario where duplicate datapoints arrive due to frequent statsd Timer events, there is also an issue where flattening metric dimensions within the awsemfexporter by using a metric_declarations directive can result in the flattened metrics being considered duplicates within the awsemfexporter module itself, generating the same warnings. In that scenario, adjusting the statsd receiver configuration doesn't help, because the duplication arises later in the pipeline. The workaround there is that it is usually more reliable to flatten dimensions with the metricstransformprocessor module's aggregate_labels action than to rely on awsemfexporter to do it; however, that cannot be applied to "summary" data types due to open-telemetry/opentelemetry-collector-contrib#6504.
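
For completeness, a metricstransformprocessor sketch of that aggregate_labels approach might look like the following (illustrative only; it reuses the jvm.gc.pause labels from the example above and is subject to the summary-type limitation just mentioned):

processors:
  metricstransform:
    transforms:
      - include: jvm.gc.pause
        action: update
        operations:
          - action: aggregate_labels
            label_set: [action]        # keep only the "action" dimension; "cause" is aggregated away
            aggregation_type: sum      # illustrative; choose whichever aggregation fits the metric type

The metricstransform processor would then be added to the metrics pipeline's processors list, e.g. processors: [metricstransform, batch/metrics], so the flattening happens before the metrics reach awsemfexporter.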

p120ph37 commented Mar 25 '22 22:03

Here is an example of the metrics log stream content before applying timer_histogram_mapping in the statsd receiver:

2022-03-24T19:12:42.324-05:00 {
    "_aws": {
        "CloudWatchMetrics": [
            {
                "Namespace": "default",
                "Dimensions": [
                    [
                        "action",
                        "cause"
                    ],
                    [],
                    [
                        "action"
                    ],
                    [
                        "cause"
                    ]
                ],
                "Metrics": [
                    {
                        "Name": "jvm.gc.pause"
                    }
                ]
            }
        ],
        "Timestamp": 1648167162324
    },
    "action": "end of major GC",
    "cause": "System.gc()",
    "jvm.gc.pause": 79
}
2022-03-24T19:12:52.309-05:00 {
    "_aws": {
        "CloudWatchMetrics": [
            {
                "Namespace": "default",
                "Dimensions": [
                    [
                        "action",
                        "cause"
                    ],
                    [],
                    [
                        "action"
                    ],
                    [
                        "cause"
                    ]
                ],
                "Metrics": [
                    {
                        "Name": "jvm.gc.pause"
                    }
                ]
            }
        ],
        "Timestamp": 1648167172309
    },
    "action": "end of major GC",
    "cause": "System.gc()",
    "jvm.gc.pause": 63
}
2022-03-24T19:13:02.091-05:00 {
    "_aws": {
        "CloudWatchMetrics": [
            {
                "Namespace": "default",
                "Dimensions": [
                    [
                        "action",
                        "cause"
                    ],
                    [],
                    [
                        "action"
                    ],
                    [
                        "cause"
                    ]
                ],
                "Metrics": [
                    {
                        "Name": "jvm.gc.pause"
                    }
                ]
            }
        ],
        "Timestamp": 1648167182091
    },
    "action": "end of major GC",
    "cause": "System.gc()",
    "jvm.gc.pause": 46
}

Versus after:

2022-03-25T16:32:03.808-05:00 {
    "_aws": {
        "CloudWatchMetrics": [
            {
                "Namespace": "default",
                "Dimensions": [
                    [
                        "action",
                        "cause"
                    ],
                    [],
                    [
                        "action"
                    ],
                    [
                        "cause"
                    ]
                ],
                "Metrics": [
                    {
                        "Name": "jvm.gc.pause"
                    }
                ]
            }
        ],
        "Timestamp": 1648243923808
    },
    "action": "end of major GC",
    "cause": "System.gc()",
    "jvm.gc.pause": {
        "Max": 62,
        "Min": 56,
        "Count": 3,
        "Sum": 176
    }
}

p120ph37 commented Mar 25 '22 23:03

Hello, any update on what is causing these duplicate warning messages for Spring Boot metrics sent via the micrometer-statsd layer with the standard ADOT config.yaml available here: https://github.com/aws-observability/aws-otel-collector/blob/main/config/ecs/container-insights/otel-task-metrics-config.yaml

georges-git commented Mar 30 '22 22:03

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 30 days.

github-actions[bot] commented Aug 21 '22 20:08
