aws-otel-collector
How to prevent duplicate springboot metrics from OTel collector using awsemfexporter
Hello, I am running the OTel collector as a sidecar for an app deployed in Fargate. The awsemfexporter is flooding the OTel log with messages indicating "Duplicate metric found". One example of the message is below.
Q1: How can I prevent these WARN messages from being written to the OTel log file? Q2: What is causing this?
2022-03-16T03:58:10.818Z WARN awsemfexporter@<version>/grouped_metric.go:66 Duplicate metric found { "kind": "exporter", "name": "awsemf/application", "Name": "jvm.gc.pause", "Labels": { "action": "end of minor GC", "cause": "G1 Evacuation Pause" } }
Can you please provide a copy of your collector config?
The OTel sidecar config value is /etc/ecs/container-insights/otel-task-metrics-config.yaml. It is the standard config from the AWS git repo. Here is the complete URL: https://github.com/aws-observability/aws-otel-collector/blob/main/config/ecs/container-insights/otel-task-metrics-config.yaml
It looks like all the Spring Boot metrics emitted via statsd are flooding the log files with the WARN "Duplicate metric found" message, as indicated previously.
Here is a simple Spring Boot application that produces the offending metrics: https://github.com/p120ph37/statsd-example
As George mentioned above, this requires an OTel collector configuration with statsd enabled, such as the container-insights/otel-task-metrics-config.yaml, or the base config.yaml with a statsd endpoint added like this:
receivers:
  statsd:
    endpoint: 0.0.0.0:8125
    aggregation_interval: 60s
...
service:
  pipelines:
    traces:
      receivers: [otlp,awsxray]
      processors: [batch/traces]
      exporters: [awsxray]
    metrics:
      receivers: [statsd,otlp]
      processors: [batch/metrics]
      exporters: [awsemf]
In my example app, the specific offending metric is jvm.gc.pause. I forced this artificially by calling System.gc() for reproducibility's sake, but in a real-world scenario GC events are often triggered by actual application load and activity, so the issue may not be immediately obvious until the app comes under load.
Looking into how and where this metric is generated, we see that it is produced within micrometer-core, in the JvmGcMetrics class here. That code produces a Timer record each time GC runs. For stateless MeterRegistry implementations, such as micrometer-meterregistry-statsd, each record is emitted immediately by default, rather than accumulated into a histogram as would be the case for something stateful like micrometer-meterregistry-prometheus. This means that each time a GC event fires, a separate UDP packet is emitted to the statsd address.
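For illustration, a single GC pause of 79ms would go out as one UDP line roughly like the one below (an approximation that assumes Micrometer's default Datadog-flavored statsd output; the exact line format is an assumption on my part, not something captured in this thread):

jvm.gc.pause:79|ms|#action:end of minor GC,cause:G1 Evacuation Pause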
In the case where the statsd collector is an aws-otel-collector instance, these Timer datapoints are handled separately by default and passed through the pipeline to the exporter, which is awsemf in this particular scenario. Although the default pipeline does do some batching, the batchprocessor module does not aggregate Timer metrics into histograms, so it ends up passing them as an array of individual items to awsemfexporter, which sends them as separate EMF documents to the CloudWatch metrics logstream. This can be very inefficient in some cases, particularly when dealing with very-high-volume Timer metrics. The example GC metric I've discussed so far is seldom very high-frequency, but other metrics are, such as HikariCP's hikaricp.connections.* metrics (defined here), or especially Spring MVC's http.server.requests, which emits once per request and can easily amount to hundreds or thousands of EMF documents per batch! This runs the risk of overloading the CloudWatch logstream throttle and of incurring unexpectedly large costs for datapoint ingestion.
The most direct workaround right now appears to be to add a timer_histogram_mapping directive to the statsd receiver:
receivers:
  statsd:
    endpoint: 0.0.0.0:8125
    aggregation_interval: 60s
    timer_histogram_mapping:
      - statsd_type: histogram
        observer_type: summary
      - statsd_type: timing
        observer_type: summary
That causes the statsd receiver to aggregate Timer datapoints into a summary, and as long as the aggregation_interval on that receiver is equal to or less than any batchprocessor timeout value for that pipeline, you do not get the duplicate EMF documents or warnings due to frequent statsd Timer metrics.
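For concreteness, a minimal sketch of the corresponding batch processor setting, using the batch/metrics processor name from the standard config above; the 60s value is illustrative:

processors:
  batch/metrics:
    # per the paragraph above: the statsd receiver's aggregation_interval
    # (60s in the example) should be equal to or less than this timeout
    timeout: 60s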
Another possible workaround is to avoid using the micrometer-meterregistry-statsd and statsd receiver modules altogether for Spring Boot / Micrometer metrics, and instead use the micrometer-meterregistry-prometheus and prometheus receiver modules, which tend to handle metric aggregation more efficiently. (A micrometer-meterregistry-otlp module might be good to have, but such a thing doesn't seem to exist quite yet, though the beginnings of it seem to be starting in the java-instrumentation project.)
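As a rough sketch of the collector side of that alternative (not taken from this thread): this assumes the app exposes Spring Boot Actuator's Prometheus endpoint, and the job name, port, and scrape interval are placeholders:

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'spring-app'                 # placeholder job name
          metrics_path: /actuator/prometheus     # Spring Boot Actuator's Prometheus endpoint
          scrape_interval: 60s
          static_configs:
            - targets: ['localhost:8080']        # placeholder app address

service:
  pipelines:
    metrics:
      receivers: [prometheus,otlp]
      processors: [batch/metrics]
      exporters: [awsemf]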
Besides the scenario where duplicate datapoints arrive due to frequent statsd Timer events, there is also an issue where flattening metric dimensions within the awsemfexporter by using a metric_declarations directive can result in the flattened metrics being considered duplicates within the awsemfexporter module itself and generating warnings. In that scenario, adjusting the statsd receiver configuration isn't helpful because the duplication comes about later in the pipeline. The workaround for that issue is that it is usually more reliable to flatten dimensions by using the metricstransformprocessor module's aggregate_labels action than to rely on awsemfexporter to do it; however, that cannot be applied to "summary" datatypes due to open-telemetry/opentelemetry-collector-contrib#6504
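To make that concrete, here is a hedged sketch of the metricstransformprocessor approach; the metric name, retained label, and aggregation type are illustrative assumptions, and as noted above this only applies to non-summary datatypes such as gauges or sums:

processors:
  metricstransform:
    transforms:
      - include: jvm.memory.used            # illustrative gauge metric (not a summary)
        action: update
        operations:
          - action: aggregate_labels
            label_set: [area]               # keep "area", aggregate away other labels such as "id"
            aggregation_type: sum

service:
  pipelines:
    metrics:
      receivers: [statsd,otlp]
      processors: [metricstransform,batch/metrics]
      exporters: [awsemf]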
Here is an example of the metrics logstream content before applying timer_histogram_mapping in the statsd receiver:
2022-03-24T19:12:42.324-05:00 {
"_aws": {
"CloudWatchMetrics": [
{
"Namespace": "default",
"Dimensions": [
[
"action",
"cause"
],
[],
[
"action"
],
[
"cause"
]
],
"Metrics": [
{
"Name": "jvm.gc.pause"
}
]
}
],
"Timestamp": 1648167162324
},
"action": "end of major GC",
"cause": "System.gc()",
"jvm.gc.pause": 79
}
2022-03-24T19:12:52.309-05:00 {
"_aws": {
"CloudWatchMetrics": [
{
"Namespace": "default",
"Dimensions": [
[
"action",
"cause"
],
[],
[
"action"
],
[
"cause"
]
],
"Metrics": [
{
"Name": "jvm.gc.pause"
}
]
}
],
"Timestamp": 1648167172309
},
"action": "end of major GC",
"cause": "System.gc()",
"jvm.gc.pause": 63
}
2022-03-24T19:13:02.091-05:00 {
"_aws": {
"CloudWatchMetrics": [
{
"Namespace": "default",
"Dimensions": [
[
"action",
"cause"
],
[],
[
"action"
],
[
"cause"
]
],
"Metrics": [
{
"Name": "jvm.gc.pause"
}
]
}
],
"Timestamp": 1648167182091
},
"action": "end of major GC",
"cause": "System.gc()",
"jvm.gc.pause": 46
}
Versus after:
2022-03-25T16:32:03.808-05:00 {
"_aws": {
"CloudWatchMetrics": [
{
"Namespace": "default",
"Dimensions": [
[
"action",
"cause"
],
[],
[
"action"
],
[
"cause"
]
],
"Metrics": [
{
"Name": "jvm.gc.pause"
}
]
}
],
"Timestamp": 1648243923808
},
"action": "end of major GC",
"cause": "System.gc()",
"jvm.gc.pause": {
"Max": 62,
"Min": 56,
"Count": 3,
"Sum": 176
}
}
Hello, any update on the cause of these "Duplicate metric found" warning messages for Spring Boot metrics sent via the micrometer-statsd layer, using the standard ADOT config.yaml available here: https://github.com/aws-observability/aws-otel-collector/blob/main/config/ecs/container-insights/otel-task-metrics-config.yaml
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 30 days.