opentelemetry-dotnet icon indicating copy to clipboard operation
opentelemetry-dotnet copied to clipboard

Prometheus exporter sometimes produces malformed data

Open hez2010 opened this issue 1 year ago • 4 comments

Bug Report

List of all OpenTelemetry NuGet packages and version that you are using (e.g. OpenTelemetry 1.0.2):

    <PackageVersion Include="OpenTelemetry.Exporter.OpenTelemetryProtocol" Version="1.8.0" />
    <PackageVersion Include="OpenTelemetry.Exporter.Prometheus.AspNetCore" Version="1.8.0-rc.1" />
    <PackageVersion Include="OpenTelemetry.Extensions.Hosting" Version="1.8.0" />
    <PackageVersion Include="OpenTelemetry.Instrumentation.AspNetCore" Version="1.8.0" />
    <PackageVersion Include="OpenTelemetry.Instrumentation.EntityFrameworkCore" Version="1.0.0-beta.11" />
    <PackageVersion Include="OpenTelemetry.Instrumentation.Http" Version="1.8.0" />
    <PackageVersion Include="OpenTelemetry.Instrumentation.Process" Version="0.5.0-beta.5" />
    <PackageVersion Include="OpenTelemetry.Instrumentation.Runtime" Version="1.8.0" />
    <PackageVersion Include="OpenTelemetry.Instrumentation.StackExchangeRedis" Version="1.0.0-rc9.14" />

Runtime version: net8.0

Symptom

Sometimes the exporter produces a malformed line:

dns_lookup_duration_seconds_bucket{otel_scope_name="System.Net.NameRe# TYPE otel_scope_info info

and causes Prometheus to fail:

expected label value, got "\"System.Net.NameRe# TYPE otel_scope_info info\n" ("INVALID") while parsing: "dns_lookup_duration_seconds_bucket{otel_scope_name=\"System.Net.NameRe# TYPE otel_scope_info info\n"

What is the expected behavior?

There should not be malformed data in the metrics.

# TYPE target info
# HELP target Target metadata
target_info{service_name="foo",service_instance_id="7272cdc6-1b69-4ee3-9bed-4005847686d5",telemetry_sdk_name="opentelemetry",telemetry_sdk_language="dotnet",telemetry_sdk_version="1.8.0"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="System.Net.NameResolution"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="OpenTelemetry.Instrumentation.Runtime"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="OpenTelemetry.Instrumentation.Process"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="Microsoft.AspNetCore.Server.Kestrel"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="Microsoft.AspNetCore.Hosting"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="System.Net.Http"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="Microsoft.AspNetCore.RateLimiting"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="Microsoft.AspNetCore.Routing"} 1
# TYPE dns_lookup_duration_seconds histogram
# UNIT dns_lookup_duration_seconds seconds
# HELP dns_lookup_duration_seconds Measures the time taken to perform a DNS lookup.
dns_lookup_duration_seconds_bucket{otel_scope_name="System.Net.NameResolution",dns_question_name="localhost",le="0.005"} 7 1712551059.936
...

What is the actual behavior?

# TYPE dns_lookup_duration_seconds histogram
# UNIT dns_lookup_duration_seconds seconds
# HELP dns_lookup_duration_seconds Measures the time taken to perform a DNS lookup.
dns_lookup_duration_seconds_bucket{otel_scope_name="System.Net.NameRe# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="System.Net.NameResolution"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="OpenTelemetry.Instrumentation.Runtime"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="OpenTelemetry.Instrumentation.Process"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="Microsoft.AspNetCore.Server.Kestrel"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="Microsoft.AspNetCore.Hosting"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="System.Net.Http"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="Microsoft.AspNetCore.RateLimiting"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="Microsoft.AspNetCore.Routing"} 1
# TYPE dns_lookup_duration_seconds histogram
# UNIT dns_lookup_duration_seconds seconds
# HELP dns_lookup_duration_seconds Measures the time taken to perform a DNS lookup.
dns_lookup_duration_seconds_bucket{otel_scope_name="System.Net.NameResolution",dns_question_name="localhost",le="0.005"} 10 1712549381.922
...

Reproduce

Simply adding some instrumentations.

            var otel = services.AddOpenTelemetry();

            otel.ConfigureResource(resource => resource.AddService("foo"));

            otel.WithMetrics(metrics =>
            {
                metrics.AddAspNetCoreInstrumentation();
                metrics.AddHttpClientInstrumentation();
                metrics.AddRuntimeInstrumentation();
                metrics.AddProcessInstrumentation();

                metrics.AddPrometheusExporter(options =>
                {
                    options.DisableTotalNameSuffixForCounters = true;
                });
            });

            otel.WithTracing(tracing =>
            {
                tracing.AddAspNetCoreInstrumentation();
                tracing.AddHttpClientInstrumentation();
                tracing.AddEntityFrameworkCoreInstrumentation();
                tracing.AddRedisInstrumentation();
            });

            otel.UseOtlpExporter(...);

            // ...
            // var app = builder.Build();

            app.UseOpenTelemetryPrometheusScrapingEndpoint();

Additional Context

It looks like the buffer is not cleared correctly and results in some previous data remaining in the buffer:

# TYPE dns_lookup_duration_seconds histogram
# UNIT dns_lookup_duration_seconds seconds
# HELP dns_lookup_duration_seconds Measures the time taken to perform a DNS lookup.
dns_lookup_duration_seconds_bucket{otel_scope_name="System.Net.NameRe

hez2010 avatar Apr 08 '24 04:04 hez2010

It seems the target info is being overwritten unexpectedly.

hez2010 avatar Apr 08 '24 04:04 hez2010

@hez2010 I've seen your MR's, looking forward to the fix as well. Out of curiosity: Besides the OP's issue the format also seems to be changed to add things like process_runtime_dotnet_gc_collections_count_total**{otel_scope_name="OpenTelemetry.Instrumentation.Runtime",otel_scope_version="1.8.0",generation="gen2"}** 3 1714993562978

Is that text between brackets expected by Prometheus ?

Perhaps you can decide if this has something to do with it as well: ts=2024-05-06T09:57:43.000Z caller=scrape.go:1415 level=debug component="scrape manager" scrape_pool=hrgt-t-api-metrics target=http://myserveraddress.svc.cluster.local:9090/metrics msg="Append failed" err="unit \"seconds\" not a suffix of metric \"process_cpu_time_seconds_total\""

Ruud-cb avatar May 06 '24 11:05 Ruud-cb

I think this issue is maybe what is causing https://github.com/open-telemetry/opentelemetry-dotnet-contrib/issues/1617?

samsp-msft avatar May 10 '24 23:05 samsp-msft

#5517 is trying to fix the bug.

reyang avatar May 13 '24 05:05 reyang