Prometheus exporter sometimes produces malformed data
Bug Report
List of all OpenTelemetry NuGet packages and version that you are using (e.g. OpenTelemetry 1.0.2):
<PackageVersion Include="OpenTelemetry.Exporter.OpenTelemetryProtocol" Version="1.8.0" />
<PackageVersion Include="OpenTelemetry.Exporter.Prometheus.AspNetCore" Version="1.8.0-rc.1" />
<PackageVersion Include="OpenTelemetry.Extensions.Hosting" Version="1.8.0" />
<PackageVersion Include="OpenTelemetry.Instrumentation.AspNetCore" Version="1.8.0" />
<PackageVersion Include="OpenTelemetry.Instrumentation.EntityFrameworkCore" Version="1.0.0-beta.11" />
<PackageVersion Include="OpenTelemetry.Instrumentation.Http" Version="1.8.0" />
<PackageVersion Include="OpenTelemetry.Instrumentation.Process" Version="0.5.0-beta.5" />
<PackageVersion Include="OpenTelemetry.Instrumentation.Runtime" Version="1.8.0" />
<PackageVersion Include="OpenTelemetry.Instrumentation.StackExchangeRedis" Version="1.0.0-rc9.14" />
Runtime version: net8.0
Symptom
Sometimes the exporter produces a malformed line:
dns_lookup_duration_seconds_bucket{otel_scope_name="System.Net.NameRe# TYPE otel_scope_info info
and causes Prometheus to fail:
expected label value, got "\"System.Net.NameRe# TYPE otel_scope_info info\n" ("INVALID") while parsing: "dns_lookup_duration_seconds_bucket{otel_scope_name=\"System.Net.NameRe# TYPE otel_scope_info info\n"
What is the expected behavior?
There should not be malformed data in the metrics.
# TYPE target info
# HELP target Target metadata
target_info{service_name="foo",service_instance_id="7272cdc6-1b69-4ee3-9bed-4005847686d5",telemetry_sdk_name="opentelemetry",telemetry_sdk_language="dotnet",telemetry_sdk_version="1.8.0"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="System.Net.NameResolution"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="OpenTelemetry.Instrumentation.Runtime"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="OpenTelemetry.Instrumentation.Process"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="Microsoft.AspNetCore.Server.Kestrel"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="Microsoft.AspNetCore.Hosting"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="System.Net.Http"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="Microsoft.AspNetCore.RateLimiting"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="Microsoft.AspNetCore.Routing"} 1
# TYPE dns_lookup_duration_seconds histogram
# UNIT dns_lookup_duration_seconds seconds
# HELP dns_lookup_duration_seconds Measures the time taken to perform a DNS lookup.
dns_lookup_duration_seconds_bucket{otel_scope_name="System.Net.NameResolution",dns_question_name="localhost",le="0.005"} 7 1712551059.936
...
What is the actual behavior?
# TYPE dns_lookup_duration_seconds histogram
# UNIT dns_lookup_duration_seconds seconds
# HELP dns_lookup_duration_seconds Measures the time taken to perform a DNS lookup.
dns_lookup_duration_seconds_bucket{otel_scope_name="System.Net.NameRe# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="System.Net.NameResolution"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="OpenTelemetry.Instrumentation.Runtime"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="OpenTelemetry.Instrumentation.Process"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="Microsoft.AspNetCore.Server.Kestrel"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="Microsoft.AspNetCore.Hosting"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="System.Net.Http"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="Microsoft.AspNetCore.RateLimiting"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="Microsoft.AspNetCore.Routing"} 1
# TYPE dns_lookup_duration_seconds histogram
# UNIT dns_lookup_duration_seconds seconds
# HELP dns_lookup_duration_seconds Measures the time taken to perform a DNS lookup.
dns_lookup_duration_seconds_bucket{otel_scope_name="System.Net.NameResolution",dns_question_name="localhost",le="0.005"} 10 1712549381.922
...
Reproduce
Simply adding some instrumentations.
var otel = services.AddOpenTelemetry();
otel.ConfigureResource(resource => resource.AddService("foo"));
otel.WithMetrics(metrics =>
{
metrics.AddAspNetCoreInstrumentation();
metrics.AddHttpClientInstrumentation();
metrics.AddRuntimeInstrumentation();
metrics.AddProcessInstrumentation();
metrics.AddPrometheusExporter(options =>
{
options.DisableTotalNameSuffixForCounters = true;
});
});
otel.WithTracing(tracing =>
{
tracing.AddAspNetCoreInstrumentation();
tracing.AddHttpClientInstrumentation();
tracing.AddEntityFrameworkCoreInstrumentation();
tracing.AddRedisInstrumentation();
});
otel.UseOtlpExporter(...);
// ...
// var app = builder.Build();
app.UseOpenTelemetryPrometheusScrapingEndpoint();
Additional Context
It looks like the buffer is not cleared correctly and results in some previous data remaining in the buffer:
# TYPE dns_lookup_duration_seconds histogram
# UNIT dns_lookup_duration_seconds seconds
# HELP dns_lookup_duration_seconds Measures the time taken to perform a DNS lookup.
dns_lookup_duration_seconds_bucket{otel_scope_name="System.Net.NameRe
It seems the target info is being overwritten unexpectedly.
@hez2010 I've seen your MR's, looking forward to the fix as well. Out of curiosity: Besides the OP's issue the format also seems to be changed to add things like process_runtime_dotnet_gc_collections_count_total**{otel_scope_name="OpenTelemetry.Instrumentation.Runtime",otel_scope_version="1.8.0",generation="gen2"}** 3 1714993562978
Is that text between brackets expected by Prometheus ?
Perhaps you can decide if this has something to do with it as well:
ts=2024-05-06T09:57:43.000Z caller=scrape.go:1415 level=debug component="scrape manager" scrape_pool=hrgt-t-api-metrics target=http://myserveraddress.svc.cluster.local:9090/metrics msg="Append failed" err="unit \"seconds\" not a suffix of metric \"process_cpu_time_seconds_total\""
I think this issue is maybe what is causing https://github.com/open-telemetry/opentelemetry-dotnet-contrib/issues/1617?
#5517 is trying to fix the bug.