opentelemetry-collector-contrib
Health check extension returns 200 status code during errors
Describe the bug
When I simulate exporter errors and use the health check extension with check_collector_pipeline
enabled, I get a response like:
$ while true; do python3 test.py; curl -v localhost:13133; done
Traces cannot be uploaded; HTTP status code: 500, message: Internal Server Error
* Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 13133 (#0)
> GET / HTTP/1.1
> Host: localhost:13133
> User-Agent: curl/7.64.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Mon, 07 Mar 2022 17:06:40 GMT
< Content-Length: 0
<
* Connection #0 to host localhost left intact
* Closing connection 0
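For contrast, the expected semantics can be illustrated with a minimal stand-in health endpoint (a sketch of the desired behavior, not the extension's actual code; the handler and status codes are illustrative assumptions): while the pipeline is healthy it answers 200, and once exporter failures are detected it should flip to a 5xx.

```python
# Sketch (not the extension's actual implementation) of a pipeline-aware
# health endpoint: 200 while healthy, 500 once the pipeline is failing.
import http.server
import threading
import urllib.error
import urllib.request

pipeline_healthy = True  # toggled by whatever monitors the exporters


class HealthHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # Report 200 while healthy, 500 once failures cross the threshold
        self.send_response(200 if pipeline_healthy else 500)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass


def status(url):
    """Return the HTTP status code, including error codes like 500."""
    try:
        return urllib.request.urlopen(url).status
    except urllib.error.HTTPError as e:
        return e.code


server = http.server.HTTPServer(("localhost", 0), HealthHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://localhost:{server.server_address[1]}/"

healthy_code = status(url)   # pipeline healthy
pipeline_healthy = False
failing_code = status(url)   # pipeline failing: should no longer be 200
server.shutdown()
```

This is what the bug report expects from the real extension: the same `curl localhost:13133` probe should stop returning 200 once the exporter keeps failing.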
Steps to reproduce
Run v0.46.0 with this config.yaml:
receivers:
  jaeger:
    protocols:
      thrift_http:

extensions:
  health_check:
    check_collector_pipeline:
      enabled: true
      interval: 1s
      exporter_failure_threshold: 1

exporters:
  otlphttp:
    # should result in connection refused
    endpoint: "http://localhost:55555"
    retry_on_failure:
      enabled: true
    sending_queue:
      enabled: true
      queue_size: 10

service:
  extensions:
    - health_check
  pipelines:
    traces:
      receivers:
        - jaeger
      exporters:
        - otlphttp
And with this Python script, test.py:
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

trace.set_tracer_provider(
    TracerProvider(
        resource=Resource.create({SERVICE_NAME: "my-helloworld-service"})
    )
)
tracer = trace.get_tracer(__name__)

# Create a JaegerExporter
jaeger_exporter = JaegerExporter(
    collector_endpoint='http://localhost:14268/api/traces?format=jaeger.thrift',
)

# Create a BatchSpanProcessor and add the exporter to it
span_processor = BatchSpanProcessor(jaeger_exporter)

# Add to the tracer provider
trace.get_tracer_provider().add_span_processor(span_processor)

with tracer.start_as_current_span("hello"):
    print("Hello world from OpenTelemetry Python!")
and requirements.txt
opentelemetry-api
opentelemetry-sdk
opentelemetry-exporter-jaeger-thrift
$ pip install -r requirements.txt
Execute in a loop until you see errors:
$ while true; do python3 test.py; curl -v localhost:13133; done
What did you expect to see? The health check eventually responds with a 5xx status code.
What did you see instead? The health check always responds with a 200 status code.
What version did you use? v0.46.0
What config did you use? See above.
Environment OS: locally tested on macOS.
I'm assigning this to myself as I'm the code owner, but I believe we haven't yet implemented reporting of the state of individual components.
@jpkrohling sorry, this might be a red herring: when I use interval: 1m
instead, the health check eventually returns 500, but after a minute it reverts to 200.
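A plausible, simplified model of why the status reverts (this is my assumption about the interval logic, not the extension's actual source): if exporter failures are counted per interval window and the counter resets when a new window starts, then a quiet window, e.g. while the exporter's retries back off, would make the check report healthy again even though nothing was fixed.

```python
# Simplified model (an assumption, not the extension's real implementation)
# of an interval-based failure check: the pipeline is reported unhealthy
# only while failures in the *current* window reach the threshold.
import time


class PipelineCheck:
    def __init__(self, interval_s, failure_threshold):
        self.interval_s = interval_s
        self.failure_threshold = failure_threshold
        self.window_start = time.monotonic()
        self.failures_in_window = 0

    def _maybe_roll(self):
        # A new interval resets the counter, so a window with no recorded
        # failures makes the status revert to healthy/200.
        now = time.monotonic()
        if now - self.window_start >= self.interval_s:
            self.window_start = now
            self.failures_in_window = 0

    def record_failure(self):
        self._maybe_roll()
        self.failures_in_window += 1

    def healthy(self):
        self._maybe_roll()
        return self.failures_in_window < self.failure_threshold


check = PipelineCheck(interval_s=0.1, failure_threshold=1)
check.record_failure()
unhealthy_now = not check.healthy()  # failure seen in the current window
time.sleep(0.15)                     # window rolls over with no new failures
healthy_again = check.healthy()      # status reverts, matching the report
```

Under this model, a longer interval (1m) just widens the window: the check flips to 500 once a failure lands, then flips back to 200 a window later, which matches the observed behavior.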
More context: when running a traces exporter like otlp or kafka, the TCP connection sometimes dies, and since there is no built-in connection restart, the exporter queue starts filling up. I want to restart otel-collector to reestablish connections, ideally before data loss occurs when the exporter queue hits capacity. Exposing queue utilization as a metric (https://github.com/open-telemetry/opentelemetry-collector/issues/4902) and then configuring a capacity threshold for health check failure, e.g. "signal unhealthy when the queue reaches 95% of capacity", would be a way to prevent data loss.
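The capacity-threshold idea in the comment above could be sketched like this (the function name, parameters, and 95% default are my assumptions; no such feature exists in the collector yet):

```python
def queue_healthy(queue_size, queue_capacity, max_utilization=0.95):
    """Signal unhealthy *before* data loss: fail the health check once the
    exporter's sending queue reaches the given fraction of its capacity.

    Hypothetical helper illustrating the proposal; not collector code.
    """
    if queue_capacity <= 0:
        return True  # no bounded queue configured, nothing to measure
    return queue_size / queue_capacity < max_utilization


# With queue_size: 10 from the config above and a 95% threshold:
half_full = queue_healthy(queue_size=5, queue_capacity=10)   # healthy
at_capacity = queue_healthy(queue_size=10, queue_capacity=10)  # unhealthy
```

Failing the health check at 95% rather than 100% gives an orchestrator (e.g. a Kubernetes liveness probe on port 13133) time to restart the collector and reestablish connections before spans are dropped.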
I reported the same issue in https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/11780, with more technical details (e.g. explaining why the health check initially serves status 500 but reverts to 200 after a minute).
Pinging code owners: @jpkrohling. See Adding Labels via Comments if you do not have permissions to add labels yourself.