opentelemetry-collector-contrib

Health check extension returns 200 status code during errors

Open william-tran opened this issue 2 years ago • 5 comments

Describe the bug: When I simulate exporter errors with check_collector_pipeline enabled in the health check extension, the health check still returns 200. I get a response like:

$ while true; do python3 test.py; curl -v localhost:13133; done

Traces cannot be uploaded; HTTP status code: 500, message: Internal Server Error
*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 13133 (#0)
> GET / HTTP/1.1
> Host: localhost:13133
> User-Agent: curl/7.64.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Mon, 07 Mar 2022 17:06:40 GMT
< Content-Length: 0
<
* Connection #0 to host localhost left intact
* Closing connection 0

Steps to reproduce

Run v0.46.0 with this config.yaml:

receivers:
  jaeger:
    protocols:
      thrift_http:

extensions:
  health_check:
    check_collector_pipeline:
      enabled: true
      interval: 1s
      exporter_failure_threshold: 1

exporters:
  otlphttp:
    # should result in connection refused
    endpoint: "http://localhost:55555"
    retry_on_failure:
      enabled: true
    sending_queue:
      enabled: true
      queue_size: 10

service:
  extensions:
    - health_check
  pipelines:
    traces:
      receivers:
        - jaeger
      exporters:
        - otlphttp

Then use this Python script, test.py:

from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

trace.set_tracer_provider(
    TracerProvider(
        resource=Resource.create({SERVICE_NAME: "my-helloworld-service"})
    )
)
tracer = trace.get_tracer(__name__)

# create a JaegerExporter
jaeger_exporter = JaegerExporter(
    collector_endpoint='http://localhost:14268/api/traces?format=jaeger.thrift',
)

# Create a BatchSpanProcessor and add the exporter to it
span_processor = BatchSpanProcessor(jaeger_exporter)

# add to the tracer
trace.get_tracer_provider().add_span_processor(span_processor)

with tracer.start_as_current_span("hello"):
    print("Hello world from OpenTelemetry Python!")

and this requirements.txt:

opentelemetry-api
opentelemetry-sdk
opentelemetry-exporter-jaeger-thrift

$ pip install -r requirements.txt

Execute in a loop until you see errors:

$ while true; do python3 test.py; curl -v localhost:13133; done
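The curl half of that loop can also be sketched in Python, which makes it easier to log the sequence of status codes over time. This is only an illustration: poll_health is a hypothetical helper name, and the host and port are assumed to be the localhost:13133 used in the curl command above.

```python
import http.client
import time


def poll_health(host="localhost", port=13133, attempts=5, delay=1.0):
    """Return the HTTP status code seen on each attempt (None if unreachable)."""
    codes = []
    for i in range(attempts):
        conn = http.client.HTTPConnection(host, port, timeout=2)
        try:
            conn.request("GET", "/")
            # http.client does not raise on 5xx, so the code is recorded as-is
            codes.append(conn.getresponse().status)
        except (OSError, http.client.HTTPException):
            # connection refused, timeout, or a malformed response
            codes.append(None)
        finally:
            conn.close()
        if i < attempts - 1:
            time.sleep(delay)
    return codes
```

Calling print(poll_health()) while test.py runs in another terminal shows whether the endpoint ever leaves 200.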

What did you expect to see? Health check eventually responds with a 5xx status code

What did you see instead? Health check always responds with a 200 status code

What version did you use? 0.46.0

What config did you use? See above

Environment OS: locally tested on OSX

william-tran, Mar 07 '22 17:03

I'm assigning this to myself as I'm the code owner, but I believe we haven't yet implemented reporting of the state of individual components.

jpkrohling, Mar 07 '22 17:03

@jpkrohling sorry, this might be a red herring. When I use interval: 1m instead, it eventually returns 500, but after a minute it reverts to 200.
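One possible reading of this, stated as an assumption and not confirmed anywhere in this thread, is that failures are only counted while they fall inside a window of length interval. The toy model below is not the extension's code; FailureWindow and its methods are hypothetical names used purely to illustrate that reading.

```python
from collections import deque


class FailureWindow:
    """Hypothetical model: unhealthy while at least `threshold` failures
    lie within the last `interval` seconds."""

    def __init__(self, interval, threshold):
        self.interval = interval
        self.threshold = threshold
        self._failures = deque()  # timestamps of recorded failures

    def record_failure(self, now):
        self._failures.append(now)

    def status(self, now):
        # Drop failures that have aged out of the window.
        while self._failures and now - self._failures[0] > self.interval:
            self._failures.popleft()
        return 500 if len(self._failures) >= self.threshold else 200


window = FailureWindow(interval=60, threshold=1)
window.record_failure(now=0)
print(window.status(now=30))  # 500: the failure is still inside the window
print(window.status(now=61))  # 200: the failure has aged out
```

Under this model, interval: 1s leaves almost no time to observe the 500 before it expires, which would match always seeing 200 in the curl loop.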

william-tran, Mar 07 '22 17:03

More context: when running a traces exporter like otlp or kafka, sometimes the TCP connection dies, but there is no built-in connection restart, so the exporter queue starts filling up. I want to restart otel-collector to reestablish connections. Ideally this would be done before data loss occurs when you hit exporter queue capacity. Exposing this as a metric in https://github.com/open-telemetry/opentelemetry-collector/issues/4902 and then configuring a percent of capacity threshold for health check failure like "signal unhealthy when capacity reaches 95%" would be a way to prevent data loss.
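The "signal unhealthy when capacity reaches 95%" idea could look roughly like the sketch below. Nothing like this exists in the health check extension as far as this thread shows: queue_health_status and unhealthy_ratio are hypothetical names, and the queue size and capacity are assumed to come from the metric proposed in the linked issue.

```python
def queue_health_status(queue_size, queue_capacity, unhealthy_ratio=0.95):
    """Return 500 once the exporter queue is at or above the given fill ratio.

    Hypothetical helper: the inputs are assumed to come from an exporter
    queue metric that the collector would need to expose.
    """
    if queue_capacity <= 0:
        raise ValueError("queue_capacity must be positive")
    fill = queue_size / queue_capacity
    return 500 if fill >= unhealthy_ratio else 200


# With queue_size: 10 from the config above:
print(queue_health_status(9, 10))   # 200 (90% full)
print(queue_health_status(10, 10))  # 500 (100% full)
```

With a queue as small as 10, the 95% threshold is only crossed when the queue is completely full, so a larger queue_size would be needed for the restart to happen before data loss.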

william-tran, Mar 07 '22 17:03

I reported the same issue in https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/11780, with more technical details (e.g. explaining why the health check initially serves status 500, but reverts to 200 after a minute).

ItsLastDay, Jul 01 '22 14:07

Pinging code owners: @jpkrohling. See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot], Sep 16 '22 17:09