
[healthcheckextensionv2] Introduce health check extension based on component status reporting

Open mwear opened this issue 1 year ago • 32 comments

Description: This PR introduces a candidate to replace the current health check extension. This extension is based on component status reporting and provides insight into collector health at the pipeline and component level. It includes an HTTP service and a gRPC service. The HTTP service has a status endpoint that provides the ability to probe the overall collector health and the health of individual pipelines. It also provides a config endpoint to obtain the config of the currently running collector.

The gRPC service is an implementation of the grpc_health_v1 service. While not as detailed as the HTTP service, it provides the ability to check the health of the collector overall and the health of individual pipelines via a unary or streaming RPC.

Looking at the README in this PR is a good place to start to understand the changes.
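For orientation, here is a minimal configuration sketch (illustrative only; the README in this PR has the authoritative schema):

extensions:
  healthcheckv2:
    grpc:
      endpoint: "0.0.0.0:13132"
      transport: "tcp"
    http:
      endpoint: "0.0.0.0:13133"
      status:
        enabled: true        # exposes the status endpoint for overall and per-pipeline health
      config:
        enabled: true        # exposes the currently running collector config

service:
  extensions: [healthcheckv2]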

I have temporarily named this extension healthcheckv2. We can discuss the best way to integrate this work going forward, but my reasoning for using a different name is to facilitate the code review (the diff would be hard to follow if we were to overwrite the current extension). If we decide to move forward with this extension and replace the existing one, we can handle the deletion of the old extension and the renaming in a followup PR.

Ultimately this extension will need some form of one of the following before it will be a suitable replacement:

  • https://github.com/open-telemetry/opentelemetry-collector/pull/8684
  • https://github.com/open-telemetry/opentelemetry-collector/pull/8788

The more components that adopt status reporting, the more useful this extension will be.

Link to tracking Issue: #26661

Testing: Units, Manual

Documentation: README

mwear avatar Jan 19 '24 03:01 mwear

Thanks for all this work @mwear!

Ultimately this extension will need some form of one of the following before it will be a suitable replacement:

Can you elaborate on why that's needed? My understanding of the current healthcheck extension is that it doesn't actually work for checking the health of pipelines (see issue: https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/11780 and the readme warning: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/extension/healthcheckextension#health-check). Is the implementation in this PR not already suitable as a replacement in that case?

I'm asking to better understand whether the PRs listed are necessary before moving this work forward.

codeboten avatar Jan 19 '24 17:01 codeboten

As you point out, there are currently some issues with the accuracy of the current health check extension. Beyond this, we now have component status reporting as a collector feature, which allows us to report the status of individual components. The overall collector health and pipeline health can be derived by looking at the health of the underlying components. All components have a basic level of automated status reporting, but there are nuanced failure and recovery cases that need to be handled on a case-by-case basis. That is, some components will need to do some level of manual status reporting for their health to be observable.

The two PRs I mentioned add manual status reporting to the core exporters based on the error conditions they can encounter. They attempt to identify RecoverableError conditions, when and if they have recovered, and PermanentError conditions that will require human intervention to fix. Without this information, the health check extension will be able to tell you very little about the health of your exporters. You will be able to know whether they are running, and that's all.

tl;dr: we need to add status reporting to the components we care about for purposes of collector health. Exporters are the highest priority, then receivers and processors.
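To make the aggregation idea concrete, here is a toy sketch of how overall or pipeline health can be derived as the worst status reported by the underlying components. The types and names are hypothetical, illustrating the concept rather than the extension's actual code:

package main

import "fmt"

// Status models component health in increasing order of severity.
// The names mirror the concepts above (OK, recoverable error, permanent error);
// they are illustrative, not the collector's actual API.
type Status int

const (
	StatusOK Status = iota
	StatusRecoverableError
	StatusPermanentError
)

func (s Status) String() string {
	return [...]string{"StatusOK", "StatusRecoverableError", "StatusPermanentError"}[s]
}

// aggregate returns the worst status reported by a pipeline's components.
func aggregate(statuses []Status) Status {
	worst := StatusOK
	for _, s := range statuses {
		if s > worst {
			worst = s
		}
	}
	return worst
}

func main() {
	pipeline := []Status{StatusOK, StatusRecoverableError, StatusOK}
	fmt.Println(aggregate(pipeline)) // StatusRecoverableError
}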

mwear avatar Jan 19 '24 18:01 mwear

You will be able to know whether they are running, and that's all.

Fair enough, though I would say at this stage, it's more than the current healthcheck extension's capabilities

codeboten avatar Jan 19 '24 22:01 codeboten

You will be able to know whether they are running, and that's all.

Fair enough, though I would say at this stage, it's more than the current healthcheck extension's capabilities

I agree. I suppose we can classify those PRs as nice to have in the short term and must have in the longer term.

mwear avatar Jan 19 '24 22:01 mwear

@mwear I've tried your healthcheck-v2 code to build my own OTel Collector (v0.92.0); the build is failing with the following errors:

280.3 Error: failed to compile the OpenTelemetry Collector distribution: go subcommand failed with args '[build -trimpath -o otelcol -ldflags=-s -w]': exit status 1. Output:
280.3 # go.opentelemetry.io/collector/exporter/exporterhelper
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/exporterhelper/obsexporter.go:65:55: undefined: obsmetrics.TagKeyExporter
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/exporterhelper/obsexporter.go:231:28: undefined: obsmetrics.ExporterSentSpans
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/exporterhelper/obsexporter.go:232:30: undefined: obsmetrics.ExporterFailedToSendSpans
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/exporterhelper/obsexporter.go:234:28: undefined: obsmetrics.ExporterSentMetricPoints
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/exporterhelper/obsexporter.go:235:30: undefined: obsmetrics.ExporterFailedToSendMetricPoints
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/exporterhelper/obsexporter.go:237:28: undefined: obsmetrics.ExporterSentLogRecords
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/exporterhelper/obsexporter.go:238:30: undefined: obsmetrics.ExporterFailedToSendLogRecords
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/exporterhelper/obsexporter.go:289:30: undefined: obsmetrics.ExporterFailedToEnqueueSpans
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/exporterhelper/obsexporter.go:291:30: undefined: obsmetrics.ExporterFailedToEnqueueMetricPoints
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/exporterhelper/obsexporter.go:293:30: undefined: obsmetrics.ExporterFailedToEnqueueLogRecords
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/exporterhelper/obsexporter.go:293:30: too many errors
280.3 # go.opentelemetry.io/collector/processor/processorhelper
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/processorhelper/obsreport.go:77:58: undefined: obsmetrics.TagKeyProcessor
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/processorhelper/obsreport.go:192:32: undefined: obsmetrics.ProcessorAcceptedSpans
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/processorhelper/obsreport.go:193:31: undefined: obsmetrics.ProcessorRefusedSpans
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/processorhelper/obsreport.go:194:31: undefined: obsmetrics.ProcessorDroppedSpans
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/processorhelper/obsreport.go:196:32: undefined: obsmetrics.ProcessorAcceptedMetricPoints
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/processorhelper/obsreport.go:197:31: undefined: obsmetrics.ProcessorRefusedMetricPoints
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/processorhelper/obsreport.go:198:31: undefined: obsmetrics.ProcessorDroppedMetricPoints
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/processorhelper/obsreport.go:200:32: undefined: obsmetrics.ProcessorAcceptedLogRecords
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/processorhelper/obsreport.go:201:31: undefined: obsmetrics.ProcessorRefusedLogRecords
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/processorhelper/obsreport.go:202:31: undefined: obsmetrics.ProcessorDroppedLogRecords
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/processorhelper/obsreport.go:202:31: too many errors
280.3 # go.opentelemetry.io/collector/receiver/receiverhelper
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/receiverhelper/obsreport.go:76:26: undefined: obsmetrics.TagKeyReceiver
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/receiverhelper/obsreport.go:77:26: undefined: obsmetrics.TagKeyTransport
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/receiverhelper/obsreport.go:306:32: undefined: obsmetrics.ReceiverAcceptedSpans
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/receiverhelper/obsreport.go:307:31: undefined: obsmetrics.ReceiverRefusedSpans
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/receiverhelper/obsreport.go:309:32: undefined: obsmetrics.ReceiverAcceptedMetricPoints
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/receiverhelper/obsreport.go:310:31: undefined: obsmetrics.ReceiverRefusedMetricPoints
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/receiverhelper/obsreport.go:312:32: undefined: obsmetrics.ReceiverAcceptedLogRecords
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/receiverhelper/obsreport.go:313:31: undefined: obsmetrics.ReceiverRefusedLogRecords
280.3 
------
failed to solve: process "/bin/sh -c builder --config builder-config.yaml" did not complete successfully: exit code: 1

Here is my build-config.yaml:

dist:
  name: otelcol
  description: Custom OTel Collector distribution
  output_path: ./otelcol-dev
  otelcol_version: 0.92.0

extensions:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/extension/healthcheckextensionv2 v0.92.0

exporters:
  - gomod: go.opentelemetry.io/collector/exporter/otlphttpexporter v0.92.0

  - gomod: go.opentelemetry.io/collector/exporter/otlpexporter v0.92.0

  - gomod: go.opentelemetry.io/collector/exporter/debugexporter v0.92.0

  - gomod: go.opentelemetry.io/collector/exporter/loggingexporter v0.92.0

  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/fileexporter v0.92.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/splunkhecexporter v0.92.0

processors:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/attributesprocessor v0.92.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/resourcedetectionprocessor v0.92.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/filterprocessor v0.92.0
  - gomod: go.opentelemetry.io/collector/processor/batchprocessor v0.92.0

receivers:
  - gomod: go.opentelemetry.io/collector/receiver/otlpreceiver v0.92.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/filelogreceiver v0.92.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver v0.92.0

replaces:
  - github.com/open-telemetry/opentelemetry-collector-contrib/extension/healthcheckextensionv2 => ../healthcheckextensionv2

I am using the builder version below: go.opentelemetry.io/collector/cmd/[email protected]

help..

vynu avatar Jan 24 '24 18:01 vynu

@vynu just updated your comment to make the error messages more readable :)

codeboten avatar Jan 24 '24 22:01 codeboten

Thanks @codeboten, I've been trying that, but for some reason I'm seeing an error while editing.

vynu avatar Jan 24 '24 22:01 vynu

@vynu use v0.93.0 for the otelcol_version and other collector dependencies.
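For example, the version-sensitive parts of the builder config above would become something like the following (a sketch; keep your own module list and the local replaces entry for the extension):

dist:
  name: otelcol
  description: Custom OTel Collector distribution
  output_path: ./otelcol-dev
  otelcol_version: 0.93.0

exporters:
  - gomod: go.opentelemetry.io/collector/exporter/otlpexporter v0.93.0
  # ...and likewise bump every other go.opentelemetry.io/collector/* and
  # opentelemetry-collector-contrib/* gomod entry from v0.92.0 to v0.93.0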

mwear avatar Jan 24 '24 22:01 mwear

Thanks for the suggestion @mwear. I've successfully built and tested it on my laptop and also in AWS Fargate; it worked just fine.

One interesting error I see in the CloudWatch logs:

loopyWriter exiting with error: transport closed by client\t{\"grpc_log\": true}

This happens whenever the ALB performs health check calls on the tasks. I know ALB target group gRPC health checks are a little weird; any suggestions on this?

vynu avatar Jan 25 '24 10:01 vynu

@mwear, is this ready for review? There were some discussions during yesterday's SIG call, but I'm not sure if the state of the PR reflects what was discussed yesterday. If it does, let me know, and I'll review this PR.

jpkrohling avatar Jan 25 '24 14:01 jpkrohling

My takeaway from the SIG meeting is that folks would like pipeline health monitoring to be optional. I plan to update this PR to make it optional, but right now, it's the only option. If you want to review this PR with that in mind, feel free. I will let everyone know when the additional functionality is implemented.

mwear avatar Jan 25 '24 18:01 mwear

Thanks for the suggestion @mwear. I've successfully built and tested it on my laptop and also in AWS Fargate; it worked just fine.

One interesting error I see in the CloudWatch logs:

loopyWriter exiting with error: transport closed by client\t{\"grpc_log\": true}

This happens whenever the ALB performs health check calls on the tasks. I know ALB target group gRPC health checks are a little weird; any suggestions on this?

@vynu, I can look into this. Do you happen to know if ALB uses the Check or Watch RPC?

mwear avatar Jan 25 '24 22:01 mwear

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Feb 13 '24 05:02 github-actions[bot]

Hey all, I really need to use the health check extension in a K8s environment but can't use it due to #11780, and I'm wondering when healthcheckv2 will be available with the OTel Collector.

pratik2294 avatar Feb 20 '24 11:02 pratik2294

@pratik2294, I am actively working on this. Below is a summary of what needs to be done. The first point is pretty close to being ready.

  • Make opting in to component health optional
    • This is analogous to the check_collector_pipeline config in the current extension
    • It will be possible to opt in to recoverable errors only, permanent errors only, or both (see the config sketch after this list)
  • Rather than introducing this as v2, there is a desire to replace the current extension with minimal friction; this is doable, but the details need to be decided on.
  • In order to monitor exporter failures, some version of the following will need to merge:
    • https://github.com/open-telemetry/opentelemetry-collector/pull/8684
    • https://github.com/open-telemetry/opentelemetry-collector/pull/8788
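A configuration sketch of what the opt-in could look like (the key names here are illustrative assumptions, not final; check the README once this lands):

extensions:
  healthcheckv2:
    component_health:
      include_recoverable_errors: true   # count RecoverableError events against health (illustrative key)
      include_permanent_errors: true     # count PermanentError events against health (illustrative key)
      recovery_duration: 5m              # how long a recoverable error may persist before the component is considered unhealthy (illustrative key)
    http:
      endpoint: "0.0.0.0:13133"
      status:
        enabled: true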

mwear avatar Feb 20 '24 22:02 mwear

This PR has been updated and the latest round of improvements includes:

  • Support for v1 config and responses
    • This extension should be a seamless replacement for the existing one
    • We will need to figure out merge / rename or other strategy for integration
    • The check_collector_pipeline config remains for backwards compatibility, but there is no functionality backing it. Users are encouraged to use v2 for similar functionality.
  • By default error status events are not factored into the health check
    • An aggregate status is always available in the response body, but not reflected in the response code unless explicitly enabled
    • Users can opt in to recoverable errors, permanent errors, or both (see the example below).
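For example (hypothetical output, with the response shape based on the status responses shown elsewhere in this thread), a collector with a failing exporter and error statuses not opted in would still return 200 while the body carries the aggregate status:

$ curl -i http://0.0.0.0:13133/status
HTTP/1.1 200 OK
Content-Type: application/json

{"start_time":"...","status":"StatusRecoverableError","status_time":"..."}

With the recoverable/permanent error opt-in enabled, the same aggregate status would instead be reflected in a non-2xx response code.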

mwear avatar Mar 02 '24 01:03 mwear

This PR LGTM and can be merged as is, nothing I commented before is blocking and the accepted suggestions can be made in follow-up PRs.

jpkrohling avatar Mar 06 '24 11:03 jpkrohling

@mwear When can this PR be merged?

valageri avatar Mar 17 '24 03:03 valageri

Thanks @mwear for the extensive work to get us here. Now that all checks have passed and the code is approved, could the branch be merged? @SurenNihalani @codeboten @yurishkuro

gjshao44 avatar Mar 26 '24 15:03 gjshao44

We discussed next steps for this PR at the collector SIG on Wednesday. We know we want to replace the existing extension with this updated version, but there was a question as to whether we do that directly or in steps. We decided to do it in steps: we'll integrate this as healthcheckv2, and when we're confident in it, we will replace the existing extension and delete healthcheckv2. healthcheckv2 will be in development stability for its lifetime, and interested parties can use the collector builder to test it out.

The next point of discussion was the size of this PR. It needs more approvals before it can merge. If nobody is willing to review it in its current state, I will have to split it into smaller pieces to facilitate the review process. I'd like to know if there is any interest in reviewing this as-is before heading down that path.

mwear avatar Mar 29 '24 18:03 mwear

I've attempted to test this in production via https://github.com/scorecard-ai/scorecard-otel-collector, (which uses a fork of your fork, here: https://github.com/scorecard-ai/opentelemetry-collector-contrib/tree/healthcheck-v2).

Here are a few things I noticed.

  1. It appears that status_time is ending up as a fixed value. I originally expected to see uptime from the original health_check.
[ec2-user@ip-172-31-NN-NNN ~]$ curl http://0.0.0.0:13133
{"start_time":"2024-04-03T11:42:45.211686482Z","healthy":true,"status":"StatusOK","status_time":"2024-04-03T11:42:45.572109156Z"}
[ec2-user@ip-172-31-NN-NNN ~]$ curl http://0.0.0.0:13133
{"start_time":"2024-04-03T11:42:45.211686482Z","healthy":true,"status":"StatusOK","status_time":"2024-04-03T11:42:45.572109156Z"}
[ec2-user@ip-172-31-NN-NNN ~]$ curl http://0.0.0.0:13133
{"start_time":"2024-04-03T11:42:45.211686482Z","healthy":true,"status":"StatusOK","status_time":"2024-04-03T11:42:45.572109156Z"}
[ec2-user@ip-172-31-NN-NNN ~]$ curl http://0.0.0.0:13133
{"start_time":"2024-04-03T11:42:45.211686482Z","healthy":true,"status":"StatusOK","status_time":"2024-04-03T11:42:45.572109156Z"}
[ec2-user@ip-172-31-NN-NNN ~]$ curl http://0.0.0.0:13133
  2. ~Not specifying path: "/" for http results in the server sending 404s for root, not the status.~ It's at /status.

  3. gRPC calls are failing, but this could be my fault since I don't know what I'm supposed to be doing. (No real documentation for configuring this.)

[ec2-user@ip-172-31-NN-NNN ~]$ grpcurl -vv -plaintext 0.0.0.0:13132 grpc.health.v1.Health/Check 
Error invoking method "grpc.health.v1.Health/Check": failed to query for service descriptor "grpc.health.v1.Health": server does not support the reflection API
  4. http.status.detailed: false throws a marshaling error.

My config is this:

extensions:
  healthcheckv2:
    use_v2: true
    grpc:
      endpoint: "0.0.0.0:13132"
      transport: "tcp"
    http:
      endpoint: "0.0.0.0:13133"
      status:
        enabled: true
      config:
        enabled: false

(I can share access to this server if needed to repro, drop me an email.)

nathanhammond avatar Apr 03 '24 12:04 nathanhammond

@nathanhammond, thank you so much for trying this out and providing feedback. I have answers for most of your questions below.

It appears that status_time is ending up as a fixed value. I originally expected to see uptime from the original health_check.

status_time is the time of the underlying status event generated by calls to the component status API. In general, this is not a fixed time, but right now status is only being reported during startup and shutdown. The value you are seeing is the time of the StatusOK event during startup. In the future, components will report runtime status and the timestamp will update on changes in status. Uptime can be computed by subtracting start_time from now.
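For example, a rough way to derive uptime from the status response (assuming jq and GNU date are available):

start=$(curl -s http://0.0.0.0:13133/status | jq -r .start_time)
echo "uptime_seconds: $(( $(date +%s) - $(date -d "$start" +%s) ))"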

GRPC calls are failing, but this could be my fault where I don't know what I'm supposed to be doing. (No real documentation for configuring this.)

Reflection is not enabled on the gRPC server, so you will need to point grpcurl at the protos. You can clone https://github.com/grpc/grpc and then use:

grpcurl -import-path grpc/src/proto/grpc/health/v1 -proto health.proto -plaintext -d '{"service":""}' localhost:13132 grpc.health.v1.Health.Check
or
grpcurl -import-path grpc/src/proto/grpc/health/v1 -proto health.proto -plaintext -d '{"service":""}' localhost:13132 grpc.health.v1.Health.Watch

http.status.detailed: false throws a marshaling error.

This was a bug in the README. I've fixed it; thanks for pointing it out. For this behavior you can pass verbose as a query parameter to the HTTP service, for example: http://0.0.0.0:13133/status?verbose.

mwear avatar Apr 03 '24 21:04 mwear

In my current use case I am trying to enable health checks from AWS's Application Load Balancer, and (in spite of it being configured "correctly" to the best of my knowledge) I can't get either the HTTP or gRPC health check to respond in a way that keeps the load balancer happy enough to leave the service in rotation.

I'm going to put more tracing in place to figure out why it's not working, but I suspect the gRPC request from the ALB is not going to match Google's pattern, so we'll see how that goes.

(that might become a feature request, but I can always add it to my fork)

nathanhammond avatar Apr 04 '24 14:04 nathanhammond

I'd be interested to know what you find and definitely happy to take feature requests to address in follow up PRs. This PR has been a victim of scope creep and is likely going to have to be broken into smaller PRs to facilitate reviews.

mwear avatar Apr 04 '24 16:04 mwear

@vynu were you able to get this working in AWS when you tested it? If so, do you have any suggestions for @nathanhammond?

mwear avatar Apr 04 '24 18:04 mwear

The update here is I have HTTP working, but gRPC doesn't respond to the AWS ALB request. I'll prep that in my fork so we can at least see what it looks like.

nathanhammond avatar Apr 05 '24 06:04 nathanhammond

Last update: I now have this extension working behind an AWS ALB and correctly responding to both HTTP and gRPC healthchecks. Review: 10/10. Will be shipped to (low-traffic) production on Monday. (Already "in prod", just not in use.)


import * as elb from 'aws-cdk-lib/aws-elasticloadbalancingv2';

// GRPC requests.
const grpcListener = loadBalancer.addListener('otlp-grpc', {
  port: 4317,
  protocol: elb.ApplicationProtocol.HTTPS,
  open: true
});
grpcListener.addCertificates(`otlp-grpc-certificate`, [certificate]);

grpcListener.addTargets('otlp-grpc-target', {
  port: 4317,
  protocol: elb.ApplicationProtocol.HTTP,
  protocolVersion: elb.ApplicationProtocolVersion.GRPC,
  targets: [autoScalingGroup],
  healthCheck: {
    protocol: elb.Protocol.HTTP,
    port: '13132',
    path: '/grpc.health.v1.Health/Check',
    healthyGrpcCodes: '0'
  }
});

// HTTP requests.
const httpsListener = loadBalancer.addListener('otlp-http', {
  port: 4318,
  protocol: elb.ApplicationProtocol.HTTPS,
  open: true
});
httpsListener.addCertificates(`otlp-http-certificate`, [certificate]);

httpsListener.addTargets('otlp-http-target', {
  port: 4318,
  protocol: elb.ApplicationProtocol.HTTP,
  protocolVersion: elb.ApplicationProtocolVersion.HTTP1,
  targets: [autoScalingGroup],
  healthCheck: {
    protocol: elb.Protocol.HTTP,
    port: '13133',
    path: '/status',
    healthyHttpCodes: '200'
  }
});

And the corresponding collector config:

extensions:
  healthcheckv2:
    use_v2: true
    grpc:
      endpoint: "0.0.0.0:13132"
      transport: "tcp"
    http:
      endpoint: "0.0.0.0:13133"
      status:
        enabled: true
      config:
        enabled: false

nathanhammond avatar Apr 05 '24 08:04 nathanhammond

Hi, I am looking forward to this improvement.

Do you have an ETA for when it will be released? Thanks!

klyalldr avatar Apr 23 '24 13:04 klyalldr

@klyalldr, I will bring this up at the collector SIG tomorrow. I've been asking for reviews from approvers, but there is resistance due to the size of this PR. I have been asked to break it up. The first chunk is here: #32523.

mwear avatar Apr 23 '24 15:04 mwear

@mwear, thanks for the tremendous work so far, looking forward to the new chunks getting sorted out

gjshao44 avatar Apr 25 '24 03:04 gjshao44