Health checks against the gRPC ports cause annoying log entries

Open mplachter opened this issue 3 years ago • 20 comments

Describe the bug

When upgrading from 0.37.0 to 0.42.0, I'm getting spammed with the log message below over and over.

2022-01-25T16:40:11.352Z warn zapgrpc/zapgrpc.go:191 [core] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: read tcp 192.168.2.184:14250->192.168.1.94:2793: read: connection reset by peer" {"grpc_log": true}

Steps to reproduce ~~Otel Collector exporting OTLP to another Otel Collector receiving OTLP with TLS insecure enabled.~~ Place the collector behind a load balancer doing periodic health checks, such as AKS's.
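For reference, a minimal sketch of a probe that reproduces the warning: a plain TCP check opens a connection to the gRPC port and closes it without ever sending an HTTP/2 preface, which is exactly what the server then complains about. The probe below is illustrative only; the port and interval are assumptions, not taken from this report.

# Hypothetical Kubernetes liveness probe mimicking a TCP load-balancer health check:
# it connects to the OTLP gRPC port and immediately closes the connection.
livenessProbe:
  tcpSocket:
    port: 4317        # OTLP gRPC receiver port
  periodSeconds: 30   # each probe produces one "failed to receive the preface" warning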

What did you expect to see? Not a bunch of warn log messages over and over again.

What did you see instead? All the initialization up to "Everything is ready. Begin running and processing data." was fine, with only info messages, until the warn messages started.

What version did you use? Version: v0.42.0

What config did you use?

config:
  exporters:
    otlp:
      endpoint: collector:4317
      tls:
        insecure: true
  extensions:
    health_check: {}
    memory_ballast: {}
  processors:
    attributes:
      actions:
        - key: deployment.environment
          value: dev
          action: upsert
        - key: env
          value: dev
          action: upsert
    resourcedetection:
      detectors: [env, azure]
      timeout: 5s
      override: false
    k8sattributes:
      passthrough: true
      extract:
        metadata:
          # extract the following well-known metadata fields from pods and namespaces
          - k8s.pod.name
          - k8s.pod.uid
          - k8s.deployment.name
          - k8s.cluster.name
          - k8s.namespace.name
          - k8s.node.name
          - k8s.pod.start_time
    batch: {}
    memory_limiter: null
  receivers:
    jaeger:
      protocols:
        grpc:
          endpoint: 0.0.0.0:14250
        thrift_http:
          endpoint: 0.0.0.0:14268
        thrift_compact:
          endpoint: 0.0.0.0:6831
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
    prometheus:
      config:
        scrape_configs:
          - job_name: opentelemetry-collector
            scrape_interval: 10s
            static_configs:
              - targets: ["${MY_POD_IP}:8888"]
                labels:
                  env: dev
    zipkin:
      endpoint: 0.0.0.0:9411
  service:
    extensions:
      - health_check
      - memory_ballast
    pipelines:
      logs:
        exporters:
          - otlp
        processors:
          - memory_limiter
          - batch
        receivers:
          - otlp
      metrics:
        exporters:
          - otlp
        processors:
          - memory_limiter
          - batch
        receivers:
          - otlp
          - prometheus
      traces:
        exporters:
          - otlp
        processors:
          - memory_limiter
          - batch
          - resourcedetection
          - attributes
          - k8sattributes
        receivers:
          - otlp
          - jaeger
          - zipkin

Environment Running in k8s.

Additional context Using an older version of the otel collector (v0.40.0) exporting to the newer otel collector (0.42.0) receiving OTLP does not show this behavior. This makes me believe it has to do with exporting OTLP to another collector on 0.42.0.

mplachter avatar Jan 25 '22 17:01 mplachter

@bogdandrutu, @tigrannajaryan, is this a supported case? Older collector sending data to a newer? I would think so, but perhaps you know of something that would break this.

jpkrohling avatar Jan 25 '22 18:01 jpkrohling

@jpkrohling the issue is actually newer to newer. Sorry for the confusion.

It works well from 0.37.0 and 0.40.0 to 0.42.0 but not 0.42.0 to 0.42.0 over otlp.

mplachter avatar Jan 25 '22 18:01 mplachter

That should definitely work.

jpkrohling avatar Jan 25 '22 18:01 jpkrohling

It should work with older->newer as well. Perhaps we broke something in 0.42.

tigrannajaryan avatar Jan 25 '22 19:01 tigrannajaryan

I'm trying to reproduce the warning in isolation at the moment.

It could actually be something from contrib and not related to OTLP at all. It's hard to tell from the current warn message.

mplachter avatar Jan 25 '22 20:01 mplachter

Could you please confirm you are experiencing an actual issue? Is this only about the warning message? This doesn't seem related to OTLP at all, but rather to the Jaeger receiver trying to get the sampling strategies from localhost:14250:

read tcp 192.168.2.184:14250->192.168.1.94:2793

Port 14250 is Jaeger's. I'm digging into this to see why it now issues a warning.

jpkrohling avatar Jan 27 '22 20:01 jpkrohling

@jpkrohling You're correct, I jumped the gun. This is actually from the Jaeger receiver and has nothing to do with OTLP.

We can close this, as it relates to https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/7410

mplachter avatar Jan 27 '22 20:01 mplachter

Just to confirm: you are not experiencing problems with data arriving via OTLP, right?

jpkrohling avatar Jan 28 '22 13:01 jpkrohling

I'm closing, but do reopen if you experience any odd behavior with OTLP.

jpkrohling avatar Jan 28 '22 13:01 jpkrohling

Hi,

I've got it for my OTLP receiver. My logs are spammed every 30s with this:

otel-collector | 2022-02-18T09:08:05.885Z warn zapgrpc/zapgrpc.go:191 [core] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: read tcp collector_ip_adress_1:4317->ip_adress_2:19993: read: connection reset by peer" {"grpc_log": true}

Upgraded from 0.36 to 0.44. I did not have these warnings before 0.44.

My OTLP receiver endpoint is exposed through a TCP VIP. If I disable the VIP, the warnings disappear. I think it happens when the VIP health-checks whether the collector is still running.

My config:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  logging:
    logLevel: info
    sampling_initial: 1
    sampling_thereafter: 1000

  elastic/otr:
    apm_server_url: http://apm-server:8200

processors:
  batch:
  memory_limiter:
    #    ballast_size_mib: 500
    check_interval: 5s
    limit_percentage: 80
    spike_limit_percentage: 25
  resource:
    attributes:
      # until regexp supported remove unnecessary attributes (possibly sensitive data)
      - key: process.command_line
        action: delete
  probabilistic_sampler:
    sampling_percentage: $PROCESSOR_PROBABILISTIC_SAMPLING_PERCENTAGE

extensions:
  health_check:
  zpages:
    endpoint: :55679

service:
  extensions: [zpages, health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, probabilistic_sampler, batch, resource]
      exporters: [logging, elastic/otr]

trudyq avatar Feb 18 '22 09:02 trudyq

> I think it happens when the VIP health-checks whether the collector is still running.

This is quite possible. Why do you have a health check happening on the OTLP gRPC port, and not on the health check port?

jpkrohling avatar Feb 18 '22 09:02 jpkrohling
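For context, the collector's health_check extension (already present in both configs in this thread) serves an HTTP endpoint meant for exactly this kind of probing. A minimal sketch, assuming the conventional default port; pointing an HTTP-capable probe here instead of at 4317 avoids the half-open gRPC connections that trigger the warning:

extensions:
  health_check:
    endpoint: 0.0.0.0:13133   # HTTP liveness/readiness endpoint

service:
  extensions: [health_check]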

My mistake! I didn't check before and assumed it was the same as for an HTTP VIP. I don't have (and can't have) a health check on the TCP VIP.

But if I disable the VIP, the warnings disappear.

trudyq avatar Feb 18 '22 13:02 trudyq

I can confirm it is indeed the k8s service load balancer.

This was installed using the otel Helm charts (modified a little bit, but nothing too crazy in the service/endpoint config).

mplachter avatar Feb 18 '22 15:02 mplachter

@jpkrohling so currently there is no way to disable the health checks in AKS (Azure's managed K8s).

Maybe the setting that turned on the gRPC logging could be exposed as a config option so users can choose to toggle it off?

mplachter avatar Feb 18 '22 16:02 mplachter
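One blunt possible workaround, assuming a collector version that supports the service::telemetry::logs setting: raise the collector's own log level so that warn-level entries, including the zapgrpc ones, are no longer emitted. Note that this also hides every other warning:

service:
  telemetry:
    logs:
      level: error   # suppresses warn-level output, including the grpc_log entries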

@jpkrohling any chance for the above to be possible?

mplachter avatar Feb 22 '22 20:02 mplachter

Sorry for the delay, I haven't had the chance yet to look into this. I've added it to my queue and will look soon (although unlikely to be this week).

jpkrohling avatar Feb 28 '22 16:02 jpkrohling

@jpkrohling sounds good, no worries. Also, there is no way to make it do an HTTP check against the health-check endpoint instead of probing the gRPC ports. This is a limitation of AKS and the load balancer used for services of type LoadBalancer.

https://github.com/Azure/AKS/issues/1394

mplachter avatar Feb 28 '22 18:02 mplachter
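Where the probe target can be chosen (not AKS at the time, per the issue above), exposing the health_check extension's HTTP port on the same Kubernetes Service gives the probe something other than the gRPC ports to hit. A sketch with assumed names and the conventional 13133 port:

apiVersion: v1
kind: Service
metadata:
  name: otel-collector          # hypothetical name
spec:
  type: LoadBalancer
  selector:
    app: otel-collector         # hypothetical selector
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
    - name: healthcheck
      port: 13133               # health_check extension's HTTP endpoint
      targetPort: 13133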

Given the latest info here, I'm reopening this issue and changing the original description (and title).

jpkrohling avatar Mar 02 '22 11:03 jpkrohling

I won't be able to look at this one in detail for the next two months. I'm therefore unassigning this. This is now up for grabs. If nothing happens in two months, I might be able to take a look again.

jpkrohling avatar Apr 26 '22 17:04 jpkrohling

I wasn't able to find a good solution on our side, so I opened a PR to change gRPC. It might take a while for this change to land here, but it eventually will.

jpkrohling avatar Jul 19 '22 21:07 jpkrohling

@jpkrohling the change landed in the 1.49 release of grpc-go a month ago. Shall we close this issue?

sakshi1215 avatar Oct 14 '22 05:10 sakshi1215

Closing. If the messages are still being shown, please leave a comment here and I'll reopen this.

jpkrohling avatar Oct 18 '22 19:10 jpkrohling