opentelemetry-collector
Health checks against the gRPC ports cause annoying log entries
Describe the bug
After upgrading from 0.37.0 to 0.42.0, the collector logs are spammed with the message below over and over.
2022-01-25T16:40:11.352Z warn zapgrpc/zapgrpc.go:191 [core] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: read tcp 192.168.2.184:14250->192.168.1.94:2793: read: connection reset by peer" {"grpc_log": true}
Steps to reproduce ~~OTel Collector exporting OTLP to another OTel Collector receiving OTLP with TLS insecure enabled.~~ Place the collector behind a load balancer that performs periodic health checks, such as AKS's.
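For illustration, a minimal sketch of the kind of TCP-level check that triggers the message. It uses a hypothetical Kubernetes readinessProbe aimed at a receiver's gRPC port as a stand-in for the Azure load balancer's TCP probe (none of this is from the original report): the prober opens a connection and closes it without ever sending the HTTP/2 preface, which is exactly what the gRPC server complains about. Depending on how the prober closes the connection, the error text may read EOF instead of connection reset by peer.

    # Hypothetical pod spec fragment (illustrative, not the reporter's deployment):
    # a plain TCP probe against a gRPC receiver port stands in for the AKS load
    # balancer's health probe. Each check opens a TCP connection and closes it
    # without speaking HTTP/2, so the gRPC server logs a "failed to receive the
    # preface" warning on every probe.
    readinessProbe:
      tcpSocket:
        port: 4317        # OTLP gRPC receiver port (14250 for the Jaeger gRPC receiver)
      periodSeconds: 10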
What did you expect to see? Not a bunch of warn log messages over and over again.
What did you see instead?
All the initialization up to "Everything is ready. Begin running and processing data." was fine, with only info messages, until the warn messages started.
What version did you use? Version: v0.42.0
What config did you use?
config:
  exporters:
    otlp:
      endpoint: collector:4317
      tls:
        insecure: true
  extensions:
    health_check: {}
    memory_ballast: {}
  processors:
    attributes:
      actions:
        - key: deployment.environment
          value: dev
          action: upsert
        - key: env
          value: dev
          action: upsert
    resourcedetection:
      detectors: [env, azure]
      timeout: 5s
      override: false
    k8sattributes:
      passthrough: true
      extract:
        metadata:
          # extract the following well-known metadata fields from pods and namespaces
          - k8s.pod.name
          - k8s.pod.uid
          - k8s.deployment.name
          - k8s.cluster.name
          - k8s.namespace.name
          - k8s.node.name
          - k8s.pod.start_time
    batch: {}
    memory_limiter: null
  receivers:
    jaeger:
      protocols:
        grpc:
          endpoint: 0.0.0.0:14250
        thrift_http:
          endpoint: 0.0.0.0:14268
        thrift_compact:
          endpoint: 0.0.0.0:6831
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
    prometheus:
      config:
        scrape_configs:
          - job_name: opentelemetry-collector
            scrape_interval: 10s
            static_configs:
              - targets: ["${MY_POD_IP}:8888"]
                labels:
                  env: dev
    zipkin:
      endpoint: 0.0.0.0:9411
  service:
    extensions:
      - health_check
      - memory_ballast
    pipelines:
      logs:
        exporters:
          - otlp
        processors:
          - memory_limiter
          - batch
        receivers:
          - otlp
      metrics:
        exporters:
          - otlp
        processors:
          - memory_limiter
          - batch
        receivers:
          - otlp
          - prometheus
      traces:
        exporters:
          - otlp
        processors:
          - memory_limiter
          - batch
          - resourcedetection
          - attributes
          - k8sattributes
        receivers:
          - otlp
          - jaeger
          - zipkin
Environment: Running in k8s.
Additional context
Using an older version of the otel collector (v0.40.0) exporting to the newer otel collector (0.42.0) receiving OTLP does not show this behavior. This makes me believe it has to do with exporting OTLP to another collector on 0.42.0.
@bogdandrutu, @tigrannajaryan, is this a supported case? Older collector sending data to a newer? I would think so, but perhaps you know of something that would break this.
@jpkrohling the issue is actually newer to newer. Sorry for the confusion.
It works well from 0.37.0 and 0.40.0 to 0.42.0, but not from 0.42.0 to 0.42.0 over OTLP.
That should definitely work.
It should work with older->newer as well. Perhaps we broke something in 0.42.
I'm trying to get this to work (by "work" I mean duplicate the warn) in isolation at the moment. It could actually be something from contrib and not related to OTLP at all. It's hard to tell from the current warn message.
Could you please confirm you are experiencing an actual issue? Is this only about the warning message? This doesn't seem related to OTLP at all, but rather to the Jaeger receiver trying to get the sampling strategies from localhost:14250:
read tcp 192.168.2.184:14250->192.168.1.94:2793
Port 14250 is Jaeger's. I'm digging into this now to see why it is issuing a warning.
@jpkrohling You're correct, I jumped the gun. This is actually coming from the Jaeger receiver and has nothing to do with OTLP.
We can close this, as it relates to https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/7410.
Just to confirm: you are not experiencing problems with data arriving via OTLP, right?
I'm closing, but do reopen if you experience any odd behavior with OTLP.
Hi,
I've got it for my OTLP receiver. My logs are spammed every 30s with this:
otel-collector | 2022-02-18T09:08:05.885Z warn zapgrpc/zapgrpc.go:191 [core] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: read tcp collector_ip_adress_1:4317->ip_adress_2:19993: read: connection reset by peer" {"grpc_log": true}
Upgraded from 0.36 to 0.44. I did not have these warnings before 0.44.
My OTLP receiver endpoint is exposed through a TCP VIP. If I disable the VIP, the warnings disappear. I think it happens when the VIP's health check tests whether the collector is still running.
My config:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
exporters:
  logging:
    logLevel: info
    sampling_initial: 1
    sampling_thereafter: 1000
  elastic/otr:
    apm_server_url: http://apm-server:8200
processors:
  batch:
  memory_limiter:
    # ballast_size_mib: 500
    check_interval: 5s
    limit_percentage: 80
    spike_limit_percentage: 25
  resource:
    attributes:
      # until regexp supported remove unnecessary attributes (possibly sensitive data)
      - key: process.command_line
        action: delete
  probabilistic_sampler:
    sampling_percentage: $PROCESSOR_PROBABILISTIC_SAMPLING_PERCENTAGE
extensions:
  health_check:
  zpages:
    endpoint: :55679
service:
  extensions: [zpages, health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, probabilistic_sampler, batch, resource]
      exporters: [logging, elastic/otr]
> I think it happens when the VIP's health check tests whether the collector is still running.

This is quite possible. Why do you have a health check happening on the OTLP gRPC port and not on the health check port?
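For reference, the intended setup points the probe at the health_check extension rather than at a gRPC receiver port. A minimal sketch, assuming the extension's default HTTP endpoint on port 13133; the probe values are illustrative, not taken from this thread:

    # Collector side: the health_check extension serves a plain HTTP endpoint
    # (0.0.0.0:13133 by default) intended exactly for this kind of probing.
    extensions:
      health_check: {}
    service:
      extensions: [health_check]
    ---
    # Kubernetes side (illustrative): probe the extension over HTTP instead of
    # opening raw TCP connections against the OTLP/Jaeger gRPC ports.
    readinessProbe:
      httpGet:
        path: /
        port: 13133
      periodSeconds: 10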
My mistake! I didn't check before and assumed it was the same as for an HTTP VIP. I don't have (and can't have) a health check on the TCP VIP. But if I disable the VIP, the warnings disappear.
I can confirm it is indeed the k8s service load balancer. This was installed using the OTel Helm charts (modified a little bit, but nothing too crazy in the service/endpoint config).
@jpkrohling so currently there is no way to disable health checks in AKS (Azure's managed K8s). Maybe the setting that was toggled on for the gRPC logging could become a config setting, so users can choose to turn it off? A possible interim workaround is sketched below.
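One blunt interim workaround, for anyone who cannot change the load balancer's behavior, is to raise the collector's own log level so that warn-level entries are not emitted at all. A sketch, assuming the service::telemetry::logs setting available in recent collector versions; note this hides every warning, not just the gRPC ones:

    service:
      telemetry:
        logs:
          level: error  # drops warn-level entries, including the gRPC preface warnings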
@jpkrohling any chance for the above to be possible?
Sorry for the delay, I haven't had the chance yet to look into this. I've added it to my queue and will look soon (although it is unlikely to be this week).
@jpkrohling sounds good, no worries. Also, there is no way to make it send HTTP requests to the health check endpoint instead of probing the gRPC ports; this is a limitation of AKS and the load balancer used for LoadBalancer-type services.
https://github.com/Azure/AKS/issues/1394
Given the latest info here, I'm reopening this issue and changing the original description (and title).
I won't be able to look at this one in detail for the next two months. I'm therefore unassigning this. This is now up for grabs. If nothing happens in two months, I might be able to take a look again.
I wasn't able to find a good solution on our side, so I opened a PR to change gRPC. It might take a while for this change to land here, but it eventually will.
@jpkrohling the change landed in the 1.49 release of grpc-go a month ago. Shall we close this issue?
Closing. If the messages are still being shown, please leave a comment here and I'll reopen this.