Need an indication of how the LoadBalancingExporter works with a static list of hosts when a host is down
Component(s)
exporter/loadbalancing
Describe the issue you're reporting
With a static configuration of hosts in the LoadBalancingExporter configuration:
```yaml
exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      static:
        hostnames:
          - host1:4317
          - host2:4317
```
If host2's Collector is stopped, it seems like all spans that would normally be load-balanced to it are simply dropped instead of being re-routed to host1. If that is the case, the README.md should make this clear.
Pinging code owners:
- exporter/loadbalancing: @jpkrohling
Same question for the DNS and k8s resolvers, to be fair. It feels like the k8s one will cope, but I'm unsure. Would the DNS one need the "A" record to be changed if a host was down?
This was discussed via Slack, here: https://cloud-native.slack.com/archives/C01N6P7KR6W/p1707838013838759
Here's a long version of what happens behind the scenes; I'd appreciate it if this could be summarized and added to the README:
The load balancer exporter will create one exporter per endpoint, no matter the resolver (static, k8s, DNS). These exporters can be fine-tuned with options related to the sending queue and retry mechanisms. This means that if a network hiccup occurs and a data point cannot be delivered, the exporter will attempt to deliver it again periodically and might eventually fail. The load-balancing exporter will NOT attempt to re-route to a healthy endpoint.
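For example, the sending queue and retry behavior of those per-endpoint exporters can be tuned under `protocol.otlp`, which accepts the standard OTLP exporter settings. A minimal sketch (the values are illustrative, not recommendations):

```yaml
exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true
        # Standard exporterhelper settings, applied to every
        # per-endpoint OTLP exporter the load balancer creates.
        retry_on_failure:
          enabled: true
          initial_interval: 5s
          max_interval: 30s
          max_elapsed_time: 60s  # fail faster than the 300s default
        sending_queue:
          enabled: true
          queue_size: 1000
    resolver:
      static:
        hostnames:
          - host1:4317
          - host2:4317
```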
Concretely:
- if a host from the static list is down, all telemetry destined for it will fail to be delivered
- if a scaling event happens and an endpoint is removed, in-flight data destined for that endpoint will likely be retried until it eventually fails. For highly elastic environments, it's therefore a good idea to tweak the sending queue and retry mechanisms, perhaps even disabling them altogether
- when using k8s, DNS, and likely other future resolvers (AWS Cloud Map is close to being added), topology changes are eventually reflected in the load-balancing exporter. Some resolvers pick up changes quicker than others (k8s is quicker than DNS), but there is still a window of time during which the topology has changed and the load balancer hasn't been updated yet; see the DNS sketch below
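To illustrate that window for the DNS resolver: it re-resolves the configured hostname on an interval, so a removed or down backend keeps receiving (and failing) its share of the traffic until the DNS record changes and the next resolution cycle picks that change up. A sketch, with a placeholder hostname:

```yaml
exporters:
  loadbalancing:
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        # Placeholder: a name whose A/AAAA records list the backends.
        hostname: collectors.example.com
        port: 4317
        interval: 5s  # how often the record is re-resolved
        timeout: 1s
```

Note that shortening the interval only helps if the DNS record itself is updated; a host that is down but still listed in the record will continue to receive its share of spans.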
Removing needs triage, as the question has been answered and the code owner has approved putting this in the README.
@alexchowle Would you be interested in posting a PR with README updates that explain this functionality?
It's already been merged in.
Thank you! Sorry, missed the PR: #31271