opentelemetry-collector-contrib

Need an indication on how the LoadBalancingExporter works with a static list of hosts when a host is down

Open alexchowle opened this issue 1 year ago • 2 comments

Component(s)

exporter/loadbalancing

Describe the issue you're reporting

With a static configuration of hosts in the LoadBalancingExporter configuration:

exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      static:
        hostnames:
          - host1:4317
          - host2:4317

If the Collector on host2 is stopped, it seems that all spans that would normally be load-balanced to it are simply dropped rather than re-routed to host1. If that is the case, the README.md should make it clear.

alexchowle avatar Feb 13 '24 13:02 alexchowle

Pinging code owners:

  • exporter/loadbalancing: @jpkrohling

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] avatar Feb 13 '24 13:02 github-actions[bot]

Same question for the DNS and k8s resolvers, to be fair. It feels like the k8s one will cope, but I'm unsure. Would the DNS one need the "A" record to be changed if a host were down?

alexchowle avatar Feb 13 '24 15:02 alexchowle

This was discussed via Slack, here: https://cloud-native.slack.com/archives/C01N6P7KR6W/p1707838013838759

Here's a long version of what happens behind the scenes. I'd appreciate it if this could be summarized and added to the README:

The load-balancing exporter creates one exporter per endpoint, no matter the resolver (static, k8s, DNS). These per-endpoint exporters can be fine-tuned with options for the sending queue and retry mechanisms. This means that if a network hiccup occurs and a data point cannot be delivered, the exporter will retry delivery periodically and might eventually give up. The load-balancing exporter will NOT attempt to re-route the data to a healthy endpoint.

Concretely:

  • if a host from the static list is down, all telemetry routed to it will fail to be delivered
  • if a scaling event happens and an endpoint is removed, the in-flight data destined for that endpoint will likely be retried until it eventually fails. Therefore, for highly elastic environments, it's probably a good idea to tweak the sending queue and retry mechanisms, perhaps even disabling them altogether
  • when using k8s, DNS, and likely other future resolvers (AWS Cloud Map is close to being added), topology changes are eventually reflected in the load-balancing exporter. Some resolvers pick up changes more quickly than others (k8s is quicker than DNS), but there's still a window of time in which the topology has changed and the load-balancing exporter hasn't been updated yet
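
As a rough sketch of the tuning mentioned above, the configuration below extends the static example from the issue with retry and queue settings under protocol.otlp. The retry_on_failure and sending_queue keys are the standard exporter helper options that the OTLP exporter accepts; the specific values are assumptions chosen for illustration, not recommendations, and should be checked against the README for the Collector version in use.

```yaml
exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true
        # Illustrative values only (assumptions, not defaults or recommendations).
        retry_on_failure:
          enabled: true
          initial_interval: 1s    # wait before the first retry
          max_interval: 5s        # upper bound on the backoff between retries
          max_elapsed_time: 30s   # stop retrying a batch after this long
        sending_queue:
          enabled: true
          queue_size: 500         # bounds how much data waits per endpoint
    resolver:
      static:
        hostnames:
          - host1:4317
          - host2:4317
```

With settings like these, data routed to a stopped host2 is retried for a bounded time and then dropped; it is still never sent to host1. Setting enabled: false on either block turns that mechanism off, which may suit highly elastic environments where endpoints disappear often.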

jpkrohling avatar Feb 14 '24 09:02 jpkrohling

Removing needs triage as it looks like the question has been answered, and the code owner has approved putting this in the README.

@alexchowle Would you be interested in posting a PR with README updates that explain this functionality?

crobert-1 avatar Feb 26 '24 20:02 crobert-1

It's already been merged in

alexchowle avatar Feb 26 '24 21:02 alexchowle

Thank you! Sorry, missed the PR: #31271

crobert-1 avatar Feb 26 '24 21:02 crobert-1