opentelemetry-collector-contrib
[Loadbalancingexporter] - Seeing far fewer traces received on Tier 2 collectors after moving to v0.94.0
Component(s)
No response
Describe the issue you're reporting
We have two collector tiers, tier 1 and tier 2. The tier 1 collectors have a loadbalancingexporter configured to export spans to the tier 2 collectors.
We are monitoring the number of spans received on the tier 1 and tier 2 collectors using the metric otelcol_receiver_accepted_spans.
When we upgraded all our OpenTelemetry components to v0.94.0, I started seeing a huge mismatch between the number of spans accepted on tier 1 vs tier 2. It worked fine up to v0.93.0. Can someone please help?
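For anyone unfamiliar with this kind of setup, a minimal tier 1 configuration looks roughly like the sketch below; the endpoint and the tier 2 service name are illustrative placeholders, not values taken from this issue:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  loadbalancing:
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      # the k8s resolver spreads spans across the tier 2 pods behind this (placeholder) service
      k8s:
        service: otel-collector-tier2.observability

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]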
Pinging code owners for exporter/loadbalancing: @jpkrohling. See Adding Labels via Comments if you do not have permissions to add labels yourself.
I think this is a result of https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/30141, which aggregates exported data to make it more efficient. This likely reduced the output numbers of the metric you're referencing, meaning there's no data actually being dropped or missed.
Are you missing any data you expected? Is there any reason for concern, in addition to the otelcol_receiver_accepted_spans metric being lower? Just want to make sure data isn't actually being dropped here.
The number of spans should have remained stable; batching should reduce the number of requests but not the amount of data.
Would you be able to share the relevant parts of your config? Bonus points if you can share a telemetrygen command that exemplifies the problem.
Just adding some more info here for others who may have found this issue. This may help other people investigate while we are also trying to figure out stuff internally.
We are seeing the same pattern, and only in certain clusters where we have heavy traffic, in the region of 5 million spans per second. We also suspect it is the change from 0.93.0 to 0.94.0 in the named PR, because there are only 2 metrics where we can spot an obvious difference: otelcol_loadbalancer_num_resolutions and otelcol_loadbalancer_backend_outcome. Attaching the 7-day graph, where the switch to 0.94.0 was done prior to our normal peak traffic starting just after 00:00-ish. Then we see basically a flat line for otelcol_loadbalancer_backend_outcome, while at the same time otelcol_loadbalancer_num_resolutions tracks almost perfectly the number of pods we have at layer-2. There is also very evident (CPU) throttling happening while it is not shipping, and there appear to be "hotspots" across pods, i.e. some pods were doing fine while others had high CPU/memory usage while not shipping.
Note that we have autoscaling in both layer-1 and layer-2, hence the massive fluctuations in the number of backends; we have double checked that it correctly matches the number of pods in layer-2. (Yes, we have that many pods in layer-2.)
One thing which we have not tried - at the time of writing - is to switch the resolver from k8s to dns, pointed at the headless service of layer-2. It is not obvious that this is the cause, but the limited amount of profiling we managed to do while trying to restore service did show heavy memory usage in the k8s API machinery. Unfortunately, I am not too familiar with how the different resolvers behave and where/when they are triggered to get a new list of endpoints.
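For illustration only, and assuming a headless service in front of layer-2 (the service name and namespace below are placeholders, not our real setup), the resolver switch we have in mind would look roughly like this:

exporters:
  loadbalancing:
    resolver:
      # current setup: the k8s resolver watches endpoints through the Kubernetes API
      # k8s:
      #   service: otel-collector-layer2.observability
      # untested alternative: the dns resolver re-resolves the headless service on an
      # interval instead of going through the Kubernetes API machinery
      dns:
        hostname: otel-collector-layer2-headless.observability.svc.cluster.local
        port: 4317
        interval: 30s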
Also worth mentioning is that we are not seeing the same behaviour in other clusters that have a stable throughput at just under 1 million spans a second. Furthermore, we have in fact upgraded all the way to version 0.97.0 in other clusters that do not go through heavy burst cycles, and have found an improvement in (total) CPU usage of between 10% and 20%.
I will see if I can replicate this with telemetrygen locally when I get some time, but that may not be any time soon, or even worthwhile if having a static list of targets does not trigger this behaviour.
Edit: Happy to take any guidance from the maintainers on what we can look out for, or adjustments in settings, for the next time we try an upgrade :)
After quite a bit of experimentation and trying to understand the new behaviour, we have successfully upgraded to 0.97.0 everywhere. It turns out the biggest problem was the batch size in the batch processor in the pipeline just in front of the load balancing exporter. We have our batch processor as the last step of the processors pipeline. The default setting (the commented-out part)
processors:
  batch:
    # the two settings below are the defaults
    # send_batch_max_size: 0
    # send_batch_size: 8192
can lead to a scenario where the batch is very large under heavy load, as it is not bounded. Lowering both send_batch_size and send_batch_max_size has allowed us to return to acceptable CPU/memory usage and the number of exports back to normal. We have yet to find a "perfect" max batch size; our current max size is in the region of 4000 and the collector is working fine.
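For completeness, a bounded configuration along the lines described above might look like this; setting both values to 4000 is only illustrative, since 4000 is merely the rough max we are currently running with rather than a tuned recommendation:

processors:
  batch:
    # bound the batch explicitly so a traffic burst cannot produce an arbitrarily large batch
    send_batch_size: 4000
    send_batch_max_size: 4000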
I am not familiar with the codebase, but from my limited understanding after looking at the code, it feels like the aforementioned PR changed the interaction from:
- Original batch -> ship in a FIFO manner.
to
- Original batch -> Allocate and split into new batches according to the number of endpoints -> Ship new batches.
This change means that we would now need 2x the previous memory (in the worst case), and there is also the possibility that spans are stuck in the exporter for far longer than desired, as it tries to re-batch and fails to handle the continuous pressure from the receiver. I am also not sure whether a change in the backends - a layer-2 pod terminating - during this re-batch process will cause additional issues.
During our debugging, we also discovered a potentially missing tag in the metrics. Namely, https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/27840d58551a1887833e26a1af279aa667cf46ca/exporter/loadbalancingexporter/trace_exporter.go#L122-L123 suggests that success should be available, but it is not defined in the histogram https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/27840d58551a1887833e26a1af279aa667cf46ca/exporter/loadbalancingexporter/metrics.go#L53-L61, while it is in the corresponding count https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/27840d58551a1887833e26a1af279aa667cf46ca/exporter/loadbalancingexporter/metrics.go#L62-L71. To be honest, though, these metrics didn't provide much insight on this particular occasion.
Hey @edwintye, we are observing the same issue as you. I wanted to also perform the Collector profiling; how did you manage to get that working? 🤔
I have a feeling that this is fixed in v0.101.0, as part of another issue reporting a memory issue with the exporter. Once that version is out (should be out tomorrow, 21 May 2024), could you give it a try and report back?