
Increased request latency in 0.106.1

Open arielvalentin opened this issue 1 year ago • 17 comments

Describe the bug

This issue was originally reported in CNCF Slack by Srinivas Bevara and confirmed by @codeboten, who asked me to open this issue.

After upgrading to v0.106.1 from v0.104.0, we have run into an increase in request latency reported by both the collector's internal telemetry and our ingress proxies.

Memory remains unaffected; however, pprof shows that garbage collection has increased due to allocations related to internal telemetry tracing:

[pprof screenshots: allocation and GC profiles]

This is unexpected since tracing telemetry is disabled in our deployments.

Steps to reproduce

Run a load test comparing v0.104.0 and v0.106.1
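
One way to drive comparable load against each version is sketched below (this assumes the contrib telemetrygen tool; the endpoint, worker count, and duration are illustrative):

telemetrygen traces --otlp-endpoint localhost:4317 --otlp-insecure --workers 4 --duration 60s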

What did you expect to see?

  • p99 request latency and throughput should be comparable between both versions (~74ms)
  • Internal telemetry must not generate recording spans when it is disabled

What did you see instead?

  • An increase in p99 latency to > 1s
  • Profiling reveals increased object allocations related to internal telemetry span data
  • Profiling reveals an increase in time spent in GC

What version did you use?

v0.106.1

What config did you use?

Our configuration has many transformations and connector pipelines. The bare minimum you would likely need is the configuration you use for benchmarking.

Environment

Ubuntu 20.04.1

Additional context

Screen shots of increased p99 latency and CPU and steady memory

[screenshots: p99 latency, CPU, and memory graphs]

arielvalentin avatar Aug 12 '24 20:08 arielvalentin

After some discussion amongst maintainers, we want to provide a quick fix for this issue via the feature gate service.noopTracerProvider, and further investigate the cause of this problem before deciding whether there should be a permanent configuration option to set a no-op tracer provider.

Returning a noop tracer provider if no tracing is configured in the telemetry settings would break the current behaviour, which the zpages extension relies on to provide telemetry about traces.
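
For reference, enabling the gate from the command line looks roughly like this (the binary name and config path are illustrative; the flag itself appears later in this thread):

otelcol --config /etc/otelcol/config.yaml --feature-gates service.noopTracerProvider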

codeboten avatar Aug 12 '24 23:08 codeboten

Was there a recent change in zpages that resulted in an increase in the number of spans being allocated?

arielvalentin avatar Aug 13 '24 01:08 arielvalentin

Thanks, @arielvalentin, for taking this up and raising a GitHub issue.

bvsvas avatar Aug 16 '24 06:08 bvsvas

@codeboten ~did this feature flag make it into v0.107.0?~

Yes it did https://github.com/open-telemetry/opentelemetry-collector/releases/tag/v0.107.0

arielvalentin avatar Aug 16 '24 15:08 arielvalentin

I recently upgraded to v0.107.0 from v0.104.0. During off-peak times performance is about on par, but during our peak times our p99 skyrockets and memory spikes, forcing the memory limiter to trigger hard limits and respond with 503s.

I have enabled the feature gate to ensure we are using the noop tracer, but there still seems to be an increase in memory allocations and latency.

Our heavy usage of OTTL has not changed much between versions, and as far as I can tell that is what appears most in the profiles, along with a high number of context objects being allocated by the noop tracer.

profiles-20240823.tar.gz

arielvalentin avatar Aug 24 '24 15:08 arielvalentin

Re-opening to continue the investigation

codeboten avatar Aug 24 '24 15:08 codeboten

@arielvalentin Did the OTTL performance degrade, or are you saying it remained the majority?

TylerHelmuth avatar Aug 24 '24 15:08 TylerHelmuth

The profiles from 104 showed that OTTL execute was where most of the time and allocations went, but our performance was pretty good: ~15K spans per second per collector, with a p99 of 19ms.

Other than the increased allocations of context objects related to the noop tracer, the profiles from 104 and 107 look about the same to me, so I don't have a reason to believe there is a regression in OTTL performance.

I'll see if I can dig up my 104 profiles and attach them as well.

arielvalentin avatar Aug 24 '24 15:08 arielvalentin

From @arielvalentin

I have enabled the feature gate to ensure we are using the noop tracer, but there still seems to be an increase in memory allocations and latency.

Other than the increased allocations of context objects related to the noop tracer, the profiles from 104 and 107 look about the same to me, so I don't have a reason to believe there is a regression in OTTL performance.

So can service.noopTracerProvider be considered a workaround?

pavolloffay avatar Aug 26 '24 09:08 pavolloffay

So can service.noopTracerProvider be considered a workaround?

I would say "no" because

there still seems to be an increase in memory allocations and latency.

iblancasa avatar Aug 26 '24 09:08 iblancasa

Investigating this further, this might be related to https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/33508

codeboten avatar Aug 26 '24 17:08 codeboten

I just ran load tests for v0.107.0 and here's what I have when I enabled the noop tracer provider feature gate:

Before using --feature-gates service.noopTracerProvider

Screenshot 2024-08-26 at 11 39 03 AM

With --feature-gates service.noopTracerProvider

Screenshot 2024-08-26 at 11 38 00 AM

This seems different from what you're seeing, @arielvalentin. Can you confirm that the feature gate is enabled?

codeboten avatar Aug 26 '24 18:08 codeboten

Yes. We currently run it with --feature-gates service.noopTracerProvider

You can see in the profiles I attached that the noop tracer is used:

[profile screenshot]

arielvalentin avatar Aug 26 '24 20:08 arielvalentin

Here are profiles for 104: profiles-104.tar.gz

arielvalentin avatar Aug 26 '24 21:08 arielvalentin

Thanks @arielvalentin, I missed that when I looked at your attached profiles a couple of days ago. Good to know that using the no-op tracer provider at least reduces the overall memory consumption, even if it's still higher than before the change to add tracing to OTTL.

codeboten avatar Aug 26 '24 21:08 codeboten

No worries @codeboten

Something I may not have mentioned already: my colleague @antonio had observed the latency regression as early as 105.

We may downgrade to 105 and collect profiles there as well.

arielvalentin avatar Aug 27 '24 14:08 arielvalentin

Something I may not have mentioned already: my colleague @antonio had observed the latency regression as early as 105.

That's correct. Notice how request duration went significantly up in the graph below, when I tested 105 back in July.

[request duration graph]

Investigating this further, this might be related to https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/33508

The time when this PR was merged does indeed align with when we started seeing the issue.

antonio avatar Aug 27 '24 15:08 antonio

Just adding what I've found here real quick:

I've done a quick benchmark to see the difference between an actual tracer implementation, a noop tracer, and all tracing removed in OTTL, via the transform processor. You can see the benchmark here. All it does is run the transform processor with a batch of 100 logs, with a simple set operation (which seems like a pretty normal workload).
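
For context, that workload is roughly the kind of thing sketched below in transform processor config terms (not the exact benchmark setup; the attribute name is made up):

processors:
  transform:
    log_statements:
      - context: log
        statements:
          - set(attributes["test"], "pass")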

These are the results:

goos: darwin
goarch: arm64
pkg: github.com/open-telemetry/opentelemetry-collector-contrib/processor/transformprocessor
cpu: Apple M3 Pro
                 │ no-tracing.txt │           noop-tracer.txt            │           test-tracer.txt            │
                 │     sec/op     │    sec/op     vs base                │    sec/op     vs base                │
SetLogRecords-12      50.29µ ± 1%   120.55µ ± 4%  +139.70% (p=0.002 n=6)   426.21µ ± 5%  +747.43% (p=0.002 n=6)

                 │ no-tracing.txt │            noop-tracer.txt            │            test-tracer.txt            │
                 │      B/op      │     B/op       vs base                │     B/op       vs base                │
SetLogRecords-12     96.21Ki ± 0%   227.45Ki ± 0%  +136.42% (p=0.002 n=6)   758.38Ki ± 1%  +688.28% (p=0.002 n=6)

                 │ no-tracing.txt │           noop-tracer.txt           │           test-tracer.txt           │
                 │   allocs/op    │  allocs/op   vs base                │  allocs/op   vs base                │
SetLogRecords-12      1.402k ± 0%   3.802k ± 0%  +171.18% (p=0.002 n=6)   9.802k ± 0%  +599.14% (p=0.002 n=6)

You can see that even the noop tracer increases these benchmarks by quite a bit. And it lines up with what I'm seeing in both the issue I posted and here: the overhead of tracing each individual evaluation of the StatementSequence seems like overkill.

Of course, if you do more processing in a single OTTL statement, you'd probably get numbers that look a little less striking, so take the exact numbers here with a grain of salt. I don't think the scenario is all too unlikely, though.

It feels like there should not be tracing in StatementSequence at all, but maybe a few metric counters instead. The overhead of the tracing just doesn't seem worth it.
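
To isolate just the span lifecycle cost, a stand-alone micro-benchmark along the lines below (not the benchmark linked above; it assumes the otel-go noop package) shows where the extra context allocations come from:

package tracingbench

import (
	"context"
	"testing"

	"go.opentelemetry.io/otel/trace/noop"
)

// BenchmarkNoopSpanStart measures only the cost of starting and ending a span
// with a no-op tracer. Even the no-op Start derives a new context per call,
// which is where the extra allocations show up.
func BenchmarkNoopSpanStart(b *testing.B) {
	tracer := noop.NewTracerProvider().Tracer("ottl-bench")
	ctx := context.Background()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		_, span := tracer.Start(ctx, "statement-sequence")
		span.End()
	}
}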

BinaryFissionGames avatar Aug 28 '24 02:08 BinaryFissionGames

I have got some more for you here: v0.107.4 profiles-20240828.tar.gz

cc: @songy23

arielvalentin avatar Aug 28 '24 15:08 arielvalentin

I am going to submit a PR to revert the OTTL tracing stuff.

TylerHelmuth avatar Aug 28 '24 16:08 TylerHelmuth

I am going to submit a PR to revert the OTTL tracing stuff.

How about placing it behind a feature gate?

iblancasa avatar Aug 28 '24 16:08 iblancasa

Thanks again for investigating. I am on the path to downgrade to 0.104.0 and will be eagerly awaiting the next release.

Once it's out, I will deploy to a subset of our canary instances and report back profiles and latency info.

arielvalentin avatar Aug 28 '24 21:08 arielvalentin

@bvsvas have you all had a chance to generate profiles?

IIRC you all were not using OTTL, so the latency you reported would not be related to the transform processors, right?

This is the snippet you posted in Slack:

service:
  extensions: [apmauth, pprof]
  pipelines:
    traces:
      receivers: [otlp/withauth]
      processors: [probabilistic_sampler]
      exporters: [customexporter]
    metrics:
      receivers: [otlp/withauth, prometheus]
      processors: [cumulativetodelta, resource]
      exporters: [customexporter]

arielvalentin avatar Aug 29 '24 13:08 arielvalentin

👋🏼 Is there a release schedule/project board handy I can subscribe to?

arielvalentin avatar Sep 04 '24 14:09 arielvalentin

The release schedule is documented here: https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/release.md#release-schedule

codeboten avatar Sep 04 '24 18:09 codeboten

I also faced this same bug. Another workaround: set GOMEMLIMIT so that GC kicks in to free up the memory.

https://weaviate.io/blog/gomemlimit-a-game-changer-for-high-memory-applications#adding-gomemlimit-to-weaviate--before--after
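
For example (the value and binary name are illustrative; size the limit below your container's memory limit):

GOMEMLIMIT=1600MiB otelcol --config /etc/otelcol/config.yaml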

deepakd10 avatar Feb 04 '25 04:02 deepakd10