
Increased request latency in 0.106.1

Open arielvalentin opened this issue 1 year ago • 17 comments

Describe the bug

This issue was originally reported in CNCF Slack by Srinivas Bevara and confirmed by @codeboten, who asked me to open this issue.

After upgrading to v0.106.1 from v0.104.0, we have run into an increase in request latency reported by both the collector's internal telemetry and our ingress proxies.

Memory remains unaffected; however, pprof shows that garbage collection has increased due to allocations related to internal telemetry tracing:

[pprof screenshots: allocation and GC profiles]

This is unexpected since tracing telemetry is disabled in our deployments.

Steps to reproduce

Run a load test comparing v0.104.0 and v0.106.1
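
One way to drive comparable load against each version is sketched below (this assumes the contrib telemetrygen tool; the endpoint, worker count, and duration are illustrative):

telemetrygen traces --otlp-endpoint localhost:4317 --otlp-insecure --workers 4 --duration 60s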

What did you expect to see?

  • p99 request latency and throughput should be comparable between both versions (~74ms)
  • Internal telemetry must not generate recording spans when it is disabled

What did you see instead?

  • An increase in p99 latency to > 1s
  • Profiling reveals increased object allocations related to internal telemetry span data
  • Profiling reveals an increase in time spent in GC

What version did you use?

v0.106.1

What config did you use?

Our configuration has many transformations and connector pipelines. The bare minimum you would likely need is the configuration you use for benchmarking.

Environment

Ubuntu 20.04.1

Additional context

Screen shots of increased p99 latency and CPU and steady memory

[screenshots: p99 latency, CPU, and memory graphs]

arielvalentin avatar Aug 12 '24 20:08 arielvalentin

After some discussion amongst maintainers, we want to provide a quick fix for this issue via the feature gate service.noopTracerProvider, and further investigate the cause of this problem before deciding whether there should be a permanent configuration option to set a no-op tracer provider.

Returning a noop tracer provider if no tracing is configured in the telemetry settings would break the current behaviour, which the zpages extension relies on to provide telemetry about traces.
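
For reference, enabling the gate from the command line looks roughly like this (the binary name and config path are illustrative; the flag itself appears later in this thread):

otelcol --config /etc/otelcol/config.yaml --feature-gates service.noopTracerProvider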

codeboten avatar Aug 12 '24 23:08 codeboten

Was there a recent change in zpages that resulted in an increase in the number of spans being allocated?

arielvalentin avatar Aug 13 '24 01:08 arielvalentin

Thanks, @arielvalentin, for taking this up and raising a GitHub issue.

bvsvas avatar Aug 16 '24 06:08 bvsvas

@codeboten ~did this feature flag make it into v0.107.0?~

Yes it did https://github.com/open-telemetry/opentelemetry-collector/releases/tag/v0.107.0

arielvalentin avatar Aug 16 '24 15:08 arielvalentin

I recently upgraded to v0.107.0 from v0.104.0. During off-peak times performance is about on par, but during our peak times our p99 skyrockets and memory spikes, forcing the memory limiter to trigger hard limits and respond with 503s.

I have enabled the feature gate to ensure we are using the noop tracer, but there still seems to be an increase in memory allocations and latency.

Our heavy usage of OTTL has not changed much between versions, and as far as I can tell that is what appears most in the profiles, along with a high number of context objects being allocated by the noop tracer.

profiles-20240823.tar.gz

arielvalentin avatar Aug 24 '24 15:08 arielvalentin

Re-opening to continue the investigation

codeboten avatar Aug 24 '24 15:08 codeboten

@arielvalentin Did the OTTL performance degrade, or are you saying it remained the majority?

TylerHelmuth avatar Aug 24 '24 15:08 TylerHelmuth

The profiles from 104 showed that OTTL execute was where most of the time and allocations went, but our performance was pretty good: ~15K spans per second per collector, with a p99 of 19ms.

Other than the increased allocations of context objects related to the noop tracer, the profiles from 104 and 107 look about the same to me, so I don't have a reason to believe there is a regression in OTTL performance.

I'll see if I can dig up my 104 profiles and attach them as well.

arielvalentin avatar Aug 24 '24 15:08 arielvalentin

From @arielvalentin

I have enabled the feature gate to ensure we are using the noop tracer, but there still seems to be an increase in memory allocations and latency.

Other than the increased allocations of context objects related to the noop tracer, the profiles from 104 and 107 look about the same to me, so I don't have a reason to believe there is a regression in OTTL performance.

So can service.noopTracerProvider be considered a workaround?

pavolloffay avatar Aug 26 '24 09:08 pavolloffay

So can service.noopTracerProvider be considered a workaround?

I would say "no" because

there still seems to be an increase in memory allocations and latency.

iblancasa avatar Aug 26 '24 09:08 iblancasa

Investigating this further, this might be related to https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/33508

codeboten avatar Aug 26 '24 17:08 codeboten

I just ran load tests for v0.107.0 and here's what I have when I enabled the noop tracer provider feature gate:

Before using --feature-gates service.noopTracerProvider

Screenshot 2024-08-26 at 11 39 03 AM

With --feature-gates service.noopTracerProvider

Screenshot 2024-08-26 at 11 38 00 AM

This seems different from what you're seeing, @arielvalentin. Can you confirm that the feature gate is enabled?

codeboten avatar Aug 26 '24 18:08 codeboten

Yes. We currently run it with --feature-gates service.noopTracerProvider

You can see in the profiles I attached that the noop tracer is used:

[profile screenshot]

arielvalentin avatar Aug 26 '24 20:08 arielvalentin

Here are profiles for 104: profiles-104.tar.gz

arielvalentin avatar Aug 26 '24 21:08 arielvalentin

Thanks @arielvalentin, I missed that when I looked at your attached profiles a couple of days ago. Good to know that using the no-op tracer provider at least reduces the overall memory consumption, even if it's still higher than before the change to add tracing to OTTL.

codeboten avatar Aug 26 '24 21:08 codeboten

No worries @codeboten

Something I may not have mentioned already: my colleague @antonio had observed the latency regression as early as 105.

We may downgrade to 105 and collect profiles there as well.

arielvalentin avatar Aug 27 '24 14:08 arielvalentin

Something I may not have mentioned already: my colleague @antonio had observed the latency regression as early as 105.

That's correct. Notice how request duration went significantly up in the graph below, when I tested 105 back in July.

[request duration graph]

Investigating this further, this might be related to https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/33508

The time when this PR was merged does indeed align with when we started seeing the issue.

antonio avatar Aug 27 '24 15:08 antonio

Just adding what I've found here real quick:

I've done a quick benchmark to see the difference between an actual tracer implementation, a noop tracer, and all tracing removed in OTTL, via the transform processor. You can see the benchmark here. All it does is run the transform processor with a batch of 100 logs, with a simple set operation (which seems like a pretty normal workload).
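
For context, that workload is roughly the kind of thing sketched below in transform processor config terms (not the exact benchmark setup; the attribute name is made up):

processors:
  transform:
    log_statements:
      - context: log
        statements:
          - set(attributes["test"], "pass")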

These are the results:

goos: darwin
goarch: arm64
pkg: github.com/open-telemetry/opentelemetry-collector-contrib/processor/transformprocessor
cpu: Apple M3 Pro
                 │ no-tracing.txt │           noop-tracer.txt            │           test-tracer.txt            │
                 │     sec/op     │    sec/op     vs base                │    sec/op     vs base                │
SetLogRecords-12      50.29µ ± 1%   120.55µ ± 4%  +139.70% (p=0.002 n=6)   426.21µ ± 5%  +747.43% (p=0.002 n=6)

                 │ no-tracing.txt │            noop-tracer.txt            │            test-tracer.txt            │
                 │      B/op      │     B/op       vs base                │     B/op       vs base                │
SetLogRecords-12     96.21Ki ± 0%   227.45Ki ± 0%  +136.42% (p=0.002 n=6)   758.38Ki ± 1%  +688.28% (p=0.002 n=6)

                 │ no-tracing.txt │           noop-tracer.txt           │           test-tracer.txt           │
                 │   allocs/op    │  allocs/op   vs base                │  allocs/op   vs base                │
SetLogRecords-12      1.402k ± 0%   3.802k ± 0%  +171.18% (p=0.002 n=6)   9.802k ± 0%  +599.14% (p=0.002 n=6)

You can see that even the noop tracer increases these benchmarks by quite a bit. And it lines up with what I'm seeing in both the issue I posted and here: the overhead of tracing each individual evaluation of the StatementSequence seems like overkill.

Of course, if you do more processing in a single OTTL statement, you'd probably get numbers that look a little less striking, so take the exact numbers here with a grain of salt. I don't think the scenario is all too unlikely, though.

It feels like there should not be tracing in StatementSequence at all, but maybe a few metric counters instead. The overhead of the tracing just doesn't seem worth it.
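
To isolate just the span lifecycle cost, a stand-alone micro-benchmark along the lines below (not the benchmark linked above; it assumes the otel-go noop package) shows where the extra context allocations come from:

package tracingbench

import (
	"context"
	"testing"

	"go.opentelemetry.io/otel/trace/noop"
)

// BenchmarkNoopSpanStart measures only the cost of starting and ending a span
// with a no-op tracer. Even the no-op Start derives a new context per call,
// which is where the extra allocations show up.
func BenchmarkNoopSpanStart(b *testing.B) {
	tracer := noop.NewTracerProvider().Tracer("ottl-bench")
	ctx := context.Background()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		_, span := tracer.Start(ctx, "statement-sequence")
		span.End()
	}
}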

BinaryFissionGames avatar Aug 28 '24 02:08 BinaryFissionGames

I have got some more for you here: v0.107.4 profiles-20240828.tar.gz

cc: @songy23

arielvalentin avatar Aug 28 '24 15:08 arielvalentin

I am going to submit a PR to revert the OTTL tracing stuff.

TylerHelmuth avatar Aug 28 '24 16:08 TylerHelmuth

I am going to submit a PR to revert the OTTL tracing stuff.

How about placing it behind a feature gate?

iblancasa avatar Aug 28 '24 16:08 iblancasa

Thanks again for investigating. I am on the path to downgrade to 0.104.0 and will be eagerly awaiting the next release.

Once it's out, I will deploy to a subset of our canary instances and report back profiles and latency info.

arielvalentin avatar Aug 28 '24 21:08 arielvalentin

@bvsvas have you all had a chance to generate profiles?

IIRC you all were not using OTTL, so the latency you reported would not be related to the transform processors, right?

This is the snippet you posted in Slack:

service:
  extensions: [apmauth, pprof]
  pipelines:
    traces:
      receivers: [otlp/withauth]
      processors: [probabilistic_sampler]
      exporters: [customexporter]
    metrics:
      receivers: [otlp/withauth, prometheus]
      processors: [cumulativetodelta, resource]
      exporters: [customexporter]

arielvalentin avatar Aug 29 '24 13:08 arielvalentin

👋🏼 Is there a release schedule/project board handy I can subscribe to?

arielvalentin avatar Sep 04 '24 14:09 arielvalentin

The release schedule is documented here: https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/release.md#release-schedule

codeboten avatar Sep 04 '24 18:09 codeboten

I also faced this same bug. Another workaround: set GOMEMLIMIT so that GC kicks in to free up the memory.

https://weaviate.io/blog/gomemlimit-a-game-changer-for-high-memory-applications#adding-gomemlimit-to-weaviate--before--after
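
For example (the value and binary name are illustrative; size the limit below your container's memory limit):

GOMEMLIMIT=1600MiB otelcol --config /etc/otelcol/config.yaml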

deepakd10 avatar Feb 04 '25 04:02 deepakd10