
[processor/transform] Tracing taking more CPU resources than expected

Open BinaryFissionGames opened this issue 1 year ago • 2 comments

Component(s)

processor/transform

What happened?

Description

Not sure if "bug" is the right label for this, but the tracing in the transform processor appears to consume a significant share of CPU time in high-throughput scenarios.

We had a report of high CPU usage on a collector (monitoring ~10k files), and we got a profile for it. One interesting insight from the profile: roughly 40% of the transform processor's time was spent in `tracer.Start`.

[Screenshot: CPU profile, taken 2024-08-27 4:16 PM]

That seems like a lot of resources to spend there, especially since this collector is not configured to emit traces at all, so the tracing effectively does nothing. I'm wondering if this is something we can optimize? I haven't looked deeply enough into it yet to understand the pitfalls.

Collector version

0.104.0

Environment information

No response

OpenTelemetry Collector configuration

No response

Log output

No response

Additional context

No response

BinaryFissionGames avatar Aug 27 '24 20:08 BinaryFissionGames

Pinging code owners:

  • processor/transform: @TylerHelmuth @kentquirk @bogdandrutu @evan-bradley

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] avatar Aug 27 '24 20:08 github-actions[bot]

OK, so I dug into the code a bit, and it looks like a span is created for each statement. So if I execute, say, 10 statements, I'll get 10 spans for each log record flowing through the transform processor (and this actually applies everywhere OTTL statements are used, not just the transform processor, AFAICT).
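To make that fan-out concrete, here's a rough back-of-envelope sketch (the statement and record counts below are made up for illustration): span volume scales as records × statements, so even a modest pipeline issues a large number of `tracer.Start` calls per second.

```go
package main

import "fmt"

// spanStartsPerSecond estimates how many tracer.Start calls the
// per-statement instrumentation issues: one span per statement per record.
func spanStartsPerSecond(statements, recordsPerSecond int) int {
	return statements * recordsPerSecond
}

func main() {
	// Hypothetical load: 10 OTTL statements, 50k log records per second.
	fmt.Println(spanStartsPerSecond(10, 50000)) // 500000 span starts per second
}
```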

I think it would be nice to be able to dial that back a bit, maybe through a config setting. It just seems like a lot of tracing for the smaller contexts (datapoint, span, spanevent, logrecord), and it eats a decent amount of CPU.
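A minimal sketch of the kind of knob being suggested, gating per-statement spans behind a config flag so the hot path skips `tracer.Start` entirely. `StatementSequence`, `enableStatementSpans`, and `startSpan` are illustrative names, not the actual OTTL API.

```go
package main

import "fmt"

// Statement is a stand-in for a compiled OTTL statement.
type Statement func(record string) string

// StatementSequence runs a list of statements over each record, with
// per-statement tracing controlled by a hypothetical config flag.
type StatementSequence struct {
	statements           []Statement
	enableStatementSpans bool                      // hypothetical config setting
	startSpan            func(name string) func() // stand-in for tracer.Start; returns span.End
}

// Execute runs every statement against the record, wrapping each one in a
// span only when the flag is enabled.
func (s *StatementSequence) Execute(record string) string {
	for i, stmt := range s.statements {
		if s.enableStatementSpans {
			end := s.startSpan(fmt.Sprintf("statement %d", i))
			record = stmt(record)
			end()
			continue
		}
		record = stmt(record) // hot path: no span creation at all
	}
	return record
}

func main() {
	spans := 0
	seq := &StatementSequence{
		statements: []Statement{
			func(r string) string { return r + "-a" },
			func(r string) string { return r + "-b" },
		},
		enableStatementSpans: false, // per-statement tracing dialed back
		startSpan: func(string) func() {
			spans++
			return func() {}
		},
	}
	fmt.Println(seq.Execute("rec"), spans)
}
```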

BinaryFissionGames avatar Aug 27 '24 21:08 BinaryFissionGames

@codeboten and I were looking at this for https://github.com/open-telemetry/opentelemetry-collector/issues/10858. The tracing is also consuming a lot of memory.

The only thing we've been able to identify is that the ctx passed into the SDK is very large, but it isn't clear to me why the OTel Go SDK is underperforming in this scenario.

TylerHelmuth avatar Aug 27 '24 23:08 TylerHelmuth

Also, scaling back the per-statement spans won't help much: in our testing, most of the performance hit comes from the first, outer span.

TylerHelmuth avatar Aug 27 '24 23:08 TylerHelmuth

Thanks for pointing me to that other issue. I'll probably end up investigating more myself tomorrow.

BinaryFissionGames avatar Aug 28 '24 00:08 BinaryFissionGames