opentelemetry-cpp Make GetSpan return a statically allocated invalid span

trafficstars

I’ve observed that the current implementation of GetSpan performs a new allocation each time it is called if no active span is found. Since DefaultSpan is immutable (or at least intended to be), it seems reasonable to return a statically allocated invalid span instead.

This approach could reduce unnecessary heap allocations and improve performance. However, it also introduces the possibility of someone inadvertently replacing the default invalid span with a valid one.

I’m not certain whether this change aligns with the intended design or if the allocation was deliberately chosen to provide more isolation for the default span. An alternative could be to initialize an invalid span at the start of the program if the goal is to eliminate the allocation overhead in production. Nonetheless, there might be other implications to consider.

Please consider this PR as a suggestion and an opportunity for discussion.

Aug 19 '24 07:08 michaelvanstraten

The committers listed above are authorized under a signed CLA.

:white_check_mark: login: michaelvanstraten / name: Michael (61bea775fe4af6356687b49c6d5a725cc516f443)
:white_check_mark: login: lalitb / name: Lalit Kumar Bhasin (7ff6178f29c0638eeb8b792e83e52c52cca1f3f7, 1de2123630ac5a7192e13bab109cb0d239e18dad, 93bfdb04e19f493d7faf37bd6126e582ad343e7f)

Aug 19 '24 07:08 linux-foundation-easycla[bot]

By initializing the invalid span at process start time, I mean setting it up like this:

void main() {
    std::shared_ptr<Span> invalid_span =
      std::make_shared<DefaultSpan>(SpanContext::GetInvalid());
    opentelemetry::trace::SetSpan(invalid_span);
    // ...
}

Aug 19 '24 07:08 michaelvanstraten

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 87.87%. Comparing base (497eaf4) to head (1de2123). Report is 152 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3037      +/-   ##
==========================================
+ Coverage   87.12%   87.87%   +0.75%     
==========================================
  Files         200      195       -5     
  Lines        6109     6138      +29     
==========================================
+ Hits         5322     5393      +71     
+ Misses        787      745      -42

Files with missing lines	Coverage Δ
api/include/opentelemetry/trace/context.h	`92.31% <100.00%> (+0.65%)`	:arrow_up:

... and 99 files with indirect coverage changes

Aug 19 '24 08:08 codecov[bot]

Just curious, where are you getting issue in usage of GetSpan?

One place it is used internally is during creation of new span in Tracer:StartSpan(), and usually object of type SpanContext (and not Context) is passed as the options.parent argument. This eliminates the call to GetSpan internally.
Another place it is used is in the TextMapPropagator::Inject() implementation. The application can maintain a static invalid span, and pass it to Inject method whenever the current context is invalid.

Just couple of thoughts came to my mind as the potential solution/workarounds to avoid any changes in GetSpan.

Aug 19 '24 12:08 lalitb

Just curious, where are you getting issue in usage of GetSpan?

I wouldn't necessarily call it an "issue," but it seems suboptimal for GetSpan to allocate a new span each time a default span is needed. Ideally, it would return a statically allocated span when no active span is found in the current context.

Are there any potential downsides to returning a statically allocated span as the default?

I'm currently working on integrating OpenTelemetry into parts of Gecko (Firefox), and depending on the level of instrumentation, this could potentially impact performance.

Aug 23 '24 22:08 michaelvanstraten

Are there any potential downsides to returning a statically allocated span as the default?

I believe not. It should be fine to return static allocation for an invalid span. I just don't understand how can the flow possibly reach there. Feel free to make it ready for review.

Aug 26 '24 17:08 lalitb

I just don't understand how can the flow possibly reach there. Feel free to make it ready for review.

If you don't have a root span in main, then this will happen (since we primarily use contexts to control when tracing is active).

void main() {
    for (int i = 0; i < 100,000; i++) {
        foo(); // Will result in 100,000 allocations
    }

    auto provider = opentelemetry::trace::Provider::GetTracerProvider();
    auto tracer = provider->GetTracer("foo_library", "1.0.0");
    
    auto span = tracer->StartSpan("foo");
    auto scope = tracer->WithActiveSpan(span);
    
    for (int i = 0; i < 100,000; i++) {
        foo(); // Will result in 0 allocations
    }
};

void foo() {
    auto active_span = opentelemetry::trace::GetSpan(
        opentelemetry::context::RuntimeContext::GetCurrent());
    active_span.AddEvent("something in foo");
}

Aug 27 '24 14:08 michaelvanstraten

Sorry for the confusion, my college mentioned some potential raise conditions which turned out to be not applicable here.

Aug 31 '24 21:08 michaelvanstraten

Sorry for the confusion, my college mentioned some potential raise conditions which turned out to be not applicable here.

Just to clarify, should we continue with this PR, or should it be closed (because of the race found) ?

As a side note, if the main concern is performances, the best course of action could be to avoid reaching this case in the first place (i.e., provide a root span in main), instead of optimizing non nominal use cases.

[Edit 2024-09-27]

As a side note, if the main concern is performances, the best course of action could be to avoid reaching this case in the first place (i.e., provide a root span in main), instead of optimizing non nominal use cases.

It turns out this is a nominal case, when linking an instrumented library in a non instrumented application, where by definition adding a top level span is not an option.

Sep 04 '24 20:09 marcalff

Yes, I've marked the PR as a draft for the time being where the race was unclear, but it does not apply here. So please continue.

The reason for this PR is that for users not familiar with the lib this might not be obvious.

If this change is too invasive, we can close this PR and add a note in the docs.

Sep 04 '24 20:09 michaelvanstraten

Use case

About the use case itself, i.e., to understand when this code path is involved:

Assume a library libL, instrumented with OpenTelemetry.

This library instruments some spans, so it invokes opentelemetry::trace::GetSpan() as reported.

The library does not create a top level span, because it does not have a main() function.

Now, assume an application that links with libL.

Instrumented application

When an application is instrumented for opentelemetry-cpp, the application created a top level span, in main() or in a major entry point.

During execution, there will be a trace in the runtime context, so the code mentioned in this report will not be executed.

Non instrumented application

When an application is not instrumented (no SDK, no exporters, nothing in the runtime trace context), all the calls from the instrumented library libL will use noop code.

Because there is no current span, the instrumentation code in the library will hit the code path reported.

This kind of deployment:

instrumented library (making API calls)
non instrumented application (no SDK installed)

is valid, and will happen a lot, so performances in this case is a valid concern.

Code change

What the code change proposes is to trade memory allocation for a shared singleton object.

Note however that this global object is referred to with a nostd::shared_ptr<Span>.

Each thread will use it's own copy of nostd::shared_ptr<Span>, pointing to a global, shared, object.

My concern is that the fix is not as simple as it seem, and it may actually make performance worse.

Behind a shared_ptr, there is a control block object, that keeps the reference count and the object.

This control block object is shared between all instances of shared_ptr, so that for a global span singleton, there will be an associated global control block.

Now, on each creation of a shared_ptr object, the reference count in the control block will be incremented, say with an atomic operation. For each object destruction, there will be an atomic decrement.

All the atomic increment and decrement, for all threads in the process, will all operate on the same memory, which is the globally shared control block.

This can lead to stalls for all threads, and can affect any part of the application code, in every place where a span is instrumented.

See issue #3010, the number of CPU cores for the machine used is 132.

On a machine with 132 CPU cores, every thread performing atomic operations, from every place in the code base that instruments a span, will make concurrent access on the same, global, unique, span control block.

While I don't have measurements to compare the two implementations (malloc+free per thread, versus atomic increment and decrement on the same global), the concern is that the proposed solution will not scale and causes bottlenecks.

Sep 27 '24 20:09 marcalff

Okay, I get the concern, make this a thread local should not be a problem.

Sep 27 '24 22:09 michaelvanstraten

Another question the, if we use a static here in the method and the method gets inlined, do we have separate statics for each inlining?

I will check compiler explorer maybe.

Sep 27 '24 22:09 michaelvanstraten

Deploy Preview for opentelemetry-cpp-api-docs canceled.

Name	Link
Latest commit	1de2123630ac5a7192e13bab109cb0d239e18dad
Latest deploy log	https://app.netlify.com/sites/opentelemetry-cpp-api-docs/deploys/6725483eb01dfc00085b1103

Oct 07 '24 14:10 netlify[bot]

@marcalff This should elevate the concerns with parallelism, since each thread gets its own instance.

Another question the, if we use a static here in the method and the method gets inlined, do we have separate statics for each inlining?

Apparently, the answer is no; the static variable is not monomorphized and remains a single instance shared across all inlinings.

Source: StackOverflow answer

Oct 07 '24 14:10 michaelvanstraten

In my understanding, if we allocate a object in a dll on Windows, we should also destroy it when the dll is unloaded. If we create this object by thread_local and share it between threads and modules, and we dynamic load and unload dlls, there may be problems.

Oct 08 '24 02:10 owent

If we create this object by thread_local and share it between threads and modules, and we dynamic load and unload dlls, there may be problems.

Can you suggest an alternative?

Oct 09 '24 08:10 michaelvanstraten

I wouldn't necessarily call it an "issue," but it seems suboptimal for GetSpan to allocate a new span each time a default span is needed. Ideally, it would return a statically allocated span when no active span is found in the current context. Are there any potential downsides to returning a statically allocated span as the default? I'm currently working on integrating OpenTelemetry into parts of Gecko (Firefox), and depending on the level of instrumentation, this could potentially impact performance.

@michaelvanstraten Sorry for bringing this up again. Could you please clarify in which specific hot-path scenario this method is becoming a bottleneck? While I agree that returning a statically allocated default span would be ideal, I believe it's important to avoid prematurely fixing this as the potential bottleneck without more context about the impact. Specifically more, when there are changes at the API code.

Oct 09 '24 16:10 lalitb

If we create this object by thread_local and share it between threads and modules, and we dynamic load and unload dlls, there may be problems.

Can you suggest an alternative?

I have no idea about this problem. In my understanding, copy the instance may be a better solution if the cost of copy is not bottleneck.

Oct 10 '24 13:10 owent

In my understanding, if we allocate a object in a dll on Windows, we should also destroy it when the dll is unloaded. If we create this object by thread_local and share it between threads and modules, and we dynamic load and unload dlls, there may be problems.

This should not pose an issue because the object in the thread_local storage is wrapped in a shared_ptr. Even if the DLL is unloaded, as long as there are still references to the object elsewhere (via shared_ptr), the object will not be deallocated.

Oct 18 '24 10:10 michaelvanstraten

Could you please clarify in which specific hot-path scenario this method is becoming a bottleneck? While I agree that returning a statically allocated default span would be ideal, I believe it's important to avoid prematurely fixing this as the potential bottleneck without more context about the impact. Specifically more, when there are changes at the API code.

@lalitb I think that @marcalff summarized the issue well in his https://github.com/open-telemetry/opentelemetry-cpp/pull/3037#issuecomment-2380002451.

Oct 18 '24 10:10 michaelvanstraten

In my understanding, if we allocate a object in a dll on Windows, we should also destroy it when the dll is unloaded. If we create this object by thread_local and share it between threads and modules, and we dynamic load and unload dlls, there may be problems.

This should not pose an issue because the object in the thread_local storage is wrapped in a shared_ptr. Even if the DLL is unloaded, as long as there are still references to the object elsewhere (via shared_ptr), the object will not be deallocated.

The reference counter will not decrease but the memory segments may be unload according to PE ABI on Windows.

Oct 22 '24 06:10 owent

In my understanding, if we allocate a object in a dll on Windows, we should also destroy it when the dll is unloaded. If we create this object by thread_local and share it between threads and modules, and we dynamic load and unload dlls, there may be problems.

This should not pose an issue because the object in the thread_local storage is wrapped in a shared_ptr. Even if the DLL is unloaded, as long as there are still references to the object elsewhere (via shared_ptr), the object will not be deallocated.

The reference counter will not decrease but the memory segments may be unload according to PE ABI on Windows.

Do you mean the stack memory containing the thread local containing a pointer?

Oct 22 '24 14:10 michaelvanstraten

In my understanding, if we allocate a object in a dll on Windows, we should also destroy it when the dll is unloaded. If we create this object by thread_local and share it between threads and modules, and we dynamic load and unload dlls, there may be problems.

This should not pose an issue because the object in the thread_local storage is wrapped in a shared_ptr. Even if the DLL is unloaded, as long as there are still references to the object elsewhere (via shared_ptr), the object will not be deallocated.

The reference counter will not decrease but the memory segments may be unload according to PE ABI on Windows.

Do you mean the stack memory containing the thread local containing a pointer?

It's not related to thread local storage. It's related to the address pointer being under another heap management, which may be destroyed when unloading DLLs.

Oct 24 '24 07:10 owent

In my understanding, if we allocate a object in a dll on Windows, we should also destroy it when the dll is unloaded. If we create this object by thread_local and share it between threads and modules, and we dynamic load and unload dlls, there may be problems.

This should not pose an issue because the object in the thread_local storage is wrapped in a shared_ptr. Even if the DLL is unloaded, as long as there are still references to the object elsewhere (via shared_ptr), the object will not be deallocated.

The reference counter will not decrease but the memory segments may be unload according to PE ABI on Windows.

Do you mean the stack memory containing the thread local containing a pointer?

It's not related to thread local storage. It's related to the address pointer being under another heap management, which may be destroyed when unloading DLLs.

Wouldn't that be a general problem, since in the other case we also allocate within the DLL. Could we make this a lazy statci?

Nov 01 '24 19:11 michaelvanstraten

@lalitb I think that @marcalff summarized the issue well in his https://github.com/open-telemetry/opentelemetry-cpp/pull/3037#issuecomment-2380002451.

The summarization assumes that GetSpan() is called while creating spans. Which is not entirely correct. Again, @michaelvanstraten please provide the exact scenario where you see performance overhead because of GetSpan() getting called in the hot path. The place I know is while using trace propagators, and user can always create custom propagators to avoid that.

Nov 01 '24 20:11 lalitb

In my understanding, if we allocate a object in a dll on Windows, we should also destroy it when the dll is unloaded. If we create this object by thread_local and share it between threads and modules, and we dynamic load and unload dlls, there may be problems.

This should not pose an issue because the object in the thread_local storage is wrapped in a shared_ptr. Even if the DLL is unloaded, as long as there are still references to the object elsewhere (via shared_ptr), the object will not be deallocated.

The reference counter will not decrease but the memory segments may be unload according to PE ABI on Windows.

Do you mean the stack memory containing the thread local containing a pointer?

It's not related to thread local storage. It's related to the address pointer being under another heap management, which may be destroyed when unloading DLLs.

Wouldn't that be a general problem, since in the other case we also allocate within the DLL. Could we make this a lazy statci?

Most modules free objects in the same modules where it's created. The old implementation copy the memory and won't make this "shared" object be free in another module, but if we use static or thread_local here, it may happens.

Nov 04 '24 16:11 owent

@michaelvanstraten , All,

To summarize the state of this discussion:

There are some remaining questions, on the use case itself, to clarify whether there is an issue on a main code path (affecting a lot of users), or on a degraded test cases (with less impact then, and possible mitigations). This is mostly the comments from @lalitb
There are some concerns about whether the fix, as written, actually improve things, or if there is a risk of performance degradation because of scaling issues when using a shared pointer. This is mostly my own (@marcalff) comments.
There are also some concerns on how the patch can be changed to alleviate this, using thread local storage, per thread singletons, all this in a header only include file. This area is extremely sensitive in the API, and there is no proposed solution as of now, just a discussion.

Overall, my personal opinion, shared with also other maintainers, is that it is better to put this issue to rest.

Optimizing the opentelemetry-cpp code for performances is a very valuable goal. There are very likely other areas that can be improved, possibly giving better return in term of performances, and at a much lesser cost in term of development effort or risk, compared to touching GetSpan() in the API, even when it appears to be trivial with 4 lines.

Having this discussion was valuable, but I don't think the outcome can result in a fix that can be merged, therefore this PR will now be closed.

What is direly missing in opentelemetry-cpp is performance benchmark to start with, to help making fact based decisions based on measurements, instead of making assumptions about which part of the code is responsible for overhead.

https://wiki.c2.com/?PrematureOptimization

Nov 05 '24 23:11 marcalff

opentelemetry-cpp opentelemetry-cpp copied to clipboard

Make GetSpan return a statically allocated invalid span

Codecov Report

Use case

Instrumented application

Non instrumented application

Code change

✅ Deploy Preview for opentelemetry-cpp-api-docs canceled.

opentelemetry-cpp
opentelemetry-cpp copied to clipboard

Deploy Preview for opentelemetry-cpp-api-docs canceled.