opentelemetry-cpp
opentelemetry-cpp copied to clipboard
Parent-Child discover error in examples/simple when built as release and shared library.
Describe your environment My system is Ubuntu 18.04.5 LTS. And the compiler is gcc 7.5.0
Steps to reproduce I am trying to use opentelemetry as a shared library for my project. But I found that when I try to build it as release and shared library, there is some error in Parent-Child discover. The problem can be reproduced by just running the examples/simple/example_simple. Build in Debug(default) in cmake or static library wont show this error.
I use the following code to build.
cmake .. -DCMAKE_POSITION_INDEPENDENT_CODE=ON -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=ON
cmake --build . --target all
What is the expected behavior? In example/simple the parent_span_id of f1 should be span_id of f2. And parent_span_id of f2 should be span_id of library. But both showing 0000000000000000.
What is the actual behavior?
{
name : f1
trace_id : cbbb2e3813a548c408de4b484dc723a1
span_id : ce90a9eaf3214e19
tracestate :
parent_span_id: 0000000000000000
start : 1634260358826620364
duration : 7665
description :
span kind : Internal
status : Unset
attributes :
events :
links :
resources :
service.name: unknown_service
telemetry.sdk.version: 1.0.0
telemetry.sdk.language: cpp
telemetry.sdk.name: opentelemetry
instr-lib : foo_library
}
{
name : f1
trace_id : 3dc7f1b4da6e2eedff960b136ab6e305
span_id : a145e9e7fca8f666
tracestate :
parent_span_id: 0000000000000000
start : 1634260358826785278
duration : 1634
description :
span kind : Internal
status : Unset
attributes :
events :
links :
resources :
service.name: unknown_service
telemetry.sdk.version: 1.0.0
telemetry.sdk.language: cpp
telemetry.sdk.name: opentelemetry
instr-lib : foo_library
}
{
name : f2
trace_id : 94d5abc1a0839e21a97ab9bb3564cbea
span_id : 8c0d473664eeeb48
tracestate :
parent_span_id: 0000000000000000
start : 1634260358826616048
duration : 296248
description :
span kind : Internal
status : Unset
attributes :
events :
links :
resources :
service.name: unknown_service
telemetry.sdk.version: 1.0.0
telemetry.sdk.language: cpp
telemetry.sdk.name: opentelemetry
instr-lib : foo_library
}
{
name : library
trace_id : 0f0bc2eebeb389b448502ac1f015ed1e
span_id : 72fdbadc8e352b5d
tracestate :
parent_span_id: 0000000000000000
start : 1634260358826597245
duration : 460802
description :
span kind : Internal
status : Unset
attributes :
events :
links :
resources :
service.name: unknown_service
telemetry.sdk.version: 1.0.0
telemetry.sdk.language: cpp
telemetry.sdk.name: opentelemetry
instr-lib : foo_library
}
Also the there is a test failed passing.
/home/xxx/src/opentelemetry-cpp/sdk/test/trace/tracer_test.cc:130: Failure
Value of: span2.at(0)->GetParentSpanId().IsValid()
Actual: false
Expected: true
/home/xxx/src/opentelemetry-cpp/sdk/test/trace/tracer_test.cc:142: Failure
Expected equality of these values:
span1.at(0)->GetTraceId()
Which is: 16-byte object <11-E7 89-39 C6-F1 C7-94 00-2A 0C-55 77-86 CC-F1>
span2.at(0)->GetTraceId()
Which is: 16-byte object <7A-F8 0A-45 00-73 E5-4C 09-3E D9-58 8D-D0 60-F3>
/home/chunhao.shen/src/opentelemetry-cpp/sdk/test/trace/tracer_test.cc:143: Failure
Expected equality of these values:
span2.at(0)->GetParentSpanId()
Which is: 8-byte object <00-00 00-00 00-00 00-00>
span1.at(0)->GetSpanId()
Which is: 8-byte object <4D-6D D3-1B 08-D4 0C-9F>
This is weird. It works fine with ubuntu 18.04 + gcc-4.8
and ubuntu 20.04 + gcc-9.3
in our github CI environment. I did a quick test on ubuntu 18.04 and got the same behavior as yours. Need to debug it further. Thanks for reporting the issue.
Note that the traceid is also different. Some guesses when compiler doing optimization, the thread local "Stack" in runtime_context behave funny. Just a thought.
@lalitb This is our patch:
--- a/api/include/opentelemetry/context/runtime_context.h
+++ b/api/include/opentelemetry/context/runtime_context.h
@@ -300,7 +300,13 @@ class ThreadLocalContextStorage : public RuntimeContextStorage
Context *base_;
};
- Stack &GetStack()
+ static Stack &GetStack() __attribute__((noinline, noclone))
{
static thread_local Stack stack_ = Stack();
return stack_;
It seems to be a bug in gcc that somehow lost track of the static storage property for stack_
when it does the ISRA_CLONE. Disable cloning seems work-around this, although it pays some performance cost. The additional static
is not a must, but since it does not need this
, it's better to be a static to save a little parameter passing. I understand direct use of attribute is not recommended here, just for your reference. Hope this can help a little.
Thanks, @cqhan777, @Pupilsch, this is helpful.
This issue was marked as stale due to lack of activity. It will be closed in 7 days if no furthur activity occurs.
Closed as inactive. Feel free to reopen if this is still an issue.
@cqhan777 @Pupilsch - Is this issue still relevant, and reproducible?
Problem still exists but no longer relevant to us since we have the workaround.
This issue was marked as stale due to lack of activity. It will be closed in 7 days if no furthur activity occurs.
Closed as inactive. Feel free to reopen if this is still an issue.
@Pupilsch Did you figure out how to solve the issue ? I have exactly the same problem on my environment :
- SUSE Linux Enterprise Server 12 SP5
- gcc 8.5.0
- C++ 17
But it's working fine on my Ubuntu 20.04 with gcc 9.4.0 (and also 8.4.0) and C++ 17
@Briac-nt If it is the same problem, the patch from @cqhan777 should work for you:
https://github.com/open-telemetry/opentelemetry-cpp/issues/1014#issuecomment-944821770
Please confirm.
Hi @lalitb , unfortunately this patch doesn't work for me, the trick in the 1223 issue neither.
I changed the Stack &GetStack()
by static Stack &GetStack() __attribute__((noinline, noclone))
and added a console log to be sure the modification was live. The problem is still here, very strange...
I think I have a similar issue, but more complex setup. I'm compiling otel as a static library, but then building a shared library out of that static lib. And it doesn't seem to always lose the stack, only sometimes. Not sure how to provide more info, it's super complex and I only see this happening in my production system
@Falmarri Thanks, can you also add the platform details here - (linux-distro, gcc version, build flags etc)
@lalitb This is built on Ubuntu 18.04.6 LTS
using
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/x86_64-pc-linux-gnu/10.5.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ./configure --disable-multilib --enable-default-pie
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 10.5.0 (GCC)
I'm not entirely sure this is the same issue now that I've looked a bit furhter. What I'm seeing is that the trace is actually being propagated correctly. That is, I do this
auto current_ctx = opentelemetry::context::RuntimeContext::GetCurrent();
propagator->Inject(carrier, current_ctx);
and the remote service correctly gets the right trace/span. So I actually do have the right context. However, the span in question is never actually exported, so that my span never shows up in my collector and the traces are pretty broken.
The reason I think it's related to this though is because it's always this 1 particular span, and this span is in a separate compilation unit from the rest of the shared lib. So how this works is that I have the following:
- statically built otel libraries
- libA.a <-- This includes the otel headers but does NOT link against the libs, and does not use the sdk
- libB.so <-- Links statically against libA.a AND also statically to otel libs. These traces work fine, but the trace defined in libA.a does get propagated, but not reported
I realize this isn't a lot of information but like I said this only happens in our production system and I can't reproduce it on demand or figure out if there's a pattern