opentelemetry-cpp

Crash in OTLP HTTP export

Open VivekSubr opened this issue 1 year ago • 6 comments

Describe your environment: Built and running on Linux, with this CMake configuration:

cmake .. -DCMAKE_INSTALL_RPATH_USE_LINK_PATH=ON -DCMAKE_VERBOSE_MAKEFILE=ON -DCMAKE_CXX_STANDARD=17 \
         -DWITH_STL=CXX17 -DBUILD_SHARED_LIBS=ON -DWITH_OTLP_HTTP=ON -DWITH_OTLP_GRPC=ON -DBUILD_TESTING=OFF

Protobuf version installed - 3.17.3

Steps to reproduce: No exact steps to reproduce; the crash happens intermittently.

Backtrace

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000001b83c7a7d in ?? ()
[Current thread is 1 (Thread 0x7810b784da00 (LWP 23))]
#0  0x00000001b83c7a7d in ?? ()
#1  0x00007ffc046f68b0 in ?? ()
#2  0x00007810b87a237d in google::protobuf::RepeatedPtrField<opentelemetry::proto::trace::v1::ResourceSpans>::~RepeatedPtrField() ()
   from /lib64/libopentelemetry_exporter_otlp_grpc.so
#3  0x00007810b87a174a in opentelemetry::proto::collector::trace::v1::ExportTraceServiceRequest::~ExportTraceServiceRequest() ()
   from /lib64/libopentelemetry_exporter_otlp_grpc.so
#4  0x00007810b87721ff in opentelemetry::v1::exporter::otlp::OtlpHttpExporter::Export(opentelemetry::v1::nostd::span<std::unique_ptr<opentelemetry::v1::sdk::trace::Recordable, std::default_delete<opentelemetry::v1::sdk::trace::Recordable> >, 18446744073709551615ul> const&) () from /lib64/libopentelemetry_exporter_otlp_http.so
#5  0x00007810ba07113b in opentelemetry::v1::sdk::trace::SimpleSpanProcessor::OnEnd (this=0x6299ed3d66a0, span=...)
    at /usr/include/opentelemetry/sdk/trace/simple_processor.h:51
#6  0x00007810b88cd9ba in opentelemetry::v1::sdk::trace::MultiSpanProcessor::OnEnd(std::unique_ptr<opentelemetry::v1::sdk::trace::Recordable, std::default_delete<opentelemetry::v1::sdk::trace::Recordable> >&&) () from /lib64/libopentelemetry_trace.so
#7  0x00007810b88d6654 in opentelemetry::v1::sdk::trace::Span::End(opentelemetry::v1::trace::EndSpanOptions const&) ()
   from /lib64/libopentelemetry_trace.so

Additional Info

The crash appears to occur while the arena object is destroyed in https://github.com/open-telemetry/opentelemetry-cpp/blob/main/exporters/otlp/src/otlp_http_exporter.cc#L102

It's not apparent why this might happen... any help will be appreciated.

VivekSubr avatar Jun 21 '24 06:06 VivekSubr

Which version of otel-cpp are you using, and do you enable async exporting? There was a thread-safety problem in the OTLP HTTP exporter before 1.10.0 when otel-cpp was built without async export (without -DENABLE_ASYNC_EXPORT or WITH_ASYNC_EXPORT_PREVIEW).

owent avatar Jun 25 '24 04:06 owent
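To illustrate in general terms what a thread-safety problem in a synchronous exporter looks like, here is a minimal sketch of serializing concurrent Export() calls behind a mutex. UnsafeExporter, SerializedExporter, and ExportFromManyThreads are hypothetical names invented for this example; this is not the opentelemetry-cpp code and not the fix that shipped in 1.10.0.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for an exporter whose Export() is not thread-safe:
// concurrent push_back calls on the same vector would be a data race.
class UnsafeExporter {
public:
  void Export(const std::string &span) { exported_.push_back(span); }
  std::size_t Count() const { return exported_.size(); }

private:
  std::vector<std::string> exported_;
};

// Hypothetical wrapper that lets only one thread at a time reach the exporter.
class SerializedExporter {
public:
  void Export(const std::string &span) {
    std::lock_guard<std::mutex> guard(mutex_);
    inner_.Export(span);
  }
  std::size_t Count() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return inner_.Count();
  }

private:
  mutable std::mutex mutex_;
  UnsafeExporter inner_;
};

// Exports `per_thread` spans from each of `threads` threads through the
// wrapper and returns how many spans the exporter received in total.
std::size_t ExportFromManyThreads(int threads, int per_thread) {
  SerializedExporter exporter;
  std::vector<std::thread> workers;
  for (int t = 0; t < threads; ++t) {
    workers.emplace_back([&exporter, per_thread] {
      for (int i = 0; i < per_thread; ++i) exporter.Export("span");
    });
  }
  for (auto &worker : workers) worker.join();
  return exporter.Count();
}
```

Calling ExportFromManyThreads(4, 1000) should count every span exactly once; without the lock, the same calls would race on the shared vector.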

@owent - 1.15, haven't enabled async exporting... is async export still in preview in 1.15?

VivekSubr avatar Jun 26 '24 06:06 VivekSubr

gRPC async exporting is still in preview.

owent avatar Jun 26 '24 11:06 owent

Does this problem happen when shutting down? Do you compile both otel-cpp and proto as dynamic libraries? Just wondering why the destructor of RepeatedPtrField<opentelemetry::proto::trace::v1::ResourceSpans> is in the gRPC exporter.

owent avatar Jun 29 '24 10:06 owent

It's the HTTP exporter, and proto is from yum install.

We're investigating if it's memory corruption from somewhere else.

VivekSubr avatar Jun 29 '24 11:06 VivekSubr

Do you mean protobuf? I reviewed the code and, as far as I understand, the messages and the arena never leave the scope of OtlpHttpExporter::Export.

owent avatar Jun 29 '24 17:06 owent

I found another crash in #2982, which occurs when using metrics and a timeout happens. Not sure if it is related to this one.

owent avatar Jul 01 '24 07:07 owent

Are there any solutions for this? I'm also facing this SIGSEGV.

Using OTEL v1.16.1, OTLP HTTP Exporter, Batch Processor.

msiddhu avatar Aug 08 '24 16:08 msiddhu

@msiddhu Are you getting this crash during application shutdown? If yes, does calling ForceFlush() before shutdown help?

lalitb avatar Aug 08 '24 16:08 lalitb
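The teardown order asked about above can be sketched with a small model: flush the pending spans first, then shut down, while everything the processor points to is still alive. BufferingProcessor and FlushThenShutdown are hypothetical names for this illustration; the method names only mirror the SDK's ForceFlush() and Shutdown(), and the discard-on-shutdown behavior is an assumption of the model, not a statement about opentelemetry-cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of a batching span processor; NOT the opentelemetry-cpp class.
class BufferingProcessor {
public:
  explicit BufferingProcessor(std::vector<std::string> *sink) : sink_(sink) {}

  // Queue a finished span; nothing reaches the sink until a flush.
  void OnEnd(std::string span) {
    if (!shut_down_) pending_.push_back(std::move(span));
  }

  // Export everything queued so far.
  void ForceFlush() {
    for (auto &span : pending_) sink_->push_back(std::move(span));
    pending_.clear();
  }

  // Assumption of this model: shutdown discards whatever was not flushed.
  void Shutdown() {
    pending_.clear();
    shut_down_ = true;
  }

private:
  std::vector<std::string> *sink_;
  std::vector<std::string> pending_;
  bool shut_down_ = false;
};

// Explicit teardown order: flush, then shut down, all before `sink` is destroyed.
std::size_t FlushThenShutdown(int span_count) {
  std::vector<std::string> sink;
  BufferingProcessor processor(&sink);
  for (int i = 0; i < span_count; ++i) processor.OnEnd("span");
  processor.ForceFlush();
  processor.Shutdown();
  return sink.size();
}
```

In a real application the equivalent step would be to call ForceFlush() and then Shutdown() explicitly before main() returns, rather than leaving teardown to static destruction order.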

@msiddhu Thanks for the separate confirmation.

Do you have more details, like a call stack?

Saying "it crashes for me too" gives us next to nothing to work with.

marcalff avatar Aug 08 '24 17:08 marcalff

The part which is really dubious is:

  • a bug report about OTLP HTTP
  • a call stack pointing to libopentelemetry_exporter_otlp_grpc.so

Is this about OTLP HTTP or OTLP gRPC? Was the application built with OTLP HTTP alone, OTLP gRPC alone, or both?

marcalff avatar Aug 08 '24 17:08 marcalff

@michalpristas Could you try the main branch or #2983? Some STL implementations of std::async may have bugs and crash occasionally; this PR replaces these APIs with more stable ones. We have not seen any more core dumps in our system for several days after this patch.

owent avatar Aug 10 '24 11:08 owent
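For readers unfamiliar with the idea, one common way to avoid std::async is to run the work on an explicit std::thread and pass the result back through a std::promise. The sketch below shows that pattern under stated assumptions: RunWithTimeout is a hypothetical helper written for this example, and it is not the code from #2983.

```cpp
#include <cassert>
#include <chrono>
#include <exception>
#include <functional>
#include <future>
#include <thread>
#include <utility>

// Hypothetical helper: runs `work` on a dedicated thread and waits up to
// `timeout` for it. Returns true and stores the value in `*result` only if
// the work finished within the timeout.
bool RunWithTimeout(std::function<int()> work,
                    std::chrono::milliseconds timeout, int *result) {
  std::promise<int> promise;
  std::future<int> future = promise.get_future();

  std::thread worker([p = std::move(promise), w = std::move(work)]() mutable {
    try {
      p.set_value(w());
    } catch (...) {
      p.set_exception(std::current_exception());
    }
  });

  const bool finished = future.wait_for(timeout) == std::future_status::ready;

  // Always join before returning, so the thread never outlives its captures.
  // Note that this still blocks until a slow `work` completes.
  worker.join();

  if (finished) *result = future.get();  // rethrows if `work` threw
  return finished;
}
```

The design choice here is that thread lifetime is explicit: the caller decides when to join, instead of relying on the blocking destructor of the future returned by std::async.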

We have not observed this crash after removing the patch mentioned in https://github.com/open-telemetry/opentelemetry-cpp/issues/2382

The build failure ultimately boiled down to someone having done #define U in another library.

VivekSubr avatar Aug 10 '24 12:08 VivekSubr
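A single-letter macro such as U breaks any later header that uses U as an ordinary identifier, for example as a template parameter name. As a minimal sketch of one way to contain such a macro (not something proposed in this thread), the push_macro and pop_macro pragmas supported by GCC, Clang, and MSVC can hide it around the affected declarations. Add and ForeignValue are hypothetical names for this example.

```cpp
#include <cassert>

// Simulates another library leaking a single-letter macro into later code.
#define U 42

// Save the foreign definition, then hide it while declaring templates that
// use U as an identifier. Without the #undef, `typename U` would expand to
// `typename 42` and fail to compile.
#pragma push_macro("U")
#undef U

template <typename T, typename U>
auto Add(T lhs, U rhs) -> decltype(lhs + rhs) {
  return lhs + rhs;
}

#pragma pop_macro("U")  // restore the macro for code that relies on it

// After pop_macro, U is the foreign macro again.
int ForeignValue() { return U; }
```

In practice the pragmas would wrap the #include of the affected headers; the cleaner long-term fix is to remove or rename the macro in the library that defines it.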