Frequent Coredumps in NGINX
Describe your environment
Ubuntu 18.04
NGINX 1.21.6 (built using a modified version of https://launchpad.net/~ondrej/+archive/ubuntu/nginx-mainline -- NGINX backported to Ubuntu 18)
gRPC v1.36.4
OpenTelemetry-cpp 1.4.0
OpenTelemetry-cpp-contrib main
See https://gist.github.com/dgolombek/0800c77d4a4dd17369fba13ae49d2cd2 for the Dockerfile used to build the nginx plugin shared library (taken from NGINX's build scripts, but customized slightly).
Steps to reproduce
We have OpenTelemetry enabled on one route:
opentelemetry on;
opentelemetry_config /etc/proxy/otel-nginx.toml;
opentelemetry_propagate b3;
opentelemetry_operation_name upstream1;
access_log /var/log/nginx/upstream_logging.log upstream_b3_logging;
What is the expected behavior?
OpenTelemetry headers are added and traces are emitted to collector.
What is the actual behavior?
After a variable number of requests, NGINX core dumps. I haven't isolated the exact traffic that causes the crash; this happens in a low-volume environment (about 1 request/second) but with a lot of different callers.
# gdb /usr/sbin/nginx CoreDump
GNU gdb (Ubuntu 8.1.1-0ubuntu1) 8.1.1
This GDB was configured as "x86_64-linux-gnu".
Reading symbols from /usr/sbin/nginx...(no debugging symbols found)...done.
[New LWP 28612]
[New LWP 28629]
[New LWP 28625]
[New LWP 28630]
[New LWP 28626]
[New LWP 28627]
[New LWP 28633]
[New LWP 28628]
[New LWP 28634]
[New LWP 28631]
[New LWP 28635]
[New LWP 28632]
[New LWP 28637]
[New LWP 28636]
[New LWP 28679]
[New LWP 28680]
[New LWP 28676]
[New LWP 28681]
[New LWP 28677]
[New LWP 28682]
[New LWP 28678]
[New LWP 28683]
[New LWP 28691]
[New LWP 28684]
[New LWP 28692]
[New LWP 28685]
[New LWP 28693]
[New LWP 28686]
[New LWP 28687]
[New LWP 28694]
[New LWP 28688]
[New LWP 28698]
[New LWP 28689]
[New LWP 28690]
[New LWP 28708]
[New LWP 28695]
[New LWP 28696]
[New LWP 28697]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `nginx: worker process '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f08b9a35530 in std::_Hashtable<ngx_http_request_s*, std::pair<ngx_http_request_s* const, TraceContext*>, std::allocator<std::pair<ngx_http_request_s* const, TraceContext*> >, std::__detail::_Select1st, std::equal_to<ngx_http_request_s*>, std::hash<ngx_http_request_s*>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_bucket_index (this=0x0, __k=@0x7ffd2b9ba058: 0x555f852e7da0, __c=93868744670624) at /usr/include/c++/7/bits/hashtable.h:631
631 { return __hash_code_base::_M_bucket_index(__k, __c, _M_bucket_count); }
[Current thread is 1 (Thread 0x7f08c40f3740 (LWP 28612))]
(gdb) bt
#0 0x00007f08b9a35530 in std::_Hashtable<ngx_http_request_s*, std::pair<ngx_http_request_s* const, TraceContext*>, std::allocator<std::pair<ngx_http_request_s* const, TraceContext*> >, std::__detail::_Select1st, std::equal_to<ngx_http_request_s*>, std::hash<ngx_http_request_s*>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_bucket_index (this=0x0, __k=@0x7ffd2b9ba058: 0x555f852e7da0, __c=93868744670624) at /usr/include/c++/7/bits/hashtable.h:631
#1 0x00007f08b9a2cd57 in std::_Hashtable<ngx_http_request_s*, std::pair<ngx_http_request_s* const, TraceContext*>, std::allocator<std::pair<ngx_http_request_s* const, TraceContext*> >, std::__detail::_Select1st, std::equal_to<ngx_http_request_s*>, std::hash<ngx_http_request_s*>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::find (this=0x0, __k=@0x7ffd2b9ba058: 0x555f852e7da0) at /usr/include/c++/7/bits/hashtable.h:1424
#2 0x00007f08b9a2620f in std::unordered_map<ngx_http_request_s*, TraceContext*, std::hash<ngx_http_request_s*>, std::equal_to<ngx_http_request_s*>, std::allocator<std::pair<ngx_http_request_s* const, TraceContext*> > >::find (this=0x0, __x=@0x7ffd2b9ba058: 0x555f852e7da0) at /usr/include/c++/7/bits/unordered_map.h:920
#3 0x00007f08b9a18769 in GetTraceContext (req=0x555f852e7da0) at /build/opentelemetry-cpp-contrib/instrumentation/nginx/src/otel_ngx_module.cpp:198
#4 0x00007f08b9a19ebe in FinishNgxSpan (req=0x555f852e7da0) at /build/opentelemetry-cpp-contrib/instrumentation/nginx/src/otel_ngx_module.cpp:525
#5 0x0000555f82c905e5 in ?? ()
#6 0x0000555f82c92404 in ngx_http_free_request ()
#7 0x0000555f82c936ef in ?? ()
#8 0x0000555f82c8db34 in ngx_http_core_find_config_phase ()
#9 0x0000555f82c886ed in ngx_http_core_run_phases ()
#10 0x0000555f82c94d2e in ngx_http_process_request ()
#11 0x0000555f82c951d8 in ?? ()
#12 0x0000555f82c957dc in ?? ()
#13 0x0000555f82c7820f in ?? ()
#14 0x0000555f82c6c906 in ngx_process_events_and_timers ()
#15 0x0000555f82c75f15 in ?? ()
#16 0x0000555f82c74741 in ngx_spawn_process ()
#17 0x0000555f82c76204 in ?? ()
#18 0x0000555f82c76f9e in ngx_master_process_cycle ()
#19 0x0000555f82c49569 in main ()
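Frame #0 shows this=0x0: find() is being invoked through a null std::unordered_map pointer, so the hashtable dereferences a null this when it computes the bucket index (_M_bucket_count). Below is a minimal standalone sketch of that failure mode and of the null guard that would avoid it. It only mirrors the types in the backtrace; the ContextMap alias and LookupTraceContext helper are hypothetical and are not the module's actual code.

// Sketch only: calling find() through a null map pointer is undefined behavior
// and typically segfaults exactly as in frame #0 above.
#include <unordered_map>

struct ngx_http_request_s;  // opaque stand-in for the nginx request type
struct TraceContext;        // opaque stand-in for the module's per-request context

using ContextMap = std::unordered_map<ngx_http_request_s*, TraceContext*>;

TraceContext* LookupTraceContext(ContextMap* map, ngx_http_request_s* req) {
  if (map == nullptr) {  // guard that would prevent the SIGSEGV shown above
    return nullptr;
  }
  auto it = map->find(req);
  return it != map->end() ? it->second : nullptr;
}

int main() {
  ContextMap* map = nullptr;  // whatever GetTraceContext dereferences is null here
  ngx_http_request_s* req = nullptr;
  // Without the guard, map->find(req) with map == nullptr reproduces the crash.
  return LookupTraceContext(map, req) == nullptr ? 0 : 1;
}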
I've seen other backtraces too, though, which suggests a use-after-free issue. When I saw this previously (with a Release build), the second-to-last stack frame was
0x00007f2fe24df076 in opentelemetry::v1::exporter::otlp::OtlpGrpcExporter::Export(opentelemetry::v1::nostd::span<std::unique_ptr<opentelemetry::v1::sdk::trace::Recordable, std::default_delete<opentelemetry::v1::sdk::trace::Recordable> >, 18446744073709551615ul> const&) () from /usr/share/nginx/modules/otel_ngx_module.so
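If the use-after-free guess is right, the usual hazard with an asynchronous exporter is ownership: anything recorded on a span must be copied out of nginx's request pool, because ngx_http_free_request releases that memory while a batch processor may still be exporting on another thread. The sketch below is a generic, hypothetical illustration of the safe pattern (RecordedSpan and the strings are made up for illustration and are not taken from the module):

// Copy request-scoped data into owning storage before it reaches another thread;
// a std::string_view into a request-pool buffer would dangle after the request is freed.
#include <string>
#include <thread>
#include <vector>

struct RecordedSpan {
  std::string operation_name;  // owning copy, stays valid after the request is gone
};

int main() {
  std::vector<RecordedSpan> batch;
  {
    // Pretend this buffer lives in the nginx request pool.
    std::string request_scoped = "upstream1";
    batch.push_back(RecordedSpan{request_scoped});  // copy, not a view
  }  // "request" freed here, analogous to ngx_http_free_request
  // Exporter thread runs after the request is gone; the copy keeps the data valid.
  std::thread exporter([&batch] {
    for (const auto& span : batch) { (void)span.operation_name.size(); }
  });
  exporter.join();
  return 0;
}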
Unfortunately I lost some of the other stack traces I saw (built with RelWithDebInfo, which was still missing a lot of frames). My NGINX is built with --with-debug, so I'm not sure why the NGINX frames don't have symbols -- possibly the build is missing -g.
This happens every time I enable opentelemetry in NGINX, but how long it takes varies: once it was 10 seconds, another time 30 minutes. I can't leave it on for long, since it causes requests to be dropped (sometimes the same one repeatedly), breaking other testing.
Additional context
I can try to run nginx with error logs at debug level, but the logs grow very large very quickly, and it sometimes takes 15-30 minutes for a core dump to occur.
Are you using internal requests? From my testing that is still broken. See https://github.com/open-telemetry/opentelemetry-cpp-contrib/issues/105#issuecomment-1126002534.
No, I don't use any internal blocks currently. I saw that issue, but it doesn't seem to match what I'm running into.