opentelemetry-cpp-contrib icon indicating copy to clipboard operation
opentelemetry-cpp-contrib copied to clipboard

Segfault in GetTraceContext with named @healthcheck location

Open sfc-gh-jallie opened this issue 3 years ago • 2 comments

I am experiencing a reproducible crash in nginx workers when opentelemetry_propagate is set, and a @healthcheck named location is defined.

Removing either from the config avoids the crash.

On a fresh build from head, I've managed to catch the crash in gdb, with the following stack

#0  0x00007ffff46e862c in std::_Hashtable<ngx_http_request_s*, std::pair<ngx_http_request_s* const, TraceContext*>, std::allocator<std::pair<ngx_http_request_s* const, TraceContext*> >, std::__detail::_Select1st, std::equal_to<ngx_http_request_s*>, std::hash<ngx_http_request_s*>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_bucket_index (this=0x0, __k=@0x7fffffffd8e8: 0x555555940d80, __c=93824996347264) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/hashtable.h:643
#1  0x00007ffff46e2288 in std::_Hashtable<ngx_http_request_s*, std::pair<ngx_http_request_s* const, TraceContext*>, std::allocator<std::pair<ngx_http_request_s* const, TraceContext*> >, std::__detail::_Select1st, std::equal_to<ngx_http_request_s*>, std::hash<ngx_http_request_s*>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::find (this=0x0, __k=@0x7fffffffd8e8: 0x555555940d80) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/hashtable.h:1440
#2  0x00007ffff46dca09 in std::unordered_map<ngx_http_request_s*, TraceContext*, std::hash<ngx_http_request_s*>, std::equal_to<ngx_http_request_s*>, std::allocator<std::pair<ngx_http_request_s* const, TraceContext*> > >::find (this=0x0, __x=@0x7fffffffd8e8: 0x555555940d80)
    at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/unordered_map.h:921
#3  0x00007ffff46d07c5 in GetTraceContext (req=0x555555940d80)
    at /tmp/tmp.cDN5gHeZCb/.build/opentelemetry-nginx-prefix/src/opentelemetry-nginx/instrumentation/nginx/src/otel_ngx_module.cpp:182
#4  0x00007ffff46d08d9 in OtelGetTraceContextVar (req=0x555555940d80, v=0x555555941bb0, data=93824997472152)
    at /tmp/tmp.cDN5gHeZCb/.build/opentelemetry-nginx-prefix/src/opentelemetry-nginx/instrumentation/nginx/src/otel_ngx_module.cpp:202
#5  0x00005555555da1b1 in ngx_http_get_indexed_variable ()
#6  0x00005555555db1a1 in ngx_http_script_copy_var_len_code ()
#7  0x0000555555624dcd in ngx_http_proxy_create_request ()
#8  0x00005555555e46a3 in ngx_http_upstream_init_request ()
#9  0x00005555555d76e2 in ngx_http_read_client_request_body ()
#10 0x00005555556274fe in ngx_http_proxy_handler ()
#11 0x00005555556569d2 in ngx_http_upstream_hc_handler ()
#12 0x00005555555aab5d in ngx_event_expire_timers ()
#13 0x00005555555aa6b3 in ngx_process_events_and_timers ()
#14 0x00005555555b2471 in ngx_worker_process_cycle ()
#15 0x00005555555b0cdb in ngx_spawn_process ()
#16 0x00005555555b2860 in ngx_start_worker_processes ()
#17 0x00005555555b3243 in ngx_master_process_cycle ()
#18 0x0000555555586edd in main ()

Looking at the context in frame #3

(gdb) frame 3
#3  0x00007ffff46d07c5 in GetTraceContext (req=0x555555940d80)
    at /tmp/tmp.cDN5gHeZCb/.build/opentelemetry-nginx-prefix/src/opentelemetry-nginx/instrumentation/nginx/src/otel_ngx_module.cpp:182
182     in /tmp/tmp.cDN5gHeZCb/.build/opentelemetry-nginx-prefix/src/opentelemetry-nginx/instrumentation/nginx/src/otel_ngx_module.cpp
(gdb) info locals
val = 0x555555941a30
map = 0x0
it = {<std::__detail::_Node_iterator_base<std::pair<ngx_http_request_s* const, TraceContext*>, false>> = {_M_cur = 0x7fffffffddb8}, <No data fields>}

Looking at the calling code in otel_ngx_module.cpp

 std::unordered_map<ngx_http_request_t*, TraceContext*>* map = (std::unordered_map<ngx_http_request_t*, TraceContext*>*)val->data;
auto it = map->find(req);

It seems like the map has been deleted? (or never initialized?)

I'm not overly familiar with the nginx memory management API, and I'm not entirely certain what the 'right' fix is here. It would be relatively easy to paper over this particular error with a null check on the map, but it's not clear to me if there is some more fundamental problem that deserves a better fix

sfc-gh-jallie avatar Apr 01 '22 18:04 sfc-gh-jallie

@sfc-gh-jallie I suggest you share an nginx.conf that will reproduce the problem. With that in mind though, it looks like you are hitting #105 and #134.

rcjsuen avatar Apr 01 '22 18:04 rcjsuen

@rcjsuen indeed, it seems very likely to be the same issue. I'll try to get a minimal repro nginx.conf

sfc-gh-jallie avatar Apr 01 '22 18:04 sfc-gh-jallie