Nginx worker thread dies in internal requests

Open rcjsuen opened this issue 2 years ago • 13 comments

Describe your environment

  • OpenTelemetry C++ 1.2.0
  • OpenTelemetry C++ Contrib at 3808bc3be386fbcc58d39cd858fb375d7e1fafa3
  • Nginx 1.19.9

I haven't narrowed down how to reproduce the issue yet, but I'm getting a segfault in the code that was added in #94. It looks like the insert call on line 402 is bad. :(

https://github.com/open-telemetry/opentelemetry-cpp-contrib/blob/3808bc3be386fbcc58d39cd858fb375d7e1fafa3/instrumentation/nginx/src/otel_ngx_module.cpp#L384-L404

#0  0x00007f7aa2425d6a in std::_Hashtable<ngx_http_request_s*, std::pair<ngx_http_request_s* const, TraceContext*>, std::allocator<std::pair<ngx_http_request_s* const, TraceContext*> >, std::__detail::_Select1st, std::equal_to<ngx_http_request_s*>, std::hash<ngx_http_request_s*>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_bucket_index (this=0x0, __k=@0x7f7a968c5cb8: 0x7f7a9e3aaa30, __c=140164617382448) at /usr/include/c++/10.3.1/bits/hashtable.h:714
#1  0x00007f7aa2425f81 in std::_Hashtable<ngx_http_request_s*, std::pair<ngx_http_request_s* const, TraceContext*>, std::allocator<std::pair<ngx_http_request_s* const, TraceContext*> >, std::__detail::_Select1st, std::equal_to<ngx_http_request_s*>, std::hash<ngx_http_request_s*>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_emplace<std::pair<ngx_http_request_s* const, TraceContext*> > (this=0x0) at /usr/include/c++/10.3.1/bits/hashtable.h:1711
#2  0x00007f7aa241eefa in std::__detail::_Insert<ngx_http_request_s*, std::pair<ngx_http_request_s* const, TraceContext*>, std::allocator<std::pair<ngx_http_request_s* const, TraceContext*> >, std::__detail::_Select1st, std::equal_to<ngx_http_request_s*>, std::hash<ngx_http_request_s*>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true>, false>::insert<std::pair<ngx_http_request_s* const, TraceContext*>, void> (this=0x0, __v=...) at /usr/include/c++/10.3.1/bits/hashtable_policy.h:1021
#3  0x00007f7aa2418cac in std::unordered_map<ngx_http_request_s*, TraceContext*, std::hash<ngx_http_request_s*>, std::equal_to<ngx_http_request_s*>, std::allocator<std::pair<ngx_http_request_s* const, TraceContext*> > >::insert (this=0x0, __x=...) at /usr/include/c++/10.3.1/bits/unordered_map.h:587
#4  0x00007f7aa240c6cb in CreateTraceContext (req=0x7f7a9e3aaa30, val=0x7f7a968ee240) at /opentelemetry-cpp-contrib/instrumentation/nginx/src/otel_ngx_module.cpp:402
#5  0x00007f7aa240c7c5 in StartNgxSpan (req=0x7f7a9e3aaa30) at /opentelemetry-cpp-contrib/instrumentation/nginx/src/otel_ngx_module.cpp:418
#6  0x000055ed6f3beac5 in ngx_http_core_rewrite_phase (r=0x7f7a9e3aaa30, ph=0x7f7a9e6a3088) at src/http/ngx_http_core_module.c:922
#7  0x000055ed6f3b9d49 in ngx_http_core_run_phases (r=0x7f7a9e3aaa30) at src/http/ngx_http_core_module.c:868
#8  0x000055ed6f3b9e1a in ngx_http_handler (r=<optimized out>) at src/http/ngx_http_core_module.c:851
#9  0x000055ed6f3c3a2c in ngx_http_run_posted_requests (c=c@entry=0x7f7a97434298) at src/http/ngx_http_request.c:2435
#10 0x000055ed6f3c6ebe in ngx_http_process_request_headers (rev=rev@entry=0x7f7a972b2670) at src/http/ngx_http_request.c:1529
#11 0x000055ed6f3c72b2 in ngx_http_process_request_line (rev=rev@entry=0x7f7a972b2670) at src/http/ngx_http_request.c:1175
#12 0x000055ed6f3c74e1 in ngx_http_wait_request_handler (rev=0x7f7a972b2670) at src/http/ngx_http_request.c:503
#13 0x000055ed6f3c78f9 in ngx_http_ssl_handshake_handler (c=0x7f7a97434298) at src/http/ngx_http_request.c:848
#14 0x000055ed6f3aeaa4 in ngx_ssl_handshake_handler (ev=0x7f7a972b2670) at src/event/ngx_event_openssl.c:2095
#15 0x000055ed6f3a9103 in ngx_epoll_process_events (cycle=0x7f7aa2cc8580, timer=<optimized out>, flags=<optimized out>) at src/event/modules/ngx_epoll_module.c:901
#16 0x000055ed6f39c140 in ngx_process_events_and_timers (cycle=cycle@entry=0x7f7aa2cc8580) at src/event/ngx_event.c:257
#17 0x000055ed6f3a67c8 in ngx_worker_process_cycle (cycle=0x7f7aa2cc8580, data=<optimized out>) at src/os/unix/ngx_process_cycle.c:753
#18 0x000055ed6f3a4850 in ngx_spawn_process (cycle=cycle@entry=0x7f7aa2cc8580, proc=proc@entry=0x55ed6f3a66bf <ngx_worker_process_cycle>, data=data@entry=0x0, name=name@entry=0x55ed6f4cbf97 "worker process",
    respawn=respawn@entry=-4) at src/os/unix/ngx_process.c:199
#19 0x000055ed6f3a549e in ngx_start_worker_processes (cycle=cycle@entry=0x7f7aa2cc8580, n=8, type=type@entry=-4) at src/os/unix/ngx_process_cycle.c:373
#20 0x000055ed6f3a7369 in ngx_master_process_cycle (cycle=0x7f7aa2cc8580, cycle@entry=0x7f7aa2ccd210) at src/os/unix/ngx_process_cycle.c:234
#21 0x000055ed6f379af9 in main (argc=<optimized out>, argv=<optimized out>) at src/core/nginx.c:386

rcjsuen avatar Feb 11 '22 12:02 rcjsuen

Since this=0x0, it seems like this is a subrequest whose parent request does not have an unordered_map allocated.
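Rough sketch of what the backtrace suggests is going wrong, using hypothetical names (ContextMap, CreateTraceContextSketch) rather than the module's actual code: the per-request context map is reached through a pointer that can be null for subrequests, and calling insert() through that null pointer matches the this=0x0 in frame #3.

#include <unordered_map>

struct TraceContext;           // stand-in for the module's TraceContext
struct ngx_http_request_s;     // opaque nginx request type

using ContextMap = std::unordered_map<ngx_http_request_s*, TraceContext*>;

TraceContext* CreateTraceContextSketch(ngx_http_request_s* req,
                                       TraceContext* ctx,
                                       ContextMap* parentMap) {
  // If the parent request never went through StartNgxSpan, no map was ever
  // allocated for it and parentMap is null here.
  if (parentMap == nullptr) {
    return nullptr;            // the real fix needs to allocate or bail out
  }
  parentMap->insert({req, ctx});
  return ctx;
}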

rcjsuen avatar Feb 11 '22 14:02 rcjsuen

I've narrowed things down to an ngx.location.capture call inside an access_by_lua_block...

rcjsuen avatar Feb 11 '22 22:02 rcjsuen

@seemk I wonder if the proper fix is to use ngx_http_set_ctx like OpenTracing. What do you think?
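For reference, a minimal sketch of what the ngx_http_set_ctx approach could look like. ngx_http_set_ctx and ngx_http_get_module_ctx are real nginx macros; the otel_ngx_module variable name and the surrounding details are assumptions based on this repository's file names, not the actual implementation.

extern "C" {
#include <ngx_config.h>
#include <ngx_core.h>
#include <ngx_http.h>
}

extern ngx_module_t otel_ngx_module;  // the module object (assumed name)
struct TraceContext;                  // the module's per-request context

static TraceContext* GetTraceContextViaCtx(ngx_http_request_t* req) {
  // Returns nullptr when StartNgxSpan never ran for this (sub)request,
  // which callers such as FinishNgxSpan would have to tolerate.
  return static_cast<TraceContext*>(
      ngx_http_get_module_ctx(req, otel_ngx_module));
}

static void SetTraceContextViaCtx(ngx_http_request_t* req, TraceContext* ctx) {
  // The context is scoped to this request, so subrequests no longer need to
  // reach into a parent request's unordered_map.
  ngx_http_set_ctx(req, ctx, otel_ngx_module);
}

One thing to verify with this approach: nginx zeroes r->ctx on internal redirects and named locations, so later phases would still need to cope with a missing context.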

rcjsuen avatar Feb 12 '22 12:02 rcjsuen

This config will cause the worker thread to crash.

load_module        /etc/nginx/modules/otel_ngx_module.so;
worker_processes   auto;
error_log          /var/log/nginx/error.log debug;
working_directory  /tmp/cores/;
worker_rlimit_core 500M;

events {
}
http {
  opentelemetry_config /otel-nginx.toml;

  server {
    listen 80;
    server_name otel_example;

    location = / {
      content_by_lua_block {
        ngx.location.capture("/abc")
      }
    }
  }
}

rcjsuen avatar Feb 12 '22 17:02 rcjsuen

Here is a complete setup for reproducing the worker thread crash. The three files below are nginx.conf, otel-nginx.toml, and the Dockerfile.

nginx.conf:

load_module        /etc/nginx/modules/otel_ngx_module.so;
worker_processes   auto;
error_log          /var/log/nginx/error.log debug;
working_directory  /tmp/cores/;
worker_rlimit_core 500M;

events {
}
http {
  opentelemetry_config /otel-nginx.toml;

  server {
    listen 80;
    server_name otel_example;

    location = / {
      content_by_lua_block {
        ngx.location.capture("/abc")
      }
    }
  }
}

otel-nginx.toml:

exporter = "otlp"
processor = "simple"

[exporters.otlp]
host = "otel"
port = 4317

[service]
name = "nginx-test-crash"
# Using the SHA256 hash for reproducibility,
# it is the fabiocicerchia/nginx-lua:1.21.6-alpine3.15.0 tag
FROM fabiocicerchia/nginx-lua@sha256:5ca81313baa0d3d90fda9a8c48e6adc9028f64beeee4c31618a071dbeaed05cc
RUN apk update && apk add alpine-sdk build-base cmake linux-headers libressl-dev pcre-dev zlib-dev \
      grpc-dev curl-dev protobuf-dev c-ares-dev re2-dev gdb

ENV OPENTELEMETRY_VERSION v1.2.0

RUN cd / && git clone --shallow-submodules --depth 1 --recurse-submodules -b ${OPENTELEMETRY_VERSION} \
  https://github.com/open-telemetry/opentelemetry-cpp.git
RUN cd / && git clone https://github.com/open-telemetry/opentelemetry-cpp-contrib.git \
  && cd /opentelemetry-cpp-contrib \
  && git checkout 3808bc3be386fbcc58d39cd858fb375d7e1fafa3

RUN cd /opentelemetry-cpp \
  && mkdir build \
  && cd build \
  && cmake -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_INSTALL_PREFIX=/install \
    -DCMAKE_PREFIX_PATH=/install \
    -DWITH_OTLP=ON \
    -DWITH_OTLP_GRPC=ON \
    -DWITH_OTLP_HTTP=OFF \
    -DBUILD_TESTING=OFF \
    -DWITH_EXAMPLES=OFF \
    -DWITH_ABSEIL=ON \
    -DCMAKE_POSITION_INDEPENDENT_CODE=ON \
    .. \
  && make -j2 \
  && make install

RUN cd /opentelemetry-cpp-contrib/instrumentation/nginx \
  && mkdir build \
  && cd build \
  && cmake -DCMAKE_BUILD_TYPE=Release \
    -DNGINX_BIN=/usr/sbin/nginx \
    -DCMAKE_PREFIX_PATH=/install \
    -DCMAKE_INSTALL_PREFIX=/etc/nginx/modules \
    -DCURL_LIBRARY=/usr/lib/libcurl.so.4 \
    -DNGINX_VERSION=1.21.6 \
    .. \
  && make -j2 \
  && make install

  1. Download all of the above files as nginx.conf, otel-nginx.toml, and Dockerfile and put them all in the same folder.
  2. Build the Docker image.
docker build -t nginx:fabio-1.21.6-alpine3.15.0 .
  3. Run the Docker image in one terminal.
docker run \
    --name nginx-lua-crash \
    -v $(pwd):/tmp/cores \
    -v $(pwd)/nginx.conf:/etc/nginx/nginx.conf \
    -v $(pwd)/otel-nginx.toml:/otel-nginx.toml \
    -it \
    --rm \
    nginx:fabio-1.21.6-alpine3.15.0
  4. Exec into the Docker container in another terminal.
docker exec -it nginx-lua-crash /bin/sh
  5. Use curl to crash the worker thread.
curl localhost:80
  6. Use gdb and confirm the backtrace with bt.
gdb /usr/sbin/nginx /tmp/cores/core
  7. Edit the file and comment out the ngx.location.capture call on line 18.
vi /etc/nginx/nginx.conf
  8. Reload Nginx so that it will load the updated configuration file.
nginx -s reload
  9. The curl will now be successful.
curl localhost:80

rcjsuen avatar Feb 12 '22 21:02 rcjsuen

The same problem happens with internal requests. :(

load_module        /etc/nginx/modules/otel_ngx_module.so;
worker_processes   auto;
error_log          /var/log/nginx/error.log debug;
working_directory  /tmp/cores/;
worker_rlimit_core 500M;

events {
}
http {
  opentelemetry_config /otel-nginx.toml;

  server {
    listen 80;
    server_name otel_example;

    location = / {
      internal;
    }
  }
}

rcjsuen avatar Mar 03 '22 20:03 rcjsuen

From what I can tell, if you try to hit a path that is internal, only FinishNgxSpan is called but not StartNgxSpan, so it (unsurprisingly) crashes.
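A minimal sketch of the kind of guard that would avoid that crash: since an internal location can reach the log phase without StartNgxSpan ever running, FinishNgxSpan has to tolerate a missing context. The function names match the backtrace; the signatures and bodies below are assumptions, not the module's actual code.

extern "C" {
#include <ngx_config.h>
#include <ngx_core.h>
#include <ngx_http.h>
}

struct TraceContext;

// Declared elsewhere in the module; assumed to return nullptr when no
// context was ever created for this request.
TraceContext* GetTraceContext(ngx_http_request_t* req);

ngx_int_t FinishNgxSpan(ngx_http_request_t* req) {
  TraceContext* context = GetTraceContext(req);
  if (context == nullptr) {
    // No span was started for this request (e.g. an "internal;" location
    // rejected it before the rewrite phase), so there is nothing to finish.
    return NGX_OK;
  }
  // ... end the span stored in context and clean up ...
  return NGX_OK;
}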

rcjsuen avatar Mar 04 '22 15:03 rcjsuen

@rcjsuen PR #153 (fix(nginx): Fix segfault in GetTraceContext) has already been merged. I tested my issue #134 and it seems all right. Have you tested yours?

domyway avatar May 08 '22 09:05 domyway

Have you tested yours?

Not yet. I will try to take a look this week.

rcjsuen avatar May 09 '22 12:05 rcjsuen

@domyway I gave 2619b5e0edf56872d5883325a3079c13c25dceac a try. From some simple testing, it looks like the ngx.location.capture case is good but the internal case is not.

Core was generated by `nginx: worker process'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f2c1fd59f55 in GetTraceContext(ngx_http_request_s*) () from /etc/nginx/modules/otel_ngx_module.so
[Current thread is 1 (LWP 28)]
(gdb) bt
#0  0x00007f2c1fd59f55 in GetTraceContext(ngx_http_request_s*) () from /etc/nginx/modules/otel_ngx_module.so
#1  0x00007f2c1fd5ad58 in FinishNgxSpan(ngx_http_request_s*) () from /etc/nginx/modules/otel_ngx_module.so
#2  0x000055a135fa1f40 in ngx_http_log_request (r=r@entry=0x7f2c1f3023b0) at src/http/ngx_http_request.c:3753
#3  0x000055a135fa38a7 in ngx_http_free_request (r=r@entry=0x7f2c1f3023b0, rc=rc@entry=0) at src/http/ngx_http_request.c:3699
#4  0x000055a135fa4123 in ngx_http_set_keepalive (r=0x7f2c1f3023b0) at src/http/ngx_http_request.c:3122
#5  ngx_http_finalize_connection (r=r@entry=0x7f2c1f3023b0) at src/http/ngx_http_request.c:2774
#6  0x000055a135fa4b6a in ngx_http_finalize_connection (r=<optimized out>) at src/http/ngx_http_request.c:2662
#7  0x000055a135f9f5ed in ngx_http_core_find_config_phase (r=0x7f2c1f3023b0, ph=<optimized out>) at src/http/ngx_http_core_module.c:970
#8  0x000055a135f9a8ad in ngx_http_core_run_phases (r=0x7f2c1f3023b0) at src/http/ngx_http_core_module.c:875
#9  0x000055a135fa5e37 in ngx_http_process_request_headers (rev=<optimized out>) at src/http/ngx_http_request.c:1498
#10 0x000055a135fa6234 in ngx_http_process_request_line (rev=0x7f2c1f1eb740) at src/http/ngx_http_request.c:1165
#11 0x000055a135f8bc83 in ngx_epoll_process_events (cycle=<optimized out>, timer=<optimized out>, flags=1)
    at src/event/modules/ngx_epoll_module.c:901
#12 0x000055a135f816e3 in ngx_process_events_and_timers (cycle=cycle@entry=0x7f2c1fded220) at src/event/ngx_event.c:248
#13 0x000055a135f8980b in ngx_worker_process_cycle (cycle=cycle@entry=0x7f2c1fded220, data=data@entry=0x0)
    at src/os/unix/ngx_process_cycle.c:721
#14 0x000055a135f87e7b in ngx_spawn_process (cycle=cycle@entry=0x7f2c1fded220,
    proc=proc@entry=0x55a135f89790 <ngx_worker_process_cycle>, data=data@entry=0x0, name=name@entry=0x55a13609fa75 "worker process",
    respawn=respawn@entry=-3) at src/os/unix/ngx_process.c:199
#15 0x000055a135f88ccc in ngx_start_worker_processes (cycle=cycle@entry=0x7f2c1fded220, n=8, type=type@entry=-3)
    at src/os/unix/ngx_process_cycle.c:344
#16 0x000055a135f8a1de in ngx_master_process_cycle (cycle=0x7f2c1fded220) at src/os/unix/ngx_process_cycle.c:130
#17 0x000055a135f5ff6f in main (argc=<optimized out>, argv=<optimized out>) at src/core/nginx.c:383

rcjsuen avatar May 13 '22 12:05 rcjsuen

127.0.0.1 - - [13/May/2022:12:27:21 +0000] "GET /internal HTTP/1.1" 404 153 "-" "curl/7.80.0"
2022/05/13 12:27:22 [notice] 1#1: signal 17 (SIGCHLD) received from 32
2022/05/13 12:27:22 [alert] 1#1: worker process 32 exited on signal 11 (core dumped)
2022/05/13 12:27:22 [notice] 1#1: start worker process 65
2022/05/13 12:27:22 [notice] 1#1: signal 29 (SIGIO) received

rcjsuen avatar May 13 '22 12:05 rcjsuen

We also hit this issue with the following config:

location /analytics/ {
	opentelemetry_operation_name analytics;
	opentelemetry_propagate;
	add_header X-API-Correlation-Id $opentelemetry_trace_id always;
	
	proxy_pass http://nginx-2;
}

Sending a request to /analytics triggers a 301 response to /analytics/ as described in the docs:

If a location is defined by a prefix string that ends with the slash character, and requests are processed by one of proxy_pass, fastcgi_pass, uwsgi_pass, scgi_pass, memcached_pass, or grpc_pass, then the special processing is performed. In response to a request with URI equal to this string, but without the trailing slash, a permanent redirect with the code 301 will be returned to the requested URI with the slash appended.

Moreover, using the trace id variable in the add_header directive causes the request to fail completely with no response at all. Curl outputs the following:

curl http://localhost:8081/analytics -v
*   Trying 127.0.0.1:8081...
* Connected to localhost (127.0.0.1) port 8081 (#0)
> GET /analytics HTTP/1.1
> Host: localhost:8081
> User-Agent: curl/7.79.1
> Accept: */*
> 
* Empty reply from server
* Closing connection 0
curl: (52) Empty reply from server

Symbianx avatar Jul 13 '22 12:07 Symbianx

I was able to fix the reproduction use cases described here with #190.

Symbianx avatar Jul 19 '22 15:07 Symbianx