Memory corruption when dispatch socket outlives request
I've been observing frequent segfaults when running the datakit filter on a kong/kong:nightly container (which uses ngx_wasm_module commit d5e4c698c51a6ef39551184f99eb4d1605f671d9) with a filter configuration that dispatches multiple HTTP calls. (I could not reproduce the problem with a configuration that makes a single HTTP dispatch.)
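For context, the filter behavior that triggers this boils down to "dispatch several HTTP calls from on_http_request_headers and pause the request". A minimal proxy-wasm Rust sketch of that shape (not the actual datakit code; the TwoCallsFilter name, upstreams and headers below are placeholders) looks like this, with the response side sketched a bit further down:

use proxy_wasm::traits::{Context, HttpContext};
use proxy_wasm::types::Action;
use std::time::Duration;

// entry point (proxy-wasm-rust-sdk 0.2.x style)
proxy_wasm::main! {{
    proxy_wasm::set_http_context(|_, _| -> Box<dyn HttpContext> {
        Box::new(TwoCallsFilter { pending: 0 })
    });
}}

struct TwoCallsFilter {
    // number of dispatch calls still waiting for a response
    pending: usize,
}

impl HttpContext for TwoCallsFilter {
    fn on_http_request_headers(&mut self, _nheaders: usize, _eof: bool) -> Action {
        // one dispatch per configured upstream ("call: run" in the logs below)
        for upstream in ["service_a", "service_b"] {
            self.dispatch_http_call(
                upstream,
                vec![(":method", "GET"), (":path", "/"), (":authority", "example.com")],
                None,
                vec![],
                Duration::from_secs(5),
            )
            .unwrap();
            self.pending += 1;
        }
        // keep the main request paused until the dispatch responses are handled
        Action::Pause
    }
}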
This was one of the segfaults:
I got others with different backtraces, which suggested an earlier memory corruption.
I installed Valgrind in the container and got a ton of memory errors, all of which seem to follow from the same cause as the first one:
ngx_wasm_socket_tcp_resume triggers the ngx_http_proxy_wasm_dispatch_resume_handler, but the request data has already been destroyed by ngx_http_free_request.
The invalid data in the Valgrind trace above is call->pwexec, where call comes from sock->data, which is the input argument of the function.
The deallocation at ngx_http_free_request is ngx_destroy_pool for r->pool.
One odd thing I noticed in the logs before the first Valgrind error is that I did get two call: run lines (which my filter triggers from on_http_request_headers), but only one call: resume line (which it triggers from on_http_call_response). The resume operation seems to be delayed until after ngx_http_free_request has already destroyed the main request's data, taking the pwexec context with it.
This is unexpected because in normal operation the filter should be paused waiting for the last dispatch to finish.
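The response side of the sketch above (again placeholder code, not the actual datakit logic) makes that explicit: the paused request is only resumed once the last outstanding dispatch has been handled.

impl Context for TwoCallsFilter {
    fn on_http_call_response(&mut self, _token: u32, _nheaders: usize,
                             _body_size: usize, _ntrailers: usize) {
        // "call: resume" in the logs: one fewer dispatch outstanding
        self.pending -= 1;
        if self.pending == 0 {
            // only once the last dispatch has completed should the
            // main request be allowed to continue
            self.resume_http_request();
        }
    }
}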
Normally, in my local build, I get this sequence of operations for two dispatch calls:
Data for first dispatch arrives:
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm tcp socket handler (wev: 0)
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm tcp socket receive handler for "127.0.0.1:6502"
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm tcp socket resuming
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm tcp socket trying to receive data (max: 1024)
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm tcp socket resuming http response reading with 243 bytes to parse
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm tcp socket resuming http response reading with 226 bytes to parse
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm reuse free buf memory 103 >= 69, cl:00005A8EFF7BA8D0, p:00005A8F001E1FB0
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm allocate new chainlink and new buf of size 69, cl: 00005A8EFFC30520, buf: 00005A8EFFC30560
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm tcp socket reading done
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm tcp socket closing
First dispatch resumes:
2024/04/09 19:45:56 [debug] 1600878#0: *2 [proxy-wasm]["datakit_filter" #1] filter 1/1 resuming "on_dispatch_response" step in "background" phase
2024/04/09 19:45:56 [debug] 1600878#0: *2 [proxy-wasm]["datakit_filter" #1] DataKitFilter: on http call response, id = 0
2024/04/09 19:45:56 [debug] 1600878#0: *2 [proxy-wasm]["datakit_filter" #1] call: resume
2024/04/09 19:45:56 [debug] 1600878#0: *2 proxy_wasm_alloc: 1376256:1343312:69
2024/04/09 19:45:56 [debug] 1600878#0: *2 proxy_wasm_alloc: 1376256:1343440:16
It PAUSEs at the end because there's another dispatch pending:
2024/04/09 19:45:56 [debug] 1600878#0: *2 proxy_wasm more http dispatch calls pending...
2024/04/09 19:45:56 [debug] 1600878#0: *2 [proxy-wasm] setting next action: pwctx->action = "PAUSE" (pwctx: 00005A8F014EDF00)
Data for second dispatch arrives:
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm tcp socket handler (wev: 0)
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm tcp socket receive handler for "127.0.0.1:8008"
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm tcp socket resuming
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm tcp socket trying to receive data (max: 1024)
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm tcp socket resuming http response reading with 200 bytes to parse
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm tcp socket resuming http response reading with 183 bytes to parse
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm allocate new chainlink and new buf of size 26, cl: 00005A8F01586240, buf: 00005A8EFF866290
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm allocate new chainlink and new buf of size 26, cl: 00005A8F0048BE80, buf: 00005A8F0048BEC0
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm tcp socket reading done
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm tcp socket closing
Second dispatch resumes:
2024/04/09 19:45:56 [debug] 1600878#0: *2 [proxy-wasm]["datakit_filter" #1] filter 1/1 resuming "on_dispatch_response" step in "background" phase
2024/04/09 19:45:56 [debug] 1600878#0: *2 [proxy-wasm]["datakit_filter" #1] DataKitFilter: on http call response, id = 1
2024/04/09 19:45:56 [debug] 1600878#0: *2 [proxy-wasm]["datakit_filter" #1] call: resume
2024/04/09 19:45:56 [debug] 1600878#0: *2 proxy_wasm_alloc: 1376256:1343440:26
2024/04/09 19:45:56 [debug] 1600878#0: *2 proxy_wasm_alloc: 1376256:1346416:16
2024/04/09 19:45:56 [debug] 1600878#0: *2 [proxy-wasm]["datakit_filter" #1] template: run
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm allocate new chainlink and new buf of size 56, cl: 00005A8EFFCA3D30, buf: 00005A8EFFCA3D70
Once done, it sets CONTINUE and the next phases (on_response_headers, etc.) proceed:
2024/04/09 19:45:56 [debug] 1600878#0: *2 proxy_wasm last http dispatch call handled
2024/04/09 19:45:56 [debug] 1600878#0: *2 [proxy-wasm] setting next action: pwctx->action = "CONTINUE" (pwctx: 00005A8F014EDF00)
2024/04/09 19:45:56 [debug] 1600878#0: *2 [proxy-wasm] resetting filter chain: pwctx->exec_index 0 to 0 (pwctx: 00005A8F014EDF00)
2024/04/09 19:45:56 [debug] 1600878#0: *2 proxy_wasm return action: "CONTINUE"
2024/04/09 19:45:56 [debug] 1600878#0: *2 [proxy-wasm]["datakit_filter" #1] filter 1/1 resuming "on_response_headers" step in "header_filter" phase
Contrast this with what happens in the crashing case:
(The wasm tcp socket logs and the like did not show up in the container logs, for some reason.)
I trigger two calls, with dispatch ids 0 and 1:
2024/04/09 21:56:41 [debug] 1321#0: *645 [proxy-wasm]["datakit_filter" #1] filter 1/1 resuming "on_request_headers" step in "rewrite" phase
2024/04/09 21:56:41 [debug] 1321#0: *645 [proxy-wasm]["datakit_filter" #1] call: run
2024/04/09 21:56:41 [debug] 1321#0: *645 [proxy-wasm] setting next action: pwctx->action = "PAUSE" (pwctx: 00000000056B6A50)
2024/04/09 21:56:41 [debug] 1321#0: *645 [proxy-wasm]["datakit_filter" #1] call: dispatch call id: 0
2024/04/09 21:56:41 [debug] 1321#0: *645 [proxy-wasm]["datakit_filter" #1] call: run
2024/04/09 21:56:41 [debug] 1321#0: *645 [proxy-wasm] setting next action: pwctx->action = "PAUSE" (pwctx: 00000000056B6A50)
2024/04/09 21:56:41 [debug] 1321#0: *645 [proxy-wasm]["datakit_filter" #1] call: dispatch call id: 1
The first one resumes, but CONTINUE is set right away:
2024/04/09 21:56:42 [debug] 1321#0: *645 [proxy-wasm]["datakit_filter" #1] filter 1/1 resuming "on_dispatch_response" step in "background" phase
2024/04/09 21:56:42 [debug] 1321#0: *645 [proxy-wasm] setting next action: pwctx->action = "CONTINUE" (pwctx: 00000000056B6A50)
2024/04/09 21:56:42 [debug] 1321#0: *645 [proxy-wasm]["datakit_filter" #1] DataKitFilter: on http call response, id = 0
2024/04/09 21:56:42 [debug] 1321#0: *645 [proxy-wasm]["datakit_filter" #1] call: resume
2024/04/09 21:56:42 [debug] 1321#0: *645 [lua] init.lua:1362: balancer(): setting address (try 1): 93.184.216.34:80
2024/04/09 21:56:42 [debug] 1321#0: *645 [lua] init.lua:1395: balancer(): enabled connection keepalive (pool=93.184.216.34|80, pool_size=512, idle_timeout=60, max_requests=10000)
2024/04/09 21:56:42 [debug] 1321#0: *645 [proxy-wasm] resetting filter chain: pwctx->exec_index 0 to 0 (pwctx: 00000000056B6A50)
This causes all the other phases to run and the filter context to be freed:
2024/04/09 21:56:42 [debug] 1321#0: *645 [proxy-wasm]["datakit_filter" #1] filter 1/1 resuming "on_response_headers" step in "header_filter" phase
2024/04/09 21:56:42 [debug] 1321#0: *645 [proxy-wasm]["datakit_filter" #1] filter 1/1 resuming "on_response_body" step in "body_filter" phase
2024/04/09 21:56:42 [debug] 1321#0: *645 [proxy-wasm]["datakit_filter" #1] filter 1/1 resuming "on_response_body" step in "body_filter" phase
172.17.0.1 - - [09/Apr/2024:21:56:42 +0000] "GET /demo HTTP/1.1" 404 648 "-" "HTTPie/2.6.0" kong_request_id: "0dcdce20288d1e0b60aee81a85b68465"
2024/04/09 21:56:42 [debug] 1321#0: *645 [proxy-wasm]["datakit_filter" #1] filter 1/1 resuming "on_log" step in "log" phase
2024/04/09 21:56:42 [debug] 1321#0: *645 [proxy-wasm]["datakit_filter" #1] filter 1/1 resuming "on_done" step in "done" phase
2024/04/09 21:56:42 [debug] 1321#0: *645 [proxy-wasm]["datakit_filter" #1] filter 1/1 finalizing context
2024/04/09 21:56:42 [debug] 1321#0: *645 [proxy-wasm] "datakit_filter" filter freeing context #1 (1/1)
2024/04/09 21:56:42 [info] 1321#0: *645 client 172.17.0.1 closed keepalive connection
And then when the resume handler runs, it's operating on dead data:
==1321== Invalid read of size 8
==1321== at 0x5D21811: ngx_http_proxy_wasm_dispatch_resume_handler (ngx_http_proxy_wasm_dispatch.c:708)
This seems to indicate that the crash is caused by something being scheduled out of order earlier on (rather than something like "this piece of data was allocated from a pool with the wrong lifetime").
I also ran my local Gateway builds this morning under Valgrind with the same filter configuration as above, to see if it would report anything even though it wasn't crashing, and no memory access violations were reported there.