Memory corruption when dispatch socket outlives request
I've been observing frequent segfaults when running the datakit filter on a kong/kong:nightly container (which uses ngx_wasm_module commit d5e4c698c51a6ef39551184f99eb4d1605f671d9) with a filter configuration that dispatches multiple HTTP calls. (I could not reproduce the problem with a configuration that makes a single HTTP dispatch.)
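For context, the filter behavior that triggers this boils down to "dispatch several HTTP calls from on_http_request_headers and pause the request". A minimal proxy-wasm Rust sketch of that shape (not the actual datakit code; the TwoCallsFilter name, upstreams and headers below are placeholders) looks like this, with the response side sketched a bit further down:

use proxy_wasm::traits::{Context, HttpContext};
use proxy_wasm::types::Action;
use std::time::Duration;

// entry point (proxy-wasm-rust-sdk 0.2.x style)
proxy_wasm::main! {{
    proxy_wasm::set_http_context(|_, _| -> Box<dyn HttpContext> {
        Box::new(TwoCallsFilter { pending: 0 })
    });
}}

struct TwoCallsFilter {
    // number of dispatch calls still waiting for a response
    pending: usize,
}

impl HttpContext for TwoCallsFilter {
    fn on_http_request_headers(&mut self, _nheaders: usize, _eof: bool) -> Action {
        // one dispatch per configured upstream ("call: run" in the logs below)
        for upstream in ["service_a", "service_b"] {
            self.dispatch_http_call(
                upstream,
                vec![(":method", "GET"), (":path", "/"), (":authority", "example.com")],
                None,
                vec![],
                Duration::from_secs(5),
            )
            .unwrap();
            self.pending += 1;
        }
        // keep the main request paused until the dispatch responses are handled
        Action::Pause
    }
}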
This was one of the segfaults:
I got others with different backtraces, which suggested an earlier memory corruption.
I installed Valgrind in the container and got a ton of memory errors, all of which seem to follow from the same cause as the first one:
ngx_wasm_socket_tcp_resume triggers the ngx_http_proxy_wasm_dispatch_resume_handler, but the request data has already been destroyed by ngx_http_free_request.
The invalid data in the Valgrind trace above is call->pwexec, where call comes from sock->data, which is the input argument of the function.
The deallocation at ngx_http_free_request is ngx_destroy_pool for r->pool.
One odd thing I noticed in the logs before the first Valgrind error is that I did get two call: run lines (which my filter triggers from on_http_request_headers), but only one call: resume line (which it triggers from on_http_call_response). The resume operation seems to be delayed until after ngx_http_free_request has already destroyed the main request's data, taking the pwexec context with it.
This is unexpected because in normal operation the filter should be paused waiting for the last dispatch to finish.
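The response side of the sketch above (again placeholder code, not the actual datakit logic) makes that explicit: the paused request is only resumed once the last outstanding dispatch has been handled.

impl Context for TwoCallsFilter {
    fn on_http_call_response(&mut self, _token: u32, _nheaders: usize,
                             _body_size: usize, _ntrailers: usize) {
        // "call: resume" in the logs: one fewer dispatch outstanding
        self.pending -= 1;
        if self.pending == 0 {
            // only once the last dispatch has completed should the
            // main request be allowed to continue
            self.resume_http_request();
        }
    }
}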
Normally, in my local build, I get this sequence of operations for two dispatch calls:
Data for first dispatch arrives:
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm tcp socket handler (wev: 0)
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm tcp socket receive handler for "127.0.0.1:6502"
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm tcp socket resuming
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm tcp socket trying to receive data (max: 1024)
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm tcp socket resuming http response reading with 243 bytes to parse
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm tcp socket resuming http response reading with 226 bytes to parse
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm reuse free buf memory 103 >= 69, cl:00005A8EFF7BA8D0, p:00005A8F001E1FB0
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm allocate new chainlink and new buf of size 69, cl: 00005A8EFFC30520, buf: 00005A8EFFC30560
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm tcp socket reading done
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm tcp socket closing
First dispatch resumes:
2024/04/09 19:45:56 [debug] 1600878#0: *2 [proxy-wasm]["datakit_filter" #1] filter 1/1 resuming "on_dispatch_response" step in "background" phase
2024/04/09 19:45:56 [debug] 1600878#0: *2 [proxy-wasm]["datakit_filter" #1] DataKitFilter: on http call response, id = 0
2024/04/09 19:45:56 [debug] 1600878#0: *2 [proxy-wasm]["datakit_filter" #1] call: resume
2024/04/09 19:45:56 [debug] 1600878#0: *2 proxy_wasm_alloc: 1376256:1343312:69
2024/04/09 19:45:56 [debug] 1600878#0: *2 proxy_wasm_alloc: 1376256:1343440:16
It PAUSEs at the end because there's another dispatch pending:
2024/04/09 19:45:56 [debug] 1600878#0: *2 proxy_wasm more http dispatch calls pending...
2024/04/09 19:45:56 [debug] 1600878#0: *2 [proxy-wasm] setting next action: pwctx->action = "PAUSE" (pwctx: 00005A8F014EDF00)
Data for second dispatch arrives:
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm tcp socket handler (wev: 0)
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm tcp socket receive handler for "127.0.0.1:8008"
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm tcp socket resuming
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm tcp socket trying to receive data (max: 1024)
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm tcp socket resuming http response reading with 200 bytes to parse
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm tcp socket resuming http response reading with 183 bytes to parse
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm allocate new chainlink and new buf of size 26, cl: 00005A8F01586240, buf: 00005A8EFF866290
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm allocate new chainlink and new buf of size 26, cl: 00005A8F0048BE80, buf: 00005A8F0048BEC0
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm tcp socket reading done
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm tcp socket closing
Second dispatch resumes:
2024/04/09 19:45:56 [debug] 1600878#0: *2 [proxy-wasm]["datakit_filter" #1] filter 1/1 resuming "on_dispatch_response" step in "background" phase
2024/04/09 19:45:56 [debug] 1600878#0: *2 [proxy-wasm]["datakit_filter" #1] DataKitFilter: on http call response, id = 1
2024/04/09 19:45:56 [debug] 1600878#0: *2 [proxy-wasm]["datakit_filter" #1] call: resume
2024/04/09 19:45:56 [debug] 1600878#0: *2 proxy_wasm_alloc: 1376256:1343440:26
2024/04/09 19:45:56 [debug] 1600878#0: *2 proxy_wasm_alloc: 1376256:1346416:16
2024/04/09 19:45:56 [debug] 1600878#0: *2 [proxy-wasm]["datakit_filter" #1] template: run
2024/04/09 19:45:56 [debug] 1600878#0: *2 wasm allocate new chainlink and new buf of size 56, cl: 00005A8EFFCA3D30, buf: 00005A8EFFCA3D70
Once done, it sets CONTINUE and the next phases (on_response_headers, etc.) proceed:
2024/04/09 19:45:56 [debug] 1600878#0: *2 proxy_wasm last http dispatch call handled
2024/04/09 19:45:56 [debug] 1600878#0: *2 [proxy-wasm] setting next action: pwctx->action = "CONTINUE" (pwctx: 00005A8F014EDF00)
2024/04/09 19:45:56 [debug] 1600878#0: *2 [proxy-wasm] resetting filter chain: pwctx->exec_index 0 to 0 (pwctx: 00005A8F014EDF00)
2024/04/09 19:45:56 [debug] 1600878#0: *2 proxy_wasm return action: "CONTINUE"
2024/04/09 19:45:56 [debug] 1600878#0: *2 [proxy-wasm]["datakit_filter" #1] filter 1/1 resuming "on_response_headers" step in "header_filter" phase
Contrast this with what happens in the crashing case:
(The wasm tcp socket logs and the like did not show up in the container logs, for some reason.)
I trigger two calls, with dispatch ids 0 and 1:
2024/04/09 21:56:41 [debug] 1321#0: *645 [proxy-wasm]["datakit_filter" #1] filter 1/1 resuming "on_request_headers" step in "rewrite" phase
2024/04/09 21:56:41 [debug] 1321#0: *645 [proxy-wasm]["datakit_filter" #1] call: run
2024/04/09 21:56:41 [debug] 1321#0: *645 [proxy-wasm] setting next action: pwctx->action = "PAUSE" (pwctx: 00000000056B6A50)
2024/04/09 21:56:41 [debug] 1321#0: *645 [proxy-wasm]["datakit_filter" #1] call: dispatch call id: 0
2024/04/09 21:56:41 [debug] 1321#0: *645 [proxy-wasm]["datakit_filter" #1] call: run
2024/04/09 21:56:41 [debug] 1321#0: *645 [proxy-wasm] setting next action: pwctx->action = "PAUSE" (pwctx: 00000000056B6A50)
2024/04/09 21:56:41 [debug] 1321#0: *645 [proxy-wasm]["datakit_filter" #1] call: dispatch call id: 1
The first one resumes, but CONTINUE is set right away:
2024/04/09 21:56:42 [debug] 1321#0: *645 [proxy-wasm]["datakit_filter" #1] filter 1/1 resuming "on_dispatch_response" step in "background" phase
2024/04/09 21:56:42 [debug] 1321#0: *645 [proxy-wasm] setting next action: pwctx->action = "CONTINUE" (pwctx: 00000000056B6A50)
2024/04/09 21:56:42 [debug] 1321#0: *645 [proxy-wasm]["datakit_filter" #1] DataKitFilter: on http call response, id = 0
2024/04/09 21:56:42 [debug] 1321#0: *645 [proxy-wasm]["datakit_filter" #1] call: resume
2024/04/09 21:56:42 [debug] 1321#0: *645 [lua] init.lua:1362: balancer(): setting address (try 1): 93.184.216.34:80
2024/04/09 21:56:42 [debug] 1321#0: *645 [lua] init.lua:1395: balancer(): enabled connection keepalive (pool=93.184.216.34|80, pool_size=512, idle_timeout=60, max_requests=10000)
2024/04/09 21:56:42 [debug] 1321#0: *645 [proxy-wasm] resetting filter chain: pwctx->exec_index 0 to 0 (pwctx: 00000000056B6A50)
This causes all the other phases to run and the filter context to be freed:
2024/04/09 21:56:42 [debug] 1321#0: *645 [proxy-wasm]["datakit_filter" #1] filter 1/1 resuming "on_response_headers" step in "header_filter" phase
2024/04/09 21:56:42 [debug] 1321#0: *645 [proxy-wasm]["datakit_filter" #1] filter 1/1 resuming "on_response_body" step in "body_filter" phase
2024/04/09 21:56:42 [debug] 1321#0: *645 [proxy-wasm]["datakit_filter" #1] filter 1/1 resuming "on_response_body" step in "body_filter" phase
172.17.0.1 - - [09/Apr/2024:21:56:42 +0000] "GET /demo HTTP/1.1" 404 648 "-" "HTTPie/2.6.0" kong_request_id: "0dcdce20288d1e0b60aee81a85b68465"
2024/04/09 21:56:42 [debug] 1321#0: *645 [proxy-wasm]["datakit_filter" #1] filter 1/1 resuming "on_log" step in "log" phase
2024/04/09 21:56:42 [debug] 1321#0: *645 [proxy-wasm]["datakit_filter" #1] filter 1/1 resuming "on_done" step in "done" phase
2024/04/09 21:56:42 [debug] 1321#0: *645 [proxy-wasm]["datakit_filter" #1] filter 1/1 finalizing context
2024/04/09 21:56:42 [debug] 1321#0: *645 [proxy-wasm] "datakit_filter" filter freeing context #1 (1/1)
2024/04/09 21:56:42 [info] 1321#0: *645 client 172.17.0.1 closed keepalive connection
And then when the resume handler runs, it's operating on dead data:
==1321== Invalid read of size 8
==1321== at 0x5D21811: ngx_http_proxy_wasm_dispatch_resume_handler (ngx_http_proxy_wasm_dispatch.c:708)
This seems to indicate that the crash is caused by something being scheduled out of order earlier on (rather than something like "this piece of data was allocated from a pool with the wrong lifetime").
I also ran my local Gateway builds this morning under Valgrind with the same filter configuration as above, to see if it would report anything even though it wasn't crashing, and no memory access violations were reported there.