feat(*) Lua bridge refactor and dispatch fixes
Major refactor of the Lua bridge to support multiple concurrent yielding Lua threads, and of how dispatch call failures are handled so that the request resumes execution. Replaces #539, #545, #523.
The new implementation "tricks" OpenResty by scheduling uthreads via C and passing these threads to the OpenResty runloop as if they were created from Lua (via ngx.thread). Because all uthreads must resume their "parent thread" when finished (as per OpenResty's implementation), we schedule a stub "entry thread" whenever we are trying to use the Lua bridge. This entry thread itself does nothing and is collected at request pool cleanup.
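For intuition, the sketch below only illustrates, with plain OpenResty APIs, the `ngx.thread` contract that the bridge reproduces from C: a user thread belongs to the coroutine that spawned it, and the runloop hands control back to that parent once the uthread terminates. This is an analogy for the scheduling semantics, not the bridge implementation itself.

```nginx
location /t {
    content_by_lua_block {
        -- this handler coroutine plays the role of the stub "entry thread":
        -- it owns the uthread and is the thread the runloop resumes when
        -- the uthread finishes
        local uthread = ngx.thread.spawn(function()
            ngx.sleep(0.1)             -- any yielding operation
            return "child done"
        end)

        -- the parent is only resumed once the child has terminated
        local ok, res = ngx.thread.wait(uthread)
        ngx.say(ok and res or "child failed")
    }
}
```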
List of significant changes for this refactor:
- Breaking: failing dispatch calls do not produce HTTP 500 responses anymore. Instead, the request execution is resumed normally.
- Breaking: the `proxy_wasm.start()` FFI function is removed. Only `proxy_wasm.attach()` is now necessary, and the filter chain is only resumed once the ngx_http_wasm_module `rewrite` or `access` phases are entered. Previously, `proxy_wasm.start()` would resume the filter chain during the ngx_http_lua_module phase handlers, which was incompatible with yielding Lua threads (see the sketch after this list).
- The `ngx.semaphore` API can be used in the Lua bridge. The default Lua resolver now has synchronization enabled.
- In ngx_wasm_socket_tcp, the `sock->env` member is now a pointer to the request's `env` instead of a copy, so as to manipulate the `env->state` control variable.
- The `wasm_call` directive can now yield, which allows for sanity testing of the Lua bridge's yielding functionality.
- A new `rctx->resume_handler` pointer holds the resume entry point back from yielding facilities into `ngx_http_core_run_phases`. For now, only the Lua bridge uses it, but other yielding facilities should be refactored to use it so as to factorize our resuming code.
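For reference, a minimal sketch of the post-change FFI flow, assuming the `resty.wasmx.proxy_wasm` Lua API shipped with this module (`new()`/`load()`/`attach()`); the filter name, path, and server settings are placeholders:

```nginx
# main configuration level
wasm {
    module my_filter /path/to/my_filter.wasm;  # hypothetical filter
}

http {
    init_worker_by_lua_block {
        local proxy_wasm = require "resty.wasmx.proxy_wasm"

        -- build and load a filter chain plan once per worker
        _G.c_plan = assert(proxy_wasm.new({
            { name = "my_filter", config = "" },
        }))
        assert(proxy_wasm.load(_G.c_plan))
    }

    server {
        listen 9000;

        location / {
            access_by_lua_block {
                local proxy_wasm = require "resty.wasmx.proxy_wasm"

                -- attach() alone is now sufficient: the filter chain resumes
                -- once the rewrite/access phases are entered, instead of
                -- requiring a proxy_wasm.start() call from Lua
                assert(proxy_wasm.attach(_G.c_plan))
            }

            return 200;
        }
    }
}
```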
Fix #524
Fix #528
Codecov Report
Attention: Patch coverage is 88.99614% with 57 lines in your changes missing coverage. Please review.
Project coverage is 90.36951%. Comparing base (277fac6) to head (b9bb644).
Additional details and impacted files
```diff
@@             Coverage Diff              @@
##                main        #546         +/-   ##
===================================================
+ Coverage   90.05848%   90.36951%   +0.31103%
===================================================
  Files             47          47
  Lines          10089       10311        +222
===================================================
+ Hits            9086        9318        +232
+ Misses          1003         993         -10
```
| Files | Coverage Δ | |
|---|---|---|
| src/common/lua/ngx_wasm_lua_ffi.c | 92.92035% <100.00000%> (-0.67965%) | :arrow_down: |
| src/common/ngx_wasm_socket_tcp_readers.c | 87.81513% <100.00000%> (ø) | |
| src/common/ngx_wasm_subsystem.c | 100.00000% <100.00000%> (ø) | |
| src/common/proxy_wasm/ngx_proxy_wasm.c | 92.58850% <100.00000%> (+0.04952%) | :arrow_up: |
| src/common/proxy_wasm/ngx_proxy_wasm.h | 91.89189% <ø> (ø) | |
| src/common/proxy_wasm/ngx_proxy_wasm_host.c | 94.37751% <100.00000%> (-0.02250%) | :arrow_down: |
| src/common/proxy_wasm/ngx_proxy_wasm_util.c | 94.14062% <ø> (ø) | |
| src/http/ngx_http_wasm_util.c | 87.06199% <100.00000%> (+0.07011%) | :arrow_up: |
| src/http/proxy_wasm/ngx_http_proxy_wasm.c | 93.71728% <100.00000%> (+0.20377%) | :arrow_up: |
| src/http/proxy_wasm/ngx_http_proxy_wasm_dispatch.c | 91.62562% <100.00000%> (+0.39754%) | :arrow_up: |
| ... and 8 more | | |

| Flag | Coverage Δ | |
|---|---|---|
| unit | 90.13820% <88.56502%> (-0.01297%) | :arrow_down: |
| valgrind | 81.21406% <85.62092%> (+3.19863%) | :arrow_up: |
Flags with carried forward coverage won't be shown.
@hishamhm I would like the datakit filter to be tested with https://github.com/Kong/ngx_wasm_module/pull/546 before merging it.
@thibaultcha I've just tested it and it's still crashing. Running it enough times I do get segfaults, but every run consistently reports memory access errors.
These are the logs from a Gateway run with Valgrind. This particular run didn't crash, but Valgrind reported reads on data that was freed by `ngx_proxy_wasm_dispatch_calls_cancel`:
pr-546-kong-error.log
pr-546-kong-valgrind.log
This was running a datakit configuration with 2 parallel calls that fail (I didn't spin up the other localhost upstream).
@hishamhm What is the specific reproducible example for this? It does not trigger in the test given with this PR, and seems to trigger when `send_local_response` is used at some point (as per the Valgrind report). I tried updating that test with:

```nginx
location /t {
    proxy_wasm hostcalls 'on=request_body \
                          test=/t/dispatch_http_call \
                          host=127.0.0.1:1 \
                          ncalls=2';
    proxy_wasm hostcalls 'on=request_body \
                          test=/t/echo/body';
    echo ok;
}
```
But it does not trigger any memory issues.
@thibaultcha Sure, let me try to get the same filter crash without the gateway.
It took me a while, but I got a consistent repro running without the Gateway: https://github.com/Kong/ngx_wasm_module/actions/runs/9133572211/job/25117356096
This branch/commit was intended only for sharing the test case: https://github.com/Kong/ngx_wasm_module/commit/495c13dcd2aaf5273d9e75a121cfec4d8c7793bd (Loading Datakit on Wasmtime with Valgrind takes a couple of minutes, so I enabled the cache in the testcase as a quality-of-life tweak for speeding up debugging runs...)
I think we can eventually isolate the offending proxy-wasm calls and produce a hostcalls-based testcase, but to go one step at a time I wanted to get the error reproduced without the Gateway on Datakit first.
Looking at the logs, I think the triggering condition is to dispatch a call, then, in the same handler (in this case, `on_request_headers`), trigger a local response before the dispatch callback gets a chance to run.
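Roughly, the sequence is something like the following hypothetical hostcalls configuration (not the minimized test case linked below; it assumes the hostcalls filter exposes a `send_local_response` test path):

```nginx
# hypothetical sketch: dispatch a call that will fail, then produce a local
# response from the same on_request_headers step, before the dispatch
# callback can run
location /t {
    proxy_wasm hostcalls 'on=request_headers \
                          test=/t/dispatch_http_call \
                          host=127.0.0.1:1';
    proxy_wasm hostcalls 'on=request_headers \
                          test=/t/send_local_response/body';
    echo ok;
}
```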
I think I got a minimized test case. Branch tests/interrupted-dispatch on top of this PR — this test: https://github.com/Kong/ngx_wasm_module/commit/6cb6899c75b70370fac89f5dc47e246ef5c19ab7
Still waiting for the CI outcome here, but it produces the Valgrind reports locally.
@hishamhm Thanks for the test! This turned out to be yet another thing!
@hishamhm Would you give it another try with the current state of this branch? Except for a small failure in dynamic builds that I'm still investigating, it all runs green, so I hope it fixes everything now.
Ok, I fixed the last problem, but @flrgh found another problem in the Kong PR, which I am also looking at now.
@flrgh Ok, the latest state of this branch should also take care of the Gateway issue. I have the `spec/02-integration/20-wasm/` suite passing locally.
Nice! Looking good with 75e101780f09ade9a4bada0aa8340c0f6eb74f97 at my end too.
I'll give this branch a spin later today!
I have found one more bug that I am trying to get rid of before merging this.
@flrgh @hishamhm Merged! Hopefully in time for the Gateway patch release.