feat(*) Lua bridge refactor and dispatch fixes

thibaultcha opened this pull request 1 year ago

Major refactor of the Lua bridge to support multiple concurrent yielding Lua threads, and of how dispatch call failures are handled so that the request continues executing. Replaces #539, #545, #523.

The new implementation "tricks" OpenResty by scheduling uthreads via C and passing these threads to the OpenResty runloop as if they were created from Lua (via ngx.thread). Because all uthreads must resume their "parent thread" when finished (as per OpenResty's implementation), we schedule a stub "entry thread" whenever we are trying to use the Lua bridge. This entry thread itself does nothing and is collected at request pool cleanup.

List of significant changes for this refactor:

  • Breaking: failing dispatch calls do not produce HTTP 500 responses anymore. Instead, the request execution is resumed normally.
  • Breaking: the proxy_wasm.start() FFI function is removed. Only proxy_wasm.attach() is now necessary, and the filter chain is only resumed once the ngx_http_wasm_module rewrite or access phases are entered. Previously, proxy_wasm.start() would resume the filter chain during the ngx_http_lua_module phase handlers, which was incompatible with yielding Lua threads.
  • The ngx.semaphore API can be used in the Lua bridge. The default Lua resolver now has synchronization enabled.
  • In ngx_wasm_socket_tcp, the sock->env member is now a pointer to the request's env instead of a copy, so that the env->state control variable can be manipulated.
  • The wasm_call directive can now yield, which allows for sanity testing of the Lua bridge yielding functionality.
  • A new rctx->resume_handler pointer holds the resume entry point back from yielding facilities into ngx_http_core_run_phases. For now, only the Lua bridge uses it, but other yielding facilities should be refactored to use it so as to consolidate our resume code.

Fix #524 Fix #528

thibaultcha · May 14 '24 16:05

Codecov Report

Attention: Patch coverage is 88.99614% with 57 lines in your changes missing coverage. Please review.

Project coverage is 90.36951%. Comparing base (277fac6) to head (b9bb644).

Additional details and impacted files

@@                 Coverage Diff                 @@
##                main        #546         +/-   ##
===================================================
+ Coverage   90.05848%   90.36951%   +0.31103%     
===================================================
  Files             47          47                 
  Lines          10089       10311        +222     
===================================================
+ Hits            9086        9318        +232     
+ Misses          1003         993         -10     
Files                                                 Coverage Δ
src/common/lua/ngx_wasm_lua_ffi.c                     92.92035% <100.00000%> (-0.67965%) ↓
src/common/ngx_wasm_socket_tcp_readers.c              87.81513% <100.00000%> (ø)
src/common/ngx_wasm_subsystem.c                       100.00000% <100.00000%> (ø)
src/common/proxy_wasm/ngx_proxy_wasm.c                92.58850% <100.00000%> (+0.04952%) ↑
src/common/proxy_wasm/ngx_proxy_wasm.h                91.89189% <ø> (ø)
src/common/proxy_wasm/ngx_proxy_wasm_host.c           94.37751% <100.00000%> (-0.02250%) ↓
src/common/proxy_wasm/ngx_proxy_wasm_util.c           94.14062% <ø> (ø)
src/http/ngx_http_wasm_util.c                         87.06199% <100.00000%> (+0.07011%) ↑
src/http/proxy_wasm/ngx_http_proxy_wasm.c             93.71728% <100.00000%> (+0.20377%) ↑
src/http/proxy_wasm/ngx_http_proxy_wasm_dispatch.c    91.62562% <100.00000%> (+0.39754%) ↑
... and 8 more

Flag        Coverage Δ
unit        90.13820% <88.56502%> (-0.01297%) ↓
valgrind    81.21406% <85.62092%> (+3.19863%) ↑

Flags with carried forward coverage won't be shown.

codecov[bot] · May 14 '24 17:05

@hishamhm I would like the datakit filter to be tested with https://github.com/Kong/ngx_wasm_module/pull/546 before merging it.

@thibaultcha I've just tested it and it's still crashing. Running it enough times I do get segfaults, and every run consistently reports memory access errors.

These are the logs from a Gateway run with Valgrind. This particular run didn't crash, but Valgrind reported reads on data that was freed by ngx_proxy_wasm_dispatch_calls_cancel: pr-546-kong-error.log pr-546-kong-valgrind.log

This was running a datakit configuration with 2 parallel calls that fail (I didn't spin up the other localhost upstream).

hishamhm · May 17 '24 14:05

@hishamhm What is the specific reproducible example for this? It does not trigger in the test given with this PR, and seems to trigger when send_local_response is used at some point (as per the Valgrind report). I tried updating that test with:

    location /t {
        proxy_wasm hostcalls 'on=request_body \
                              test=/t/dispatch_http_call \
                              host=127.0.0.1:1 \
                              ncalls=2';
        proxy_wasm hostcalls 'on=request_body \
                              test=/t/echo/body';
        echo ok;
    }

But it does not trigger any memory issues.

thibaultcha · May 17 '24 15:05

@thibaultcha Sure, let me try to get the same filter crash without the gateway.

hishamhm · May 17 '24 16:05

Took me a while but I got a consistent repro running without the gateway: https://github.com/Kong/ngx_wasm_module/actions/runs/9133572211/job/25117356096

This branch/commit was intended only for sharing the test case: https://github.com/Kong/ngx_wasm_module/commit/495c13dcd2aaf5273d9e75a121cfec4d8c7793bd (Loading Datakit on Wasmtime with Valgrind takes a couple of minutes, so I enabled the cache in the testcase as a quality-of-life tweak for speeding up debugging runs...)

I think we can eventually isolate the offending proxy-wasm calls and produce a hostcalls-based testcase, but to go one step at a time I wanted to get the error reproduced without the Gateway on Datakit first.

Looking at the logs, I think the triggering condition is to dispatch a call, then in the same handler (in this case, on_request_headers), trigger a local response, before the dispatch callback gets a chance to run.

hishamhm · May 17 '24 20:05

I think I got a minimized test case. Branch tests/interrupted-dispatch on top of this PR — this test: https://github.com/Kong/ngx_wasm_module/commit/6cb6899c75b70370fac89f5dc47e246ef5c19ab7

Still waiting for the CI outcome here, but it produces the Valgrind reports locally.

hishamhm · May 17 '24 21:05

@hishamhm Thanks for the test! This turned out to be yet another thing!

thibaultcha · May 18 '24 06:05

@hishamhm Would you give it another try with the current state of this branch? Except for a small failure in dynamic builds that I'm still investigating, it all runs green, so I hope it fixes everything now.

thibaultcha · May 28 '24 16:05

Ok, I fixed the last problem, but @flrgh found another problem in the Kong PR, which I am also looking at now.

thibaultcha · May 29 '24 02:05

@flrgh Ok, the latest state of this branch should also take care of the Gateway issue. I have the spec/02-integration/20-wasm/ suite passing locally.

thibaultcha · May 29 '24 02:05

> @flrgh Ok, the latest state of this branch should also take care of the Gateway issue. I have the spec/02-integration/20-wasm/ suite passing locally.

Nice! Looking good with 75e101780f09ade9a4bada0aa8340c0f6eb74f97 at my end too.

flrgh · May 29 '24 18:05

I'll give this branch a spin later today!

hishamhm · May 29 '24 19:05

I have found one more bug that I am trying to get rid of before merging this.

thibaultcha · Jun 04 '24 17:06

@flrgh @hishamhm Merged! Hopefully in time for the Gateway patch release.

thibaultcha · Jun 11 '24 15:06