fix: support for Fibers
Continuation of #99. Closes #46, closes #99.
I don't understand why fiber tests are failing in Docker containers but not when running directly on GHA runners. Adjusting the stack size with ulimit and setting -fsplit-stack seems to have no effect. I also can't reproduce the issue locally (macOS + M1) or in Docker (tests are always green).
Alpine/musl doesn't seem to be the problem here, as tests also fail on Debian in Docker.
Do you have any idea what the difference could be, @withinboredom?
I'll have to take a gander in the morning. At least for me, it looks like Docker has gone insane...
=> [internal] load .dockerignore 0.1s
=> => transferring context: 133B 0.0s
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 2.29kB 0.0s
=> CANCELED [internal] load metadata for docker.io/library/php-base:latest 0.5s
=> CANCELED [internal] load metadata for docker.io/mlocati/php-extension-installer:latest 0.5s
=> ERROR [internal] load metadata for docker.io/library/golang-base:latest 0.5s
=> [auth] library/php-base:pull token for registry-1.docker.io 0.0s
------
> [internal] load metadata for docker.io/library/golang-base:latest:
------
Dockerfile:6
--------------------
4 |
5 | COPY --from=golang-base /usr/local/go/bin/go /usr/local/bin/go
6 | >>> COPY --from=golang-base /usr/local/go /usr/local/go
7 |
8 | # This is required to link the FrankenPHP binary to the PHP binary
--------------------
ERROR: failed to solve: golang-base: pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed
I'm an idiot and forgot docker bake is a thing.
So @dunglas, I was digging around php-src to start that blog post :) and I saw this in the 8.3 notes:
Custom Fiber implementations have to initialize EG(stack_limit) and EG(stack_base).
Maybe frankenphp needs to implement a go-specific fiber implementation?
There's also the fiber.stack_size ini entry.
Internals: SAPIs that may execute PHP in alternative stacks have to set EG(stack_limit) and EG(stack_base)
From: https://github.com/php/php-src/pull/9104
That PR is amazing information about how the stack works. Something that Go might be screwing up:
Platforms with stacks that grow downwards are supported (probably all platforms supported by PHP). This assumes contiguous stacks.
Emphasis mine. I don't think the stack will be contiguous with Go involved.
Thanks for the pointer, this looks interesting! Passing --disable-fiber-asm doesn't help.
Regarding contiguous stacks, I'm not sure that Go will be a problem because FrankenPHP creates the threads itself, they aren't the same threads as the one started by Go.
they aren't the same threads as the one started by Go.
Yep, saw that as well. I wonder if we're running into the same situation as https://github.com/golang/go/issues/39411#issuecomment-639863069 and the PHP fiber implementation is somehow messing up the main stack?
I added zend_call_stack_init(); in frankenphp_execute_script (from Zend/zend_call_stack.h) as an experiment. It went from passing 30-40% of the time to passing 70-80% of the time. There's probably a hint somewhere in that code path as to what is going on. Maybe. (this is PHP 8.3 specific, that code doesn't exist in 8.2)
One thing sticks out to me:
goroutine 120 [runnable, locked to thread]:
runtime.gcTrigger.test({0x0?, 0x0?, 0x0?})
/usr/local/go/src/runtime/mgc.go:547 +0xfb fp=0xc000b27990 sp=0xc000b27988 pc=0x41ed7b
runtime.mallocgc(0x4000, 0x788020, 0x1)
/usr/local/go/src/runtime/malloc.go:1171 +0x768 fp=0xc000b279f8 sp=0xc000b27990 pc=0x412f48
runtime.makeslice(0xac9120?, 0x788060?, 0xc000b27a68?)
/usr/local/go/src/runtime/slice.go:103 +0x52 fp=0xc000b27a20 sp=0xc000b279f8 pc=0x454212
github.com/dunglas/frankenphp.go_read_post(0x70ad45?, 0x409b46?, 0x4000)
/go/src/app/frankenphp.go:590 +0x87 fp=0xc000b27b48 sp=0xc000b27a20 pc=0x7079e7
_cgoexp_411032355cef_go_read_post(0x7f348abe9ee0)
_cgo_gotypes.go:985 +0x2c fp=0xc000b27b70 sp=0xc000b27b48 pc=0x70ae0c
runtime.cgocallbackg1(0x70ade0, 0xc000b27d20?, 0x0)
/usr/local/go/src/runtime/cgocall.go:315 +0x2b1 fp=0xc000b27c38 sp=0xc000b27b70 pc=0x409a51
runtime.cgocallbackg(0xc000482340?, 0x300000002?, 0xc000482340?)
/usr/local/go/src/runtime/cgocall.go:234 +0x109 fp=0xc000b27cc8 sp=0xc000b27c38 pc=0x409709
runtime.cgocallbackg(0x70ade0, 0x7f348abe9ee0, 0x0)
<autogenerated>:1 +0x2f fp=0xc000b27cf0 sp=0xc000b27cc8 pc=0x47382f
runtime.cgocallback(0x409585, 0x759b20, 0xc000b27d80)
/usr/local/go/src/runtime/asm_amd64.s:998 +0xb4 fp=0xc000b27d18 sp=0xc000b27cf0 pc=0x4710d4
runtime.systemstack_switch()
/usr/local/go/src/runtime/asm_amd64.s:463 fp=0xc000b27d20 sp=0xc000b27d18 pc=0x46f0e0
runtime.cgocall(0x759b20, 0xc000b27d80)
/usr/local/go/src/runtime/cgocall.go:167 +0x85 fp=0xc000b27d58 sp=0xc000b27d20 pc=0x409585
github.com/dunglas/frankenphp._Cfunc_frankenphp_execute_script(0x7ef0bc0a35d0)
_cgo_gotypes.go:770 +0x48 fp=0xc000b27d80 sp=0xc000b27d58 pc=0x7047e8
github.com/dunglas/frankenphp.go_execute_script(0xc000482301?)
/go/src/app/frankenphp.go:461 +0x207 fp=0xc000b27e20 sp=0xc000b27d80 pc=0x7067a7
_cgoexp_411032355cef_go_execute_script(0x7f4987ffea50?)
_cgo_gotypes.go:925 +0x1c fp=0xc000b27e38 sp=0xc000b27e20 pc=0x70abdc
runtime.cgocallbackg1(0x70abc0, 0xc000b27fe0?, 0x0)
/usr/local/go/src/runtime/cgocall.go:315 +0x2b1 fp=0xc000b27f00 sp=0xc000b27e38 pc=0x409a51
runtime.cgocallbackg(0x0?, 0x0?, 0x0?)
/usr/local/go/src/runtime/cgocall.go:234 +0x109 fp=0xc000b27f90 sp=0xc000b27f00 pc=0x409709
runtime.cgocallbackg(0x70abc0, 0x7f348abf0980, 0x0)
<autogenerated>:1 +0x2f fp=0xc000b27fb8 sp=0xc000b27f90 pc=0x47382f
runtime.cgocallback(0x0, 0x0, 0x0)
/usr/local/go/src/runtime/asm_amd64.s:998 +0xb4 fp=0xc000b27fe0 sp=0xc000b27fb8 pc=0x4710d4
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000b27fe8 sp=0xc000b27fe0 pc=0x471301
Is Go doing GC while executing a script? Am I reading that right?
GC is a red herring. I disabled it and analysed a bunch of stack traces. There doesn't seem to be much in common between them, which makes me think this is on PHP's side, not Go's.
Oh, and zend_call_stack_init is also a red herring. After more rigorous testing, it is clear that I was just "lucky".
There's some asm in Go that switches to g0 upon returning from a cgo callback. I suspect that is where the culprit is. However, I cannot reproduce the crash in the debugger. Still trying.
From cgocall.go:
// The call from C is on gp.m's g0 stack, so we must ensure
// that we stay on that M. We have to do this before calling
// exitsyscall, since it would otherwise be free to move us to
// a different M. The call to unlockOSThread is in unwindm.
hmmmm....
I've managed to figure out how to reproduce it reliably while a debugger is running. Yet stepping through the code is enough to work around the issue. This is certainly a race condition:
- Reboot the computer.
- Startup debugging of frankenphp
- Make a request to localhost
- See the crash within the first 1-5 requests
- Goto (1)
I've narrowed down the issue to somewhere in _cgo_gotypes.go, which appears to be a generated file and not checked into source control. The issue appears to possibly be a bit of funky memory from PHP. Although that isn't certain yet. This is a very slippery bug.
In fact, that it requires a reboot (unless you're running in Docker!) almost makes it smell like a kernel bug, but I doubt it. Probably some little chunk of memory gets cached in a CPU cache line that makes it work after the first few requests, and switching namespaces in the kernel gets it evicted.
Requires a custom build of PHP, but I can no longer reproduce it locally...
./configure $conf_vars
echo '#define ZEND_FIBER_UCONTEXT 1' >> main/php_config.h
make -j
ZEND_FIBER_UCONTEXT uses a different method for managing the stack that seems to work just fine with Go.
@withinboredom thanks for your in-depth research on this topic, I very much appreciate it!! Unfortunately, I don't think that defining ZEND_FIBER_UCONTEXT is enough, because it's already what this PR does and tests are still failing (at least with Docker): https://github.com/dunglas/frankenphp/pull/171/files#diff-dd2c0eb6ea5cfc6c4bd4eac30934e2d5746747af48fef6da689e85b752f39557R95-R96
Passing the --disable-fiber-asm flag defines ZEND_FIBER_UCONTEXT: https://github.com/php/php-src/blob/php-8.2.9/configure.ac#L1317
:sigh: ok, I have another lead, but that was the most promising.
Funny enough, today I did run into it with the asm disabled locally (outside of tests).
Anyway, I think I know what is going on. At least I'm fairly certain. This bug should affect ALL cgo programs, if I'm actually understanding things correctly.
Essentially, the issue is that we're passing a pointer from C to Go that contains pointers to Go structures... morestack calls newstack, which makes a copy of the stack and rewrites all pointers into it. When we're using Fibers, reentry into Go causes the stack to grow, which causes it to rewrite all of those pointers.
Here's the thing: these pointers are owned by C structs, not Go structs, so things crash quite spectacularly when the stale pointers don't get rewritten by newstack.
I think this can be fixed by having a hash in Go that the C structs then reference (main context / current context). Essentially:
// Go Land
$contexts = ['main' => $ctx, 0 => $otherctx, 2 => ... ];
// C Land
{
go_call_thingy('main', 0);
}
// Go Land
go_call_thingy($main, $current) {
$main = $contexts[$main];
$current = $contexts[$current];
}
or something like that. Here's the thing: once a few requests are served, the stack stays pretty stable, which is why this is hard to reproduce on 'real servers' vs. these tests, which keep the stack really small and thus easy to grow, especially the g0 stack.
The moral of the story is that I think we cannot reference Go structs from C structs or vice-versa.
Thanks for sharing your findings, this looks promising!
Maybe we are violating the cgo pointer passing rules in some way?
Have you already identified the offending pointers? Maybe we could avoid passing them from Go to C, or pin them?
Also, we should set GODEBUG=cgocheck=2 in tests to see if we get better errors.
Have you already identified the offending pointers? Maybe we could avoid passing them from Go to C, or pin them?
Ah, yep. I guess I accidentally hinted at them. It's the HTTP contexts from Caddy, which are stored in the current/main contexts in FrankenPHP's C code.
I just reviewed the code again, and I don't think that we store any Go pointer in C memory.
We use runtime/cgo.Handle (which is designed to not have to pass Go pointers to C).
Do you have any GDB backtrace? (I still cannot reproduce locally, most likely because my CPU is an Apple M1, which seems unaffected by this bug.)
I went down quite a rabbit hole trying to find a bad pointer.
Handy bit of C:
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

static void debug_log(char *data) {
    pid_t id = gettid();
    fprintf(stderr, "tid %d: %s", id, data);
}
- after commenting out sections, hardcoding values from Go, and adding debugging statements everywhere, I realized that I was getting a crash before a thread could be reused. After adding debugging code in thpool, this turned out to not be true.
- as far as I can tell, the crash almost always happens between the call (cgo asm) to `go_add_header` from `frankenphp_send_headers`. It doesn't matter what you pass to it, whether it is an invalid pointer or not, whether it is junk length or junk data. Simply making the call breaks everything.
- then I realized that was not the issue, after a `git stash` and commenting out the `go_header*` funcs. No, I also had to remove the call to `go_ub_write`. Once those were commented out, there was no more crashing.
- Obviously, this is not ideal. However, it appears that simply outputting something from a PHP fiber causes Go to crash in the assembly code. Sometimes. In between the C code and the actual entry into the Go program. I have no idea why; I cannot catch it while debugging.
This is beyond my expertise to resolve now, as I'm not intimately familiar with Go and assembly. Here's a minimal diff to get tests to run on this branch (though they still fail due to not outputting anything) without crashing; uncommenting either of those functions will cause the crash when running tests.
Apparently, this may be amd64-specific. I can reproduce this in an Ubuntu 22.04 VM or a Docker container without fail. On my bare-metal Linux laptop (Pop!_OS 22.04), I cannot reproduce it reliably (if at all).
Now I'm wondering if this is Ubuntu-specific... but I'll have to test that later, it's 1am.
cc: @krakjoe
@withinboredom here is why: https://github.com/golang/go/issues/62130#issuecomment-1684431260
We'll likely have to patch Go to add support for Fibers.
I wonder if we can do this in C.
- detect we are in a fiber context
- reset the stack back to what Go expects
- call go
- on return, reset the stack to what PHP expects
- resume program
Huh, dunno how I missed this, but zend_timeout_handler gets called from a syscall to sleep, even from Go. So, as soon as Go sleeps for any reason, it tries to call zend_timeout_handler... this causes a crash, instantly.
Regarding zend_timeout_handler, are you sure that you compiled PHP with --enable-zend-max-execution-timers? This feature prevents this issue and we have tests for this.
are you sure that you compiled PHP with --enable-zend-max-execution-timers
Double-checked and verified. Running 8.3-dev fwiw.
In fact, building with --disable-zend-max-execution-timers causes tests to pass with fibers. It seems there is a bug with that feature?
It looks like it tries to run zend_timeout_handler while in a Go context, on the Go stack, and the whole thing crashes immediately. Unless you happen to be stepping through the thread where it happens, it just looks like the Go thread is parked in the stack trace.
Doing this:
static void disable_timers() {
#ifdef ZEND_MAX_EXECUTION_TIMERS
// unset default timeout
// TODO: add support for max_input_time
zend_unset_timeout();
#endif
}
static void restart_timers() {
#ifdef ZEND_MAX_EXECUTION_TIMERS
// Reset default timeout
// TODO: add support for max_input_time
zend_set_timeout(INI_INT("max_execution_time"), 0);
#endif
}
and wrapping calls into Go that happen during script execution also seems to work. It'd be handy if zend_unset_timeout returned the seconds remaining so we could call zend_set_timeout with the correct time.
Turns out to be that PHP isn't installing the signal handlers properly. I'm opening a PR over there to get it fixed in 8.2/8.3 -- well, hopefully. I've never opened a PR over there.
I can assist you and review if you need any help (it's probably related to changes I've done to PHP). Feel free to contact me in DM!