Using Fibers causes epic crash

Open withinboredom opened this issue 3 years ago • 13 comments

Minimal code to reproduce:

<?php

do {
    $running = false;
    //$running = frankenphp_handle_request(function (): void {
        $fiber = new Fiber(function() {
            echo "Starting Fiber\n";
        });
        $fiber->start();
    //});
} while ($running);

With some slight modifications, it can also be reproduced in worker mode.

withinboredom avatar Oct 20 '22 17:10 withinboredom

@dunglas the following Dockerfile (props @cdaguerre in #374) appears to "fix" fibers, at least for this reproducer with manual testing. It needs more testing:

FROM dunglas/frankenphp:latest-builder-php8.3-alpine AS builder

COPY --from=caddy:builder-alpine /usr/bin/xcaddy /usr/bin/xcaddy

ENV CGO_ENABLED=1 XCADDY_SETCAP=1 CGO_CXXFLAGS=-fPIE CGO_CFLAGS=-fPIE CGO_LDFLAGS=-pie XCADDY_GO_BUILD_FLAGS='-buildmode=pie -ldflags="-w -s" -trimpath'
RUN xcaddy build \
    --output /usr/local/bin/frankenphp \
    --with github.com/dunglas/frankenphp=./ \
    --with github.com/dunglas/frankenphp/caddy=./caddy/ \
    --with github.com/dunglas/mercure/caddy \
    --with github.com/dunglas/vulcain/caddy

:partying_face: :crossed_fingers: :crossed_fingers: still testing...

withinboredom avatar Dec 13 '23 15:12 withinboredom

Great news! Don't hesitate to open a PR with these changes, so we can see if this fixes the issue for all architectures.

dunglas avatar Dec 13 '23 21:12 dunglas

I'll do some proper testing by Monday (by updating the fiber branch), but I haven't seen a crash yet via manual testing.

withinboredom avatar Dec 13 '23 21:12 withinboredom

@withinboredom I have had issues with fibers too, so I could test this on my Cloud Run service, but I'm not sure where I can get a Docker image that includes this fix.

piotrekkr avatar Jan 10 '24 09:01 piotrekkr

It doesn't fix it, per se, more-or-less just reduces the probability of a crash.

Edit to add: the best way to prevent a crash is to just not output anything at all inside a fiber.

withinboredom avatar Jan 10 '24 09:01 withinboredom

I've just encountered this issue, and using the workaround from @withinboredom did resolve the exception. In this project the culprit seems to be the Monolog logger, as that is the only place fibers are being used.

erikfrerejean avatar Jul 04 '24 13:07 erikfrerejean

I started working on a cgo library several weeks ago to allow output from C to Go without calling Go. It's still a WIP: https://github.com/withinboredom/cgoc

There's a segfault once the number of concurrent requests gets high (due to usage of some C synchronization primitives from Go), and a memory leak, but it's pretty fast by itself (~8gbs on my machine).

I hope to have it working sometime in the next few months as a potential solution.

withinboredom avatar Jul 04 '24 14:07 withinboredom

@withinboredom IMHO the best option would be to fix the issue directly in Go!

dunglas avatar Jul 09 '24 13:07 dunglas

@dunglas I highly doubt it will ever be fixable, for very valid reasons. The reason it is failing boils down to the following:

  1. C creates a new thread
  2. C calls go_handle_request (ncgo = 1)
  3. Go calls frankenphp_execute_script (reenter C)
  4. PHP creates a fiber
  5. C calls Go (go_ub_write for example) (ncgo = 2)
  6. crash as designed

According to the CL (https://go-review.googlesource.com/c/go/+/530480) this means changing the stack for an ncgo > 1 will never be possible -- for very valid safety reasons. This was a huge part of my approach in taking over Go threads (ncgo <= 1 always).

If we can fix the ncgo issue, then we are free to muck around with the stack as much as we want.

withinboredom avatar Jul 09 '24 22:07 withinboredom

One way to fix it might be to have go_handle_request return a pointer that we can continue with (making ncgo = 0), and then continue in C to frankenphp_execute_script. If a fiber is then created and we call things like go_ub_write, ncgo == 1 and Go will simply reset the stack bounds (in theory).

withinboredom avatar Jul 09 '24 22:07 withinboredom

According to https://github.com/golang/go/issues/62130#issuecomment-1712330693, this seems fixable directly in Go for our case.

dunglas avatar Jul 09 '24 22:07 dunglas

This would work: C changes stack back

I've been tearing apart the Fiber/boost context implementation to see if I can pop the stack back to original and jump to go, then on returning, replace the stack. The only problem with this approach (and fwiw, I do have it mostly working) is that it requires assembly and I am only familiar with x86-64 assembly. We would need to write assembly for every architecture (and there are some big perf hits here).

withinboredom avatar Jul 10 '24 12:07 withinboredom

It turns out the patch to get it working is pretty darn simple.

diff --git a/src/runtime/cgocall.go b/src/runtime/cgocall.go
index 0d3cc40903..609c5dbc52 100644
--- a/src/runtime/cgocall.go
+++ b/src/runtime/cgocall.go
@@ -215,34 +215,6 @@ func cgocall(fn, arg unsafe.Pointer) int32 {
 func callbackUpdateSystemStack(mp *m, sp uintptr, signal bool) {
        g0 := mp.g0

-       inBound := sp > g0.stack.lo && sp <= g0.stack.hi
-       if mp.ncgo > 0 && !inBound {
-               // ncgo > 0 indicates that this M was in Go further up the stack
-               // (it called C and is now receiving a callback).
-               //
-               // !inBound indicates that we were called with SP outside the
-               // expected system stack bounds (C changed the stack out from
-               // under us between the cgocall and cgocallback?).
-               //
-               // It is not safe for the C call to change the stack out from
-               // under us, so throw.
-
-               // Note that this case isn't possible for signal == true, as
-               // that is always passing a new M from needm.
-
-               // Stack is bogus, but reset the bounds anyway so we can print.
-               hi := g0.stack.hi
-               lo := g0.stack.lo
-               g0.stack.hi = sp + 1024
-               g0.stack.lo = sp - 32*1024
-               g0.stackguard0 = g0.stack.lo + stackGuard
-               g0.stackguard1 = g0.stackguard0
-
-               print("M ", mp.id, " procid ", mp.procid, " runtime: cgocallback with sp=", hex(sp), " out of bounds [", hex(lo), ", ", hex(hi), "]")
-               print("\n")
-               exit(2)
-       }
-
        if !mp.isextra {
                // We allocated the stack for standard Ms. Don't replace the
                // stack bounds with estimated ones when we already initialized

It turns out, because of a few conditions, nothing fancy is required:

  1. pthread is really nice to give us proper stack bounds from the fiber
  2. we are just "popping into go" to send some data in a channel and "pop back out"
  3. we aren't jumping to/from other threads and then calling back into go from a different thread (the stack is coherent)

If we are OK with maintaining a custom version of Go forever ... then this is likely the best solution, but I highly doubt it would be accepted into Go. Note that this will probably be a very ugly crash if output is sent from a thread created by the parallel extension, because (3) above will be violated. This can probably be mitigated by marshaling the output in C to the "main" thread whenever the current thread isn't the "main" thread. This needs some further testing.
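
The mitigation mentioned here is a single-owner pattern: producers hand their output to one designated consumer instead of writing it themselves. The sketch below shows that pattern in Go rather than C, purely as an illustration. The `chunk` type and `collect` function are hypothetical and are not part of FrankenPHP or the parallel extension.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// chunk is a hypothetical unit of output tagged with its producer.
type chunk struct {
	worker int
	data   []byte
}

// collect drains ch into one buffer. Only the caller of collect ever
// touches the buffer, so producers never write output directly.
func collect(ch <-chan chunk) *bytes.Buffer {
	var out bytes.Buffer
	for c := range ch {
		out.Write(c.data)
	}
	return &out
}

func main() {
	ch := make(chan chunk)
	var wg sync.WaitGroup

	// Three producers, standing in for threads that are not the "main" one.
	for w := 0; w < 3; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			ch <- chunk{worker: id, data: []byte(fmt.Sprintf("worker %d\n", id))}
		}(w)
	}

	// Close the channel once every producer has handed off its output.
	go func() {
		wg.Wait()
		close(ch)
	}()

	// The "main" side is the only writer of the final output.
	fmt.Print(collect(ch).String())
}
```

The order in which the three lines appear is not deterministic. What the sketch guarantees is only that a single owner performs every write.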

Before I go into this further, are we ok with a custom go patch for the foreseeable future @dunglas? I will create a PR to go, arguing for this patch, but I suspect it won't be accepted.

If we are, this is what I propose:

  A. test for (3) above and verify whether any further work is required
  B. create a PR to apply the patch (might be better to just maintain a fork of Go?)
  C. create a separate PR to apply any fixes/optimizations for (A)

withinboredom avatar Jul 12 '24 17:07 withinboredom