node-fibers icon indicating copy to clipboard operation
node-fibers copied to clipboard

Error: This Fiber is a zombie

Open glasser opened this issue 10 years ago • 13 comments

If I'm able to catch an exception with the text Error: this Fiber is a zombie, then that's a fibers bug, right? Not user error?

Working on finding a minimal reproduction, but while we're working on that, curious if we're correct that this is definitely a Fibers bug... (In 1.0.0; going to try 1.0.1 next.)

glasser avatar Jul 16 '13 00:07 glasser

OK, I figured out what we're doing.

We're calling Fiber.yield in a fiber that we don't keep any reference to. So it eventually gets GCd. This causes stack unwinding, which throws from the Fiber.yield. But our Fiber.yield is wrapped (very indirectly) in a try/finally. And the finally clause itself has a f.wait in it.

The effect for us (on OSX at least) is that the main fiber actually gets re-run from the top!

Here's a reproduction:

var Fiber = require('fibers');

// Run this every so often so that GC and DestroyOrphans happens.
setInterval(function () {
  Fiber(function () { global.gc() }).run();
}, 1000);

Fiber(function () {
  console.log("TOP");
  try {
    Fiber.yield();
  } finally {
    console.log("finally");
    var f = Fiber.current;
    setTimeout(function () {
      console.log("re-running");
      f.run();
      console.log("re-ran");
    }, 10);
    console.log("yielding");
    Fiber.yield();
    console.log("yeld");
  }
}).run();
console.log("finishing main program");

For me, with Node 0.8.24 and Fibers 1.0.0, this prints:

$ node --expose-gc zombie.js 
TOP
finishing main program
finally
yielding
re-running
TOP
re-ran
finally
yielding
re-running
TOP
re-ran
finally
yielding
re-running
TOP
re-ran
finally
yielding
re-running
TOP
re-ran

and so forth.

glasser avatar Jul 16 '13 00:07 glasser

So I guess the short answer is, I shouldn't be letting a fiber that I care about get GCed. And in fact, the strategy of using try { Fiber.yield } finally { cleanup which involves waiting } is problematic.

But it certainly was surprising that the fiber got re-run from the top!

glasser avatar Jul 16 '13 00:07 glasser

Also occurs with Node 0.10.13, Fibers 1.0.1.

glasser avatar Jul 16 '13 01:07 glasser

At the very least, I think doing this should cause your code to crash, not re-run the fiber from the top. We would have figured out what was going on much earlier if that had happened.

glasser avatar Jul 16 '13 01:07 glasser

Ah, OK. The second call to Fiber.yield immediately re-throws the "zombie" exception, so by the time the callback runs, the fiber has completely unwound. And you can run fibers multiple times... and that's what the f.run() does.

Now, for me, the f.run() was actually hidden inside a future.return(). I think it is very surprising that future.return() (or throw) can call fiber.run on a fiber where fiber.started is false (ie, a fiber that may have been running when the wait was called but is not any more). Can this happen for any reason other than that the Future.wait got terminated by a zombie exception? If that's the only way that this can happen, can cb in Future.wait check fiber.started and exit with a message otherwise?

glasser avatar Jul 16 '13 05:07 glasser

So I guess the short answer is, I shouldn't be letting a fiber that I care about get GCed. And in fact, the strategy of using try { Fiber.yield } finally { cleanup which involves waiting } is problematic.

Actually if you grab a reference to the fiber in the finally it should interrupt the unrolling process and you'll be just fine.

But it certainly was surprising that the fiber got re-run from the top!

Yeah the overloading of run() to both start and a resume a fiber was a questionable design decision in retrospect. I think the main issue that needs to be addressed here is the fact that future.return() can re-run a thrown fiber future (if I'm understanding correctly). I will have to look into this in more detail for sure.

laverdet avatar Jul 17 '13 00:07 laverdet

Right, I agree that grabbing a reference to the Fiber in finally kept it from being GCed.

I don't think I was actually using FiberFuture, if that's what you meant. But yeah, I think the main issue was that a future.return() re-ran a fiber which had been waiting on that future but which got unwound. (Or well, a fiber which tried to wait on that future but when it called yield, it hit the "I'm trying to unwind, immediately rethrow zombie" check.)

glasser avatar Jul 17 '13 00:07 glasser

Hi @glasser and @laverdet,

I'm curious if you discovered a trick to prevent the zombie fiber issue?

This issue is periodically taking down our production app (not meteor). It does seem correlated to garbage collection, but I'm uncertain of the exact relationship.

I'm starting to dig into the fibers source to better understand what causes the zombie condition. If garbage collection is the source of the problem, I'm unclear why the fiber, or associated function, would be garbage collected, as it looks like the function is stored in a Persistent<Value> in fibers.cc which if I understand correctly, takes it out of garbage collection. And I'm also unclear on why storing a local variable to the fiber would somehow prevent this, since the local variable would be not reachable after a function returns, unless we store the fiber in a global structure of some sort.

I appreciate any insights, whatever they might be, as we begin our debug hunt!

Chris

Could this code cause a problem if the fiber isn't reachable on the heap?

function handleSomeRequest() {
  Fiber(function() { ... }).run();
}

cmather avatar Apr 05 '19 15:04 cmather

If the fiber isn't accessible on the heap then that means there is no way of the fiber ever waking up, because you can't ever call run on it again. In that case node-fibers will eventually wake up the fiber and force it to unwind by making Fiber.yield always throw. If you have a catch somewhere it would catch this exception.

Edit: There is nothing wrong with Fiber(function() { ... }).run(); as long as you do something with Fiber.current from inside the fiber. That's a pretty common pattern, actually. But if the fiber is waiting on an event that never comes it will eventually need to be unwound.

laverdet avatar Apr 05 '19 17:04 laverdet

Edit: Not so fast! Still debugging. Will post conclusions when we have them.

We resolved this issue, thanks to the clues you guys provided above. For posterity, in case it helps anyone else:

Reproduction:

> node --expose-gc index.js
var Fiber = require("fibers");

// Run this every so often so that GC and DestroyOrphans happens.
setInterval(function() {
  console.log("Running GC!");
  Fiber(function() {
    global.gc();
  }).run();
}, 1000);

Fiber(function() {
  try {
    Fiber.yield();
  } catch (e) {
    console.log(e);
  }
}).run();

In the C++ class backing Fiber, the fiber is stored in a Persistent<Object> handle which removes the object from the garbage collection process. However, in the constructor function for Fiber the SetWeak method is called on the handle. This registers a callback function that will be called when the v8 garbage collector runs when there are no more references to the object (other than the weak handle). This callback function then adds the fiber to the orphaned fibers array where it is turned into a zombie next time any fiber runs. Without a try/catch block, this zombie error seems to be swallowed.

The fix, as @glasser mentions above, is to add a local reference to the fiber. But the trick is to store this reference inside the fiber callback function itself. Otherwise we'd have to somehow keep track when the fiber has finished running (a meta gc!).

Fiber(function() {
  // add a ref to the current fiber, tricking the v8 gc so that the weak handle callback isn't called.
  var f = Fiber.current;
  try {
    Fiber.yield();
  } catch (e) {
    console.log(e);
  }
}).run();

cmather avatar Apr 05 '19 18:04 cmather

I must warn you that your fix here replaces an error with a memory leak.

laverdet avatar Apr 05 '19 18:04 laverdet

@laverdet - Thanks for your response and this extra hint. Your explanation makes sense. It looks like we must still be missing something in our app. If we didn't have a handle to the fiber somewhere then yeah we wouldn't be able to call run() on it to resume.

Edit: I hadn't seen your response before I wrote mine.

cmather avatar Apr 05 '19 18:04 cmather

For anyone else reaching this issue from Google and worrying that something is wrong with node-fibers:

It may very well be a bug in your code or another library where a callback is never invoked which that fibre is waiting on.

I had this error as well, "This Fiber is a zombie."

It turned out to be because of a bug in my application code.

In my case, I had implemented a re-entrant lock (mutex) which allows first caller to acquire lock, then requires successive calls to wait until the first caller calls release(). The problem ended up being that my rather hastily written lock implementation only stored the first callback for waiting (blocked) fibres! In fact, it wasn't even storing a reference to successive waiting (yielded) fibres.

So, naturally the garbage collector is GCing the fibre as there are no references to it, in my case.

If you are dealing with a web application, look for requests which are timing out or finishing with an error. (if you hold a reference to each fibre created, I found the request affected by the bug will never complete -- until the heap is exhausted!)

Hope this helps someone stop wasting time like I did worrying about a bug in node-fibers before checking their own code.

ailabs-software avatar Jul 29 '20 21:07 ailabs-software