ponyc icon indicating copy to clipboard operation
ponyc copied to clipboard

Segmentation fault when actor receives a reference to itself via a class created in a different actor.

Open Perelandric opened this issue 7 years ago • 55 comments

_EDIT:_ Skip down to https://github.com/ponylang/ponyc/issues/1118#issuecomment-238431412 to see the most reduced example of the issue.


I gutted the HTTP server down to this, which I think is a reproduction of the seg fault in #937. I kept the original type names so that they could be somewhat related back to that package if necessary. Seems to have something to do with the partial function this~answer(). At least if I interrupt anything after that assignment, it the seg fault disappeared.

interface val ResponseHandler
  fun val apply(request: Payload val, response: Payload val): Any

interface val RequestHandler
  fun val apply(request: Payload): Any


primitive Handle
  fun val apply(request: Payload) =>
    (consume request).respond(Payload)


actor _ServerConnection
  let _handler: RequestHandler

  new create(h: RequestHandler) => _handler = h

  be dispatch(request: Payload) =>
    request.handler = recover this~answer() end
    _handler(consume request)

  be answer(request: Payload val, response: Payload val) =>
    None


class iso Payload
  var handler: (ResponseHandler | None) = None

  fun iso respond(response': Payload) =>
    try
      let h = (handler) as ResponseHandler
      h(consume this, consume response')
    end


actor Main
  new create(env: Env) =>
    let t = Test

    for i in Range(0, 10_000) do
      t.do_it()
    end


actor Test
  be do_it() =>
    _ServerConnection(Handle).dispatch(Payload)

This could probably be further reduced, but I wanted to maintain at least a slight semblance to the original code... and it's the middle of the night so I'm going to :sleeping:.

Perelandric avatar Aug 08 '16 07:08 Perelandric

Here's a further reduced example:

interface val ResponseHandler
  fun val apply(request: Payload val): Any


actor _ServerConnection
  be dispatch(request: Payload) =>
    request.handler = recover this~answer() end
    try
      let h = request.handler as ResponseHandler
      h(consume request)
    end

  be answer(request: Payload val) => None


class iso Payload
  var handler: (ResponseHandler | None) = None


actor Main
  new create(env: Env) =>
    let t = Test

    for i in Range(0, 10_000) do
      t.do_it()
    end


actor Test
  be do_it() =>
    _ServerConnection.dispatch(Payload)

If I change the dispatch method to this, the seg fault is gone:

  be dispatch(request: Payload) =>
    let x: (ResponseHandler | None) = recover this~answer() end
    try
      let h = x as ResponseHandler
      h(consume request)
    end

but if I leave off the : (ResponseHandler | None) on the x declaration, it complains of capture violations. If I then make it recover to val, it compiles and the seg fault is again gone.

    let x = recover val this~answer() end

Or if I simply set request.handler = None after the let h assignment, the seg fault is gone.

actor _ServerConnection
  be dispatch(request: Payload) =>
    request.handler = recover this~answer() end
    try
      let h = request.handler as ResponseHandler
      request.handler = None
      h(consume request)
    end

So it seems like it has something to do with sending the _ServerConnection into its own .answer() method via the partial function that captures this and gets assigned to the Payload.

Unfortunately, adding handler = None in the respond method in net/http/payload.pony does not fix it. :confused:

Perelandric avatar Aug 08 '16 15:08 Perelandric

Given that last bit of info, I can reduce it just a bit more:

use "collections"

interface val ResponseHandler
  fun val apply(request: Payload val): Any


actor _ServerConnection
  be dispatch(request: Payload) =>
    request.handler = recover this~answer() end
    this.answer(consume request)

  be answer(request: Payload val) => None


class iso Payload
  var handler: (ResponseHandler | None) = None


actor Main
  new create(env: Env) =>
    let t = Test

    for i in Range(0, 10_000) do
      t.do_it()
    end


actor Test
  be do_it() =>
    _ServerConnection.dispatch(Payload)

Note that if I create the Payload directly in dispatch instead of passing it from do_it(), we don't get the seg fault. Same if I eliminate _ServerConnection and put all its methods right in Test.

I tried to see if I could create the seg fault by passing the partial function directly to answer instead of via Payload, but wasn't able to get the type quite right for the parameter.

Perelandric avatar Aug 08 '16 15:08 Perelandric

We really don't need the partial function at all. The same seg fault can be found by having Payload directly reference the _ServerConnection object.

actor _ServerConnection
  be dispatch(request: Payload) =>
    request.handler = recover this end
    this.answer(consume request)

  be answer(handler: Payload val) => None


class iso Payload
  var handler: (_ServerConnection | None) = None

Perelandric avatar Aug 08 '16 16:08 Perelandric

So this last 8 line example is enough on its own to cause a segfault? Nicely narrowing of the issue down @Perelandric

SeanTAllen avatar Aug 09 '16 00:08 SeanTAllen

@SeanTAllen Thanks, and yes, with the Main and Test actors from the example above it, it should reproduce the seg fault. I'll put a full version of the smallest example in this comment.

It'll be interesting to see if this actually fixes the httpserver example. There was some other odd behavior that I noted in issue #937 which is where my investigation began, but this is certainly a piece of the puzzle if not the entire thing.


Here's the most simplified yet complete code that I could find to reproduce the issue:

actor Main
  new create(env: Env) =>
    let t = Test

    for i in Range(0, 10_000) do
      t.do_it()
    end


actor Test
  be do_it() => _ServerConnection.dispatch(Payload)


class iso Payload
  var handler: (_ServerConnection | None) = None


actor _ServerConnection
  be dispatch(p: Payload) =>
    p.handler = recover this end
    this.answer(consume p) // <-- Pass the Payload that references this _ServerConnection

  be answer(p: Payload val) => None

Notes:

  • In the more complete examples, the _ServerConnection is assigned to the Payload via a partial function binding instead of being assigned directly.
  • If the Payload object is created directly in dispatch() instead of at the call location in do_it(), there's no seg fault.
  • If you assign None to the .handler field instead of the _ServerConnection object, there's no seg fault.
  • The seg fault is intermittent. It's more likely to happen using a Range with 10_000, than with 100.

Perelandric avatar Aug 09 '16 01:08 Perelandric

Have you tried running a debug version in LLDB to see what is causing the segfault?

SeanTAllen avatar Aug 09 '16 02:08 SeanTAllen

@SeanTAllen Never used LLDB but this'll be as good a reason as any to tinker. I'll see if I can give it a shot tomorrow.


EDIT: Output from my first, mostly blind use of LLDB

(lldb) target create "./test"
Current executable set to './test' (x86_64).
(lldb) run
Process 18472 launched: './test' (x86_64)
Process 18472 exited with status = 0 (0x00000000) 
(lldb) run
Process 18485 launched: './test' (x86_64)
Process 18485 stopped
* thread #2: tid = 18490, 0x0000000000408063 test`pony_traceunknown + 3, name = 'test', stop reason = signal SIGSEGV: invalid address (fault address: 0x40)
    frame #0: 0x0000000000408063 test`pony_traceunknown + 3
test`pony_traceunknown:
->  0x408063 <+3>:  cmpq   $0x0, 0x40(%rax)
    0x408068 <+8>:  je     0x408070                  ; <+16>
    0x40806a <+10>: movq   0x18(%rdi), %rax
    0x40806e <+14>: jmpq   *%rax
(lldb) bt
* thread #2: tid = 18490, 0x0000000000408063 test`pony_traceunknown + 3, name = 'test', stop reason = signal SIGSEGV: invalid address (fault address: 0x40)
  * frame #0: 0x0000000000408063 test`pony_traceunknown + 3
    frame #1: 0x00000000004020ef test`Payload_Trace + 15
    frame #2: 0x000000000040e360 test`ponyint_gc_handlestack + 64
    frame #3: 0x0000000000407fa0 test`ponyint_mark_done + 32
    frame #4: 0x0000000000409304 test`ponyint_actor_run + 772
    frame #5: 0x0000000000405b85 test`run + 69
    frame #6: 0x0000000000405edd test`run_thread + 29
    frame #7: 0x00007ffff7bc4184 libpthread.so.0`start_thread + 196
    frame #8: 0x00007ffff74d737d libc.so.6`clone + 109
(lldb) 

Had trouble setting breakpoints.

Perelandric avatar Aug 09 '16 02:08 Perelandric

@SeanTAllen One more observation: It seems if the parameter of the answer() method is tag or iso, there's no segfault. Only a problem with val.

// -------------------v--- iso and tag are OK, but not val
be answer(p: Payload val) => None

Perelandric avatar Aug 24 '16 23:08 Perelandric

Just out of curiosity, is there any news on this one? After watching the VUG about garbage collection I wondered if the error should be looked for in that area. Just a wild guess.

peterzandbergen avatar Sep 16 '16 12:09 peterzandbergen

Nothing yet.

SeanTAllen avatar Sep 16 '16 13:09 SeanTAllen

Difficult one I guess

peterzandbergen avatar Sep 16 '16 14:09 peterzandbergen

And we could use more folks contributing. Its hard to keep up and do the day job and all that. Such is things... ¯_(ツ)_/¯

SeanTAllen avatar Sep 16 '16 14:09 SeanTAllen

Hrm ... I just tried @Perelandric's minimal example from August 8 and I didn't get a seg fault. I tried both a debug and release build on OS X (10.11.4) using compiler version 0.3.0-14-gd8c6c55.

aturley avatar Sep 16 '16 18:09 aturley

@perelandric can you see if this still happens for you?

SeanTAllen avatar Sep 16 '16 18:09 SeanTAllen

@SeanTAllen Yep, took a few extra runs before it would segfault, but it got there.

@aturley Try upping the range from 10_000 to 100_000. That gives me the segfault more reliably.

Perelandric avatar Sep 16 '16 19:09 Perelandric

I got a little more information from lldb:

(lldb) run
Process 12481 launched: './build/issue-1118' (x86_64)
Process 12481 stopped
* thread #2: tid = 0x3422ef, 0x000000010000b862 issue-1118`pony_traceunknown(ctx=0x00000001097ffc48, p=0x00000001217dd000, m=2) + 34 at trace.c:114, stop reason = EXC_BAD_ACCESS (code=1, address=0x40)
    frame #0: 0x000000010000b862 issue-1118`pony_traceunknown(ctx=0x00000001097ffc48, p=0x00000001217dd000, m=2) + 34 at trace.c:114
   111  {
   112    pony_type_t* t = *(pony_type_t**)p;
   113 
-> 114    if(t->dispatch != NULL)
   115    {
   116      ctx->trace_actor(ctx, (pony_actor_t*)p);
   117    } else {

Something is going a little sideways here in GC land.

aturley avatar Sep 19 '16 02:09 aturley

Yeah. At the moment, this is waiting on @sylvanc or I having time to address. At the moment, I'm working on new website before returning back to the land of pony code. I'm not going to get to this til sometime in mid October. If anyone wants to take a shot at it before then, have at it.

SeanTAllen avatar Sep 19 '16 02:09 SeanTAllen

Hi, seems like by a happy coincidence it's just a simple typo.

-  pony_type_t* t = *(pony_type_t**)p;
+  pony_type_t* t = (pony_type_t*)p;

After a quick fix all is fine. Here is the cure https://github.com/ponylang/ponyc/pull/1249

dimiii avatar Sep 22 '16 10:09 dimiii

@dimiii all CI jobs failed with a segfault with that change applied. I left some questions for you on #1249.

SeanTAllen avatar Sep 22 '16 11:09 SeanTAllen

It looks like this is an uninitialised type descriptor pointer problem. I'll try digging into it.

Praetonus avatar Sep 22 '16 13:09 Praetonus

I think I got it. This is a premature free issue.

Full tracing of val objects sent in messages is deferred until a GC cycle on the owner. This means the _ServerConnection assigned to the Payload and then sent in a message (via the Payload) has a reference count greater than 0 but isn't aware of it until the Payload is traced. By then, the _ServerConnection might have been GC'd, which of course results in bad stuff happening.

Praetonus avatar Sep 22 '16 14:09 Praetonus

@Praetonus @SeanTAllen

BTW, Do you have plans to bootstrap the compiler? I mean the compiler and runtime.

dimiii avatar Sep 22 '16 15:09 dimiii

@dimiii - Building a Pony compiler in Pony is a long-term goal, though it represents a major amount of work. There are no plans to write the runtime in Pony, as it is built on lower-level constructs that would be impractical to provide in Pony.

jemc avatar Sep 22 '16 17:09 jemc

@dimiii re: the runtime point @jemc from a safety standpoint, a Pony runtime written in something like Rust would be awesome. However, C is far more accessible to the community and as such, the plans are to stay with C for the forseeable future for the runtime.

SeanTAllen avatar Sep 22 '16 17:09 SeanTAllen

@Praetonus should this be relabeled from "ready for work"?

SeanTAllen avatar Sep 28 '16 00:09 SeanTAllen

I'm not sure. I can see a fix for this but I think it would have some performance penalty. Maybe removing it until we've discussed it on the sync call would make sense.

Praetonus avatar Sep 28 '16 00:09 Praetonus

Stability should be more important, imho. I would not advocate using a tool for server applications if it could occasionally crash.

peterzandbergen avatar Sep 28 '16 05:09 peterzandbergen

Especially when the design of the language is to eliminate crashes caused by the programmer.

peterzandbergen avatar Sep 28 '16 05:09 peterzandbergen

no one is suggesting not fixing. the concern is, can we find a way to fix without having a performance impact.

SeanTAllen avatar Sep 28 '16 10:09 SeanTAllen

@peterzandbergen - it definitely still needs to be fixed, and it's still marked as high priority. Runtime segmentation faults are not acceptable .

jemc avatar Sep 28 '16 15:09 jemc

Ok, issue located. @sylvanc will be working on this. Thank you everyone who contributed to finding this. Extra props to @Perelandric for lots of narrowing down a minimal case and @Praetonus for digging into the code.

SeanTAllen avatar Sep 28 '16 20:09 SeanTAllen

Yes, thank you VERY much! This is an important and subtle and tricky one.

sylvanc avatar Sep 28 '16 20:09 sylvanc

Well done, never doubted that this would not be picked up in a serious way. Kudos

peterzandbergen avatar Sep 28 '16 20:09 peterzandbergen

I'm trying to verify that #1507 fixes this issue but I'm unable to reproduce the reported problem.

OSX El Capitan. LLVM 3.9.1

@Praetonus @Perelandric @aturley can any of you reproduce?

SeanTAllen avatar Jan 10 '17 03:01 SeanTAllen

Both the original code and the most simplified code here give me "can't find declaration of 'Range'". Why?

jkleiser avatar Jan 10 '17 11:01 jkleiser

@jkleiser You have to add use "collections" at the beginning of the code.

Praetonus avatar Jan 10 '17 16:01 Praetonus

Thanks @Praetonus . The simplified code now got compiled and ran without problems on my Mac with ponyc 0.10.0-74555f7.

jkleiser avatar Jan 10 '17 19:01 jkleiser

The simplified code works on my Ubuntu machine. Both optimized and with ponyc --debug.

0.10.0-1c33065 [release] compiled with: llvm 3.9.0 -- gcc-5 (Ubuntu 5.4.1-2ubuntu1~14.04) 5.4.1 20160904

However, when I run the original code I get Segmentation fault (core dumped)

moesol avatar Jan 23 '17 06:01 moesol

I have noticed that the issue did not occur in my environment anymore, being Ubuntu Mate in virtual box. Will test again this week to seeing if the problem has been resolved.

Op ma 23 jan. 2017 07:54 schreef Robert Hastings [email protected]:

The simplified code works on my Ubuntu machine. Both optimized and with ponyc --debug.

0.10.0-1c33065 [release] compiled with: llvm 3.9.0 -- gcc-5 (Ubuntu 5.4.1-2ubuntu1~14.04) 5.4.1 20160904

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ponylang/ponyc/issues/1118#issuecomment-274413255, or mute the thread https://github.com/notifications/unsubscribe-auth/AI2XrpZfLdrKh8_toJs6Mh3PLyCqZ6Vhks5rVE6_gaJpZM4JeyOG .

--

Peter Zandbergen Tel: +31 6 460 55 872

linkedIn: peterzandbergen https://www.linkedin.com/in/peterzandbergen

peterzandbergen avatar Jan 23 '17 07:01 peterzandbergen

@peterzandbergen See the discussion at: https://github.com/ponylang/ponyc/pull/1507

Perelandric avatar Jan 23 '17 22:01 Perelandric

Just an update. @sylvanc is still working on a fix for this.

SeanTAllen avatar Apr 29 '17 22:04 SeanTAllen

Original version still segfaults as of 0.21.0

SeanTAllen avatar Dec 21 '17 01:12 SeanTAllen

The original version of this segfaults if I add some @printf output that slows everything down some. But, only when doing an optimized build.

0.32.0-a7a80758 [release] compiled with: llvm 7.0.1 -- cc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0 Defaults: pic=true

SeanTAllen avatar Dec 13 '19 19:12 SeanTAllen

The original version of this segfaults for me but none of the others. I looked at it for a while and I'm somewhat confused. I hope to be able to spend more time on it and maybe get some more eyes on it.

SeanTAllen avatar Jan 27 '22 05:01 SeanTAllen

The original version that segfaults doesn't if the usage of Payload val is changed to Payload so that an iso never becomes a val. There's still a bug there.

works:

use "collections"

interface val ResponseHandler
  fun val apply(request: Payload, response: Payload): Any

interface val RequestHandler
  fun val apply(request: Payload): Any


primitive Handle
  fun val apply(request: Payload) =>
    (consume request).respond(Payload)


actor _ServerConnection
  let _handler: RequestHandler

  new create(h: RequestHandler) => _handler = h

 be dispatch(request: Payload) =>
    request.handler = recover this~answer() end
    _handler(consume request)

  be answer(request: Payload, response: Payload) =>
    None


class iso Payload
  var handler: (ResponseHandler | None) = None

  fun iso respond(response': Payload) =>
    try
      let h = (handler) as ResponseHandler
      h(consume this, consume response')
    end


actor Main
  new create(env: Env) =>
    let t = Test

    for i in Range(0, 10_000) do
      t.do_it()
    end


actor Test
  be do_it() =>
    _ServerConnection(Handle).dispatch(Payload)

This probably means that Benoit's statement here:

I think I got it. This is a premature free issue.

Full tracing of val objects sent in messages is deferred until a GC cycle on the owner. This means the >_ServerConnection assigned to the Payload and then sent in a message (via the Payload) has a reference count >greater than 0 but isn't aware of it until the Payload is traced. By then, the _ServerConnection might have been >GC'd, which of course results in bad stuff happening.

might still be true.

When I look at it in the debugger, it sure seems like what we are pointing at in trace_unknown is now junk. That said, I'm not positive yet.

SeanTAllen avatar Jan 27 '22 05:01 SeanTAllen

Here's an updated example for what still fails:

use "collections"

interface val ResponseHandler
  fun val apply(request: Payload val, response: Payload): Any

interface val RequestHandler
  fun val apply(request: Payload): Any


primitive Handle
  fun val apply(request: Payload) =>
    (consume request).respond(Payload)


actor _ServerConnection
  let _handler: RequestHandler

  new create(h: RequestHandler) => _handler = h

 be dispatch(request: Payload) =>
    request.handler = recover this~answer() end
    _handler(consume request)

  be answer(request: Payload val, response: Payload) =>
    None


class iso Payload
  var handler: (ResponseHandler | None) = None

  fun iso respond(response': Payload) =>
    try
      let h = (handler) as ResponseHandler
      h(consume this, consume response')
    end


actor Main
  new create(env: Env) =>
    let t = Test

    for i in Range(0, 1_000_000) do
      t.do_it()
    end


actor Test
  be do_it() =>
    _ServerConnection(Handle).dispatch(Payload)

The important part of this is that the request object for ResponseHandler is going from an iso when sent to a val on receive. If it isn't a val on receive then all is good.

I'm not sure exactly where the bug is however, an important difference introduced here is that in recv_remote_object we do not recurse the object if it is immutable. IE this code:

  if(!obj->immutable)
    recurse(ctx, p, t->trace);

is executed when Payload is also an iso and then we have no crash.

SeanTAllen avatar Jan 27 '22 18:01 SeanTAllen

Here's the backtrace:

* thread #8, name = '1118-1', stop reason = signal SIGSEGV: invalid address (fault address: 0x58)
  * frame #0: 0x00000000004158ab 1118-1`pony_traceunknown(ctx=0x00007ffff7c22108, p=0x00007fffde3f1400, m=1) at trace.c:127:18
    frame #1: 0x000000000041360e 1118-1`ponyint_gc_handlestack(ctx=0x00007ffff7c22108) at gc.c:677:5
    frame #2: 0x0000000000415629 1118-1`ponyint_mark_done(ctx=0x00007ffff7c22108) at trace.c:75:3
    frame #3: 0x000000000040c415 1118-1`try_gc(ctx=0x00007ffff7c22108, actor=0x00007fffe6c20800) at actor.c:301:3
    frame #4: 0x000000000040ba04 1118-1`ponyint_actor_run(ctx=0x00007ffff7c22108, actor=0x00007fffe6c20800, polling=false) at actor.c:373:7
    frame #5: 0x000000000041a9ba 1118-1`run(sched=0x00007ffff7c220c0) at scheduler.c:960:23
    frame #6: 0x0000000000419ed3 1118-1`run_thread(arg=0x00007ffff7c220c0) at scheduler.c:1010:3
    frame #7: 0x00007ffff7fa8609 libpthread.so.0`start_thread + 217
    frame #8: 0x00007ffff7d48293 libc.so.6`__clone + 67

In pony_traceunkown the first thing we do is:

pony_type_t* t = *(pony_type_t**)p;

And in the debugger when trying to see what t is pointing at we get:

(lldb) p t
(pony_type_t *) $3 = 0x0000000000000010
(lldb) p *t
error: Couldn't apply expression side effects : Couldn't dematerialize a result variable: couldn't read its memory

i.e. that sure looks like junk by that point.

SeanTAllen avatar Jan 27 '22 18:01 SeanTAllen

I've been playing around with this and sometimes, I get this assert on line 755 of actor.c to trigger:

  // Make sure we're not trying to send a message to an actor that is about
  // to be destroyed.
  pony_assert(!ponyint_actor_pendingdestroy(to));

I also sometimes get a crash in the atomics for sending a message itself.

I've been using a debug build of runtime and program with --ponygcinitial=1 and --ponygcfactor=1

SeanTAllen avatar Feb 06 '22 19:02 SeanTAllen

At least once, the kaboom message was a GC - acquire message

* thread #4, name = '1118', stop reason = signal SIGABRT
  * frame #0: 0x00007ffff7c6c18b libc.so.6`raise + 203
    frame #1: 0x00007ffff7c4b859 libc.so.6`abort + 299
    frame #2: 0x000000000041d47a 1118`ponyint_assert_fail(expr="!ponyint_actor_pendingdestroy(to)", file="/home/sean/code/ponylang/ponyc/src/libponyrt/actor/actor.c", line=762, func="pony_sendv") at ponyassert.c:65:3
    frame #3: 0x0000000000410d2b 1118`pony_sendv(ctx=0x00007ffff7c21a08, to=0x00007ffff7c20a00, first=0x00007fffde41f0a0, last=0x00007fffde41f0a0, has_app_msg=false) at actor.c:762:3
    frame #4: 0x000000000041121b 1118`pony_sendp(ctx=0x00007ffff7c21a08, to=0x00007ffff7c20a00, id=4294967291, p=0x00007fffde41ec40) at actor.c:889:3
    frame #5: 0x0000000000417c9a 1118`ponyint_gc_sendacquire(ctx=0x00007ffff7c21a08) at gc.c:797:5

SeanTAllen avatar Feb 06 '22 19:02 SeanTAllen

Ok so the "source" of our problem is the large block of code here:

https://github.com/ponylang/ponyc/blob/main/src/libponyrt/actor/actor.c#L444

Where we see the following:

  • an actor has completed its turn on the scheduler and it has an empty message queue
  • the actor has reference count of 0 so it is ok to delete

except, there's still a reference to it in a message and when the receiver sends an acquire message back, we go kaboom if we have already deleted ourselves or otherwise marked ourselves for deletion (which can happen in the debug case).

We apparently do not see the reference to ourself when we send (we probably aren't tracing?) and then bad things. Given that changing the receiver to an iso from a val suggests the general path that we are missing noticing that it is us.

More to look into but a start.

Commenting out the entire block that starts at line 444 of actor.c "fixes" this crash, but it does create other issues with memory growth etc that the local delete is meant to fix.

SeanTAllen avatar Feb 06 '22 20:02 SeanTAllen

Ok so from what I see...

The actor is sending a message to another actor but when it hits the end of its run with an empty queue, it sees it's own rc as 0 and so goes through the early delete and o o, when the other actor tries to send back a gc message for having taken possession of the items in the message. Kaboom.

So the question is, why on this particular send is our rc 0 instead of being higher as it should be.

The logic in question to "self delete" starts here: https://github.com/ponylang/ponyc/blob/main/src/libponyrt/actor/actor.c#L444

SeanTAllen avatar Feb 20 '22 00:02 SeanTAllen

Here's a more minimal repro which segfaults without using partial application (it uses an explicit class val to hold the actor) and a bit less of the bouncing around between different functions (the dispatch behavior directly sends the message that goes boom instead of asking the Payload class to do it):


use "collections"

actor _BoomActor
  be dispatch(request: Payload) =>
    request.holder = _BoomActorHolder(this)
    boom_behavior(consume request, Payload)

  be boom_behavior(request: Payload val, response: Payload val) =>
    None

class val _BoomActorHolder
  let boom_actor: _BoomActor
  new val create(boom_actor': _BoomActor) => boom_actor = boom_actor'

class iso Payload
  var holder: (_BoomActorHolder | None) = None

actor Main
  new create(env: Env) =>
    let t = Test

    for i in Range(0, 1_000_000) do
      t.do_it()
    end

actor Test
  be do_it() =>
    _BoomActor.dispatch(Payload)

jemc avatar Apr 26 '22 21:04 jemc

I have some ideas about what is going on here but need to discuss further with Sean.

jemc avatar Apr 26 '22 21:04 jemc

@jemc and I have a plan to address. There will be a performance impact, but correctness trumps performance. Joe will also write up a mitigation that can help offset some performance impact.

SeanTAllen avatar Apr 29 '22 19:04 SeanTAllen

To summarize what we discussed in the Zulip thread:

The current Pony runtime has a correctness bug due to what is usually a valid optimization, but in this case is not.

Specifically, the Pony runtime traces immutable (val) objects shallowly - that is, it skips tracing of fields within such objects. This saves time by reducing how much tracing has to happen, and it is described as a safe optimization in the ORCA paper, because the "outer" val object acts as an upper bound on the lifetime of the "inner" objects referred to by its fields.

While that optimization is safe within the limited scope of what was considered in the ORCA paper, the reasoning ignores the counting of actor references (which was outside the scope of the ORCA paper).

If a val object has references (either directly as its fields, or transitively as fields of its fields) to any actors, those actors need to be traced. Hence, for such an object we cannot keep this optimization in place.

But for val objects which are known via static analysis to not possibly refer to any actors, this optimization is safe and we'd like to keep it in place if possible, to keep the part of the benefit of this optimization for some workloads.

As such, we want to add a new kind of static analysis to the compiler that can classify any given data type as "definitely contains no actor references" or "may possibly contain an actor reference". If we can mark an type with the internal designation contains_no_actors, then it is valid for that type to participate in the above mentioned optimization, and the compiler should generate a trace function for that type which uses the optimized path when immutable. Otherwise, it would need to take a new pessimistic path for the sake of correctness, tracing it at runtime so that any actors it may contain are traced.

To determine if a type should be marked as contains_no_actors:

  • If any field type is an actor type, or a composite type (tuple, union, intersection) referring to an actor type => return false.

  • If any field type refers (possibly within a composite type) to a type which is not marked contains_no_actors => return false.

    • This implies we need to be able to do this analysis recursively, and it also implies we need a mechanism to break self-recursion.
    • We can likely use a similar mechanism to that used in subtype checking; that is, we can have an "assumptions list" for things that we are temporarily assuming to be marked contains_no_actors, and every time we recurse into a type we push it onto that list, such that we will surely terminate and no type will mark itself as contains_no_actors without some cause which is not itself.
  • If the type under consideration is an abstract type (such as an interface or trait), and reachability analysis shows that the abstract type has in the reachable program subsumed any type which is not marked contains_no_actors => return false.

    • This also implies recursive analysis - this time recursing into subsumed types instead of into fields.
  • Otherwise, the type has been shown to not possibly contain any actors => return true.

jemc avatar Jun 29 '22 20:06 jemc