ponyc
ponyc copied to clipboard
Segmentation fault when actor receives a reference to itself via a class created in a different actor.
_EDIT:_ Skip down to https://github.com/ponylang/ponyc/issues/1118#issuecomment-238431412 to see the most reduced example of the issue.
I gutted the HTTP server down to this, which I think is a reproduction of the seg fault in #937. I kept the original type names so that they could be somewhat related back to that package if necessary. Seems to have something to do with the partial function this~answer()
. At least if I interrupt anything after that assignment, it the seg fault disappeared.
interface val ResponseHandler
fun val apply(request: Payload val, response: Payload val): Any
interface val RequestHandler
fun val apply(request: Payload): Any
primitive Handle
fun val apply(request: Payload) =>
(consume request).respond(Payload)
actor _ServerConnection
let _handler: RequestHandler
new create(h: RequestHandler) => _handler = h
be dispatch(request: Payload) =>
request.handler = recover this~answer() end
_handler(consume request)
be answer(request: Payload val, response: Payload val) =>
None
class iso Payload
var handler: (ResponseHandler | None) = None
fun iso respond(response': Payload) =>
try
let h = (handler) as ResponseHandler
h(consume this, consume response')
end
actor Main
new create(env: Env) =>
let t = Test
for i in Range(0, 10_000) do
t.do_it()
end
actor Test
be do_it() =>
_ServerConnection(Handle).dispatch(Payload)
This could probably be further reduced, but I wanted to maintain at least a slight semblance to the original code... and it's the middle of the night so I'm going to :sleeping:.
Here's a further reduced example:
interface val ResponseHandler
fun val apply(request: Payload val): Any
actor _ServerConnection
be dispatch(request: Payload) =>
request.handler = recover this~answer() end
try
let h = request.handler as ResponseHandler
h(consume request)
end
be answer(request: Payload val) => None
class iso Payload
var handler: (ResponseHandler | None) = None
actor Main
new create(env: Env) =>
let t = Test
for i in Range(0, 10_000) do
t.do_it()
end
actor Test
be do_it() =>
_ServerConnection.dispatch(Payload)
If I change the dispatch
method to this, the seg fault is gone:
be dispatch(request: Payload) =>
let x: (ResponseHandler | None) = recover this~answer() end
try
let h = x as ResponseHandler
h(consume request)
end
but if I leave off the : (ResponseHandler | None)
on the x
declaration, it complains of capture violations. If I then make it recover to val
, it compiles and the seg fault is again gone.
let x = recover val this~answer() end
Or if I simply set request.handler = None
after the let h
assignment, the seg fault is gone.
actor _ServerConnection
be dispatch(request: Payload) =>
request.handler = recover this~answer() end
try
let h = request.handler as ResponseHandler
request.handler = None
h(consume request)
end
So it seems like it has something to do with sending the _ServerConnection
into its own .answer()
method via the partial function that captures this
and gets assigned to the Payload
.
Unfortunately, adding handler = None
in the respond
method in net/http/payload.pony
does not fix it. :confused:
Given that last bit of info, I can reduce it just a bit more:
use "collections"
interface val ResponseHandler
fun val apply(request: Payload val): Any
actor _ServerConnection
be dispatch(request: Payload) =>
request.handler = recover this~answer() end
this.answer(consume request)
be answer(request: Payload val) => None
class iso Payload
var handler: (ResponseHandler | None) = None
actor Main
new create(env: Env) =>
let t = Test
for i in Range(0, 10_000) do
t.do_it()
end
actor Test
be do_it() =>
_ServerConnection.dispatch(Payload)
Note that if I create the Payload
directly in dispatch
instead of passing it from do_it()
, we don't get the seg fault. Same if I eliminate _ServerConnection
and put all its methods right in Test
.
I tried to see if I could create the seg fault by passing the partial function directly to answer
instead of via Payload
, but wasn't able to get the type quite right for the parameter.
We really don't need the partial function at all. The same seg fault can be found by having Payload
directly reference the _ServerConnection
object.
actor _ServerConnection
be dispatch(request: Payload) =>
request.handler = recover this end
this.answer(consume request)
be answer(handler: Payload val) => None
class iso Payload
var handler: (_ServerConnection | None) = None
So this last 8 line example is enough on its own to cause a segfault? Nicely narrowing of the issue down @Perelandric
@SeanTAllen Thanks, and yes, with the Main
and Test
actors from the example above it, it should reproduce the seg fault. I'll put a full version of the smallest example in this comment.
It'll be interesting to see if this actually fixes the httpserver example. There was some other odd behavior that I noted in issue #937 which is where my investigation began, but this is certainly a piece of the puzzle if not the entire thing.
Here's the most simplified yet complete code that I could find to reproduce the issue:
actor Main
new create(env: Env) =>
let t = Test
for i in Range(0, 10_000) do
t.do_it()
end
actor Test
be do_it() => _ServerConnection.dispatch(Payload)
class iso Payload
var handler: (_ServerConnection | None) = None
actor _ServerConnection
be dispatch(p: Payload) =>
p.handler = recover this end
this.answer(consume p) // <-- Pass the Payload that references this _ServerConnection
be answer(p: Payload val) => None
Notes:
- In the more complete examples, the
_ServerConnection
is assigned to thePayload
via a partial function binding instead of being assigned directly. - If the
Payload
object is created directly indispatch()
instead of at the call location indo_it()
, there's no seg fault. - If you assign
None
to the.handler
field instead of the_ServerConnection
object, there's no seg fault. - The seg fault is intermittent. It's more likely to happen using a
Range
with10_000
, than with100
.
Have you tried running a debug version in LLDB to see what is causing the segfault?
@SeanTAllen Never used LLDB but this'll be as good a reason as any to tinker. I'll see if I can give it a shot tomorrow.
EDIT: Output from my first, mostly blind use of LLDB
(lldb) target create "./test"
Current executable set to './test' (x86_64).
(lldb) run
Process 18472 launched: './test' (x86_64)
Process 18472 exited with status = 0 (0x00000000)
(lldb) run
Process 18485 launched: './test' (x86_64)
Process 18485 stopped
* thread #2: tid = 18490, 0x0000000000408063 test`pony_traceunknown + 3, name = 'test', stop reason = signal SIGSEGV: invalid address (fault address: 0x40)
frame #0: 0x0000000000408063 test`pony_traceunknown + 3
test`pony_traceunknown:
-> 0x408063 <+3>: cmpq $0x0, 0x40(%rax)
0x408068 <+8>: je 0x408070 ; <+16>
0x40806a <+10>: movq 0x18(%rdi), %rax
0x40806e <+14>: jmpq *%rax
(lldb) bt
* thread #2: tid = 18490, 0x0000000000408063 test`pony_traceunknown + 3, name = 'test', stop reason = signal SIGSEGV: invalid address (fault address: 0x40)
* frame #0: 0x0000000000408063 test`pony_traceunknown + 3
frame #1: 0x00000000004020ef test`Payload_Trace + 15
frame #2: 0x000000000040e360 test`ponyint_gc_handlestack + 64
frame #3: 0x0000000000407fa0 test`ponyint_mark_done + 32
frame #4: 0x0000000000409304 test`ponyint_actor_run + 772
frame #5: 0x0000000000405b85 test`run + 69
frame #6: 0x0000000000405edd test`run_thread + 29
frame #7: 0x00007ffff7bc4184 libpthread.so.0`start_thread + 196
frame #8: 0x00007ffff74d737d libc.so.6`clone + 109
(lldb)
Had trouble setting breakpoints.
@SeanTAllen One more observation: It seems if the parameter of the answer()
method is tag
or iso
, there's no segfault. Only a problem with val
.
// -------------------v--- iso and tag are OK, but not val
be answer(p: Payload val) => None
Just out of curiosity, is there any news on this one? After watching the VUG about garbage collection I wondered if the error should be looked for in that area. Just a wild guess.
Nothing yet.
Difficult one I guess
And we could use more folks contributing. Its hard to keep up and do the day job and all that. Such is things... ¯_(ツ)_/¯
Hrm ... I just tried @Perelandric's minimal example from August 8 and I didn't get a seg fault. I tried both a debug and release build on OS X (10.11.4) using compiler version 0.3.0-14-gd8c6c55.
@perelandric can you see if this still happens for you?
@SeanTAllen Yep, took a few extra runs before it would segfault, but it got there.
@aturley Try upping the range from 10_000
to 100_000
. That gives me the segfault more reliably.
I got a little more information from lldb:
(lldb) run
Process 12481 launched: './build/issue-1118' (x86_64)
Process 12481 stopped
* thread #2: tid = 0x3422ef, 0x000000010000b862 issue-1118`pony_traceunknown(ctx=0x00000001097ffc48, p=0x00000001217dd000, m=2) + 34 at trace.c:114, stop reason = EXC_BAD_ACCESS (code=1, address=0x40)
frame #0: 0x000000010000b862 issue-1118`pony_traceunknown(ctx=0x00000001097ffc48, p=0x00000001217dd000, m=2) + 34 at trace.c:114
111 {
112 pony_type_t* t = *(pony_type_t**)p;
113
-> 114 if(t->dispatch != NULL)
115 {
116 ctx->trace_actor(ctx, (pony_actor_t*)p);
117 } else {
Something is going a little sideways here in GC land.
Yeah. At the moment, this is waiting on @sylvanc or I having time to address. At the moment, I'm working on new website before returning back to the land of pony code. I'm not going to get to this til sometime in mid October. If anyone wants to take a shot at it before then, have at it.
Hi, seems like by a happy coincidence it's just a simple typo.
- pony_type_t* t = *(pony_type_t**)p;
+ pony_type_t* t = (pony_type_t*)p;
After a quick fix all is fine. Here is the cure https://github.com/ponylang/ponyc/pull/1249
@dimiii all CI jobs failed with a segfault with that change applied. I left some questions for you on #1249.
It looks like this is an uninitialised type descriptor pointer problem. I'll try digging into it.
I think I got it. This is a premature free issue.
Full tracing of val
objects sent in messages is deferred until a GC cycle on the owner. This means the _ServerConnection
assigned to the Payload
and then sent in a message (via the Payload
) has a reference count greater than 0 but isn't aware of it until the Payload
is traced. By then, the _ServerConnection
might have been GC'd, which of course results in bad stuff happening.
@Praetonus @SeanTAllen
BTW, Do you have plans to bootstrap the compiler? I mean the compiler and runtime.
@dimiii - Building a Pony compiler in Pony is a long-term goal, though it represents a major amount of work. There are no plans to write the runtime in Pony, as it is built on lower-level constructs that would be impractical to provide in Pony.
@dimiii re: the runtime point @jemc from a safety standpoint, a Pony runtime written in something like Rust would be awesome. However, C is far more accessible to the community and as such, the plans are to stay with C for the forseeable future for the runtime.
@Praetonus should this be relabeled from "ready for work"?
I'm not sure. I can see a fix for this but I think it would have some performance penalty. Maybe removing it until we've discussed it on the sync call would make sense.
Stability should be more important, imho. I would not advocate using a tool for server applications if it could occasionally crash.
Especially when the design of the language is to eliminate crashes caused by the programmer.
no one is suggesting not fixing. the concern is, can we find a way to fix without having a performance impact.
@peterzandbergen - it definitely still needs to be fixed, and it's still marked as high priority. Runtime segmentation faults are not acceptable .
Ok, issue located. @sylvanc will be working on this. Thank you everyone who contributed to finding this. Extra props to @Perelandric for lots of narrowing down a minimal case and @Praetonus for digging into the code.
Yes, thank you VERY much! This is an important and subtle and tricky one.
Well done, never doubted that this would not be picked up in a serious way. Kudos
I'm trying to verify that #1507 fixes this issue but I'm unable to reproduce the reported problem.
OSX El Capitan. LLVM 3.9.1
@Praetonus @Perelandric @aturley can any of you reproduce?
Both the original code and the most simplified code here give me "can't find declaration of 'Range'". Why?
@jkleiser You have to add use "collections"
at the beginning of the code.
Thanks @Praetonus . The simplified code now got compiled and ran without problems on my Mac with ponyc 0.10.0-74555f7.
The simplified code works on my Ubuntu machine. Both optimized and with ponyc --debug.
0.10.0-1c33065 [release] compiled with: llvm 3.9.0 -- gcc-5 (Ubuntu 5.4.1-2ubuntu1~14.04) 5.4.1 20160904
However, when I run the original code I get Segmentation fault (core dumped)
I have noticed that the issue did not occur in my environment anymore, being Ubuntu Mate in virtual box. Will test again this week to seeing if the problem has been resolved.
Op ma 23 jan. 2017 07:54 schreef Robert Hastings [email protected]:
The simplified code works on my Ubuntu machine. Both optimized and with ponyc --debug.
0.10.0-1c33065 [release] compiled with: llvm 3.9.0 -- gcc-5 (Ubuntu 5.4.1-2ubuntu1~14.04) 5.4.1 20160904
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ponylang/ponyc/issues/1118#issuecomment-274413255, or mute the thread https://github.com/notifications/unsubscribe-auth/AI2XrpZfLdrKh8_toJs6Mh3PLyCqZ6Vhks5rVE6_gaJpZM4JeyOG .
--
Peter Zandbergen Tel: +31 6 460 55 872
linkedIn: peterzandbergen https://www.linkedin.com/in/peterzandbergen
@peterzandbergen See the discussion at: https://github.com/ponylang/ponyc/pull/1507
Just an update. @sylvanc is still working on a fix for this.
Original version still segfaults as of 0.21.0
The original version of this segfaults if I add some @printf output that slows everything down some. But, only when doing an optimized build.
0.32.0-a7a80758 [release] compiled with: llvm 7.0.1 -- cc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0 Defaults: pic=true
The original version of this segfaults for me but none of the others. I looked at it for a while and I'm somewhat confused. I hope to be able to spend more time on it and maybe get some more eyes on it.
The original version that segfaults doesn't if the usage of Payload val
is changed to Payload
so that an iso
never becomes a val
. There's still a bug there.
works:
use "collections"
interface val ResponseHandler
fun val apply(request: Payload, response: Payload): Any
interface val RequestHandler
fun val apply(request: Payload): Any
primitive Handle
fun val apply(request: Payload) =>
(consume request).respond(Payload)
actor _ServerConnection
let _handler: RequestHandler
new create(h: RequestHandler) => _handler = h
be dispatch(request: Payload) =>
request.handler = recover this~answer() end
_handler(consume request)
be answer(request: Payload, response: Payload) =>
None
class iso Payload
var handler: (ResponseHandler | None) = None
fun iso respond(response': Payload) =>
try
let h = (handler) as ResponseHandler
h(consume this, consume response')
end
actor Main
new create(env: Env) =>
let t = Test
for i in Range(0, 10_000) do
t.do_it()
end
actor Test
be do_it() =>
_ServerConnection(Handle).dispatch(Payload)
This probably means that Benoit's statement here:
I think I got it. This is a premature free issue.
Full tracing of val objects sent in messages is deferred until a GC cycle on the owner. This means the >_ServerConnection assigned to the Payload and then sent in a message (via the Payload) has a reference count >greater than 0 but isn't aware of it until the Payload is traced. By then, the _ServerConnection might have been >GC'd, which of course results in bad stuff happening.
might still be true.
When I look at it in the debugger, it sure seems like what we are pointing at in trace_unknown is now junk. That said, I'm not positive yet.
Here's an updated example for what still fails:
use "collections"
interface val ResponseHandler
fun val apply(request: Payload val, response: Payload): Any
interface val RequestHandler
fun val apply(request: Payload): Any
primitive Handle
fun val apply(request: Payload) =>
(consume request).respond(Payload)
actor _ServerConnection
let _handler: RequestHandler
new create(h: RequestHandler) => _handler = h
be dispatch(request: Payload) =>
request.handler = recover this~answer() end
_handler(consume request)
be answer(request: Payload val, response: Payload) =>
None
class iso Payload
var handler: (ResponseHandler | None) = None
fun iso respond(response': Payload) =>
try
let h = (handler) as ResponseHandler
h(consume this, consume response')
end
actor Main
new create(env: Env) =>
let t = Test
for i in Range(0, 1_000_000) do
t.do_it()
end
actor Test
be do_it() =>
_ServerConnection(Handle).dispatch(Payload)
The important part of this is that the request
object for ResponseHandler
is going from an iso when sent to a val on receive. If it isn't a val
on receive then all is good.
I'm not sure exactly where the bug is however, an important difference introduced here is that in recv_remote_object
we do not recurse the object if it is immutable. IE this code:
if(!obj->immutable)
recurse(ctx, p, t->trace);
is executed when Payload is also an iso
and then we have no crash.
Here's the backtrace:
* thread #8, name = '1118-1', stop reason = signal SIGSEGV: invalid address (fault address: 0x58)
* frame #0: 0x00000000004158ab 1118-1`pony_traceunknown(ctx=0x00007ffff7c22108, p=0x00007fffde3f1400, m=1) at trace.c:127:18
frame #1: 0x000000000041360e 1118-1`ponyint_gc_handlestack(ctx=0x00007ffff7c22108) at gc.c:677:5
frame #2: 0x0000000000415629 1118-1`ponyint_mark_done(ctx=0x00007ffff7c22108) at trace.c:75:3
frame #3: 0x000000000040c415 1118-1`try_gc(ctx=0x00007ffff7c22108, actor=0x00007fffe6c20800) at actor.c:301:3
frame #4: 0x000000000040ba04 1118-1`ponyint_actor_run(ctx=0x00007ffff7c22108, actor=0x00007fffe6c20800, polling=false) at actor.c:373:7
frame #5: 0x000000000041a9ba 1118-1`run(sched=0x00007ffff7c220c0) at scheduler.c:960:23
frame #6: 0x0000000000419ed3 1118-1`run_thread(arg=0x00007ffff7c220c0) at scheduler.c:1010:3
frame #7: 0x00007ffff7fa8609 libpthread.so.0`start_thread + 217
frame #8: 0x00007ffff7d48293 libc.so.6`__clone + 67
In pony_traceunkown
the first thing we do is:
pony_type_t* t = *(pony_type_t**)p;
And in the debugger when trying to see what t
is pointing at we get:
(lldb) p t
(pony_type_t *) $3 = 0x0000000000000010
(lldb) p *t
error: Couldn't apply expression side effects : Couldn't dematerialize a result variable: couldn't read its memory
i.e. that sure looks like junk by that point.
I've been playing around with this and sometimes, I get this assert on line 755 of actor.c to trigger:
// Make sure we're not trying to send a message to an actor that is about
// to be destroyed.
pony_assert(!ponyint_actor_pendingdestroy(to));
I also sometimes get a crash in the atomics for sending a message itself.
I've been using a debug build of runtime and program with --ponygcinitial=1 and --ponygcfactor=1
At least once, the kaboom message was a GC - acquire message
* thread #4, name = '1118', stop reason = signal SIGABRT
* frame #0: 0x00007ffff7c6c18b libc.so.6`raise + 203
frame #1: 0x00007ffff7c4b859 libc.so.6`abort + 299
frame #2: 0x000000000041d47a 1118`ponyint_assert_fail(expr="!ponyint_actor_pendingdestroy(to)", file="/home/sean/code/ponylang/ponyc/src/libponyrt/actor/actor.c", line=762, func="pony_sendv") at ponyassert.c:65:3
frame #3: 0x0000000000410d2b 1118`pony_sendv(ctx=0x00007ffff7c21a08, to=0x00007ffff7c20a00, first=0x00007fffde41f0a0, last=0x00007fffde41f0a0, has_app_msg=false) at actor.c:762:3
frame #4: 0x000000000041121b 1118`pony_sendp(ctx=0x00007ffff7c21a08, to=0x00007ffff7c20a00, id=4294967291, p=0x00007fffde41ec40) at actor.c:889:3
frame #5: 0x0000000000417c9a 1118`ponyint_gc_sendacquire(ctx=0x00007ffff7c21a08) at gc.c:797:5
Ok so the "source" of our problem is the large block of code here:
https://github.com/ponylang/ponyc/blob/main/src/libponyrt/actor/actor.c#L444
Where we see the following:
- an actor has completed its turn on the scheduler and it has an empty message queue
- the actor has reference count of 0 so it is ok to delete
except, there's still a reference to it in a message and when the receiver sends an acquire message back, we go kaboom if we have already deleted ourselves or otherwise marked ourselves for deletion (which can happen in the debug case).
We apparently do not see the reference to ourself when we send (we probably aren't tracing?) and then bad things. Given that changing the receiver to an iso from a val suggests the general path that we are missing noticing that it is us.
More to look into but a start.
Commenting out the entire block that starts at line 444 of actor.c "fixes" this crash, but it does create other issues with memory growth etc that the local delete is meant to fix.
Ok so from what I see...
The actor is sending a message to another actor but when it hits the end of its run with an empty queue, it sees it's own rc as 0 and so goes through the early delete and o o, when the other actor tries to send back a gc message for having taken possession of the items in the message. Kaboom.
So the question is, why on this particular send is our rc 0 instead of being higher as it should be.
The logic in question to "self delete" starts here: https://github.com/ponylang/ponyc/blob/main/src/libponyrt/actor/actor.c#L444
Here's a more minimal repro which segfaults without using partial application (it uses an explicit class val
to hold the actor) and a bit less of the bouncing around between different functions (the dispatch
behavior directly sends the message that goes boom instead of asking the Payload
class to do it):
use "collections"
actor _BoomActor
be dispatch(request: Payload) =>
request.holder = _BoomActorHolder(this)
boom_behavior(consume request, Payload)
be boom_behavior(request: Payload val, response: Payload val) =>
None
class val _BoomActorHolder
let boom_actor: _BoomActor
new val create(boom_actor': _BoomActor) => boom_actor = boom_actor'
class iso Payload
var holder: (_BoomActorHolder | None) = None
actor Main
new create(env: Env) =>
let t = Test
for i in Range(0, 1_000_000) do
t.do_it()
end
actor Test
be do_it() =>
_BoomActor.dispatch(Payload)
I have some ideas about what is going on here but need to discuss further with Sean.
@jemc and I have a plan to address. There will be a performance impact, but correctness trumps performance. Joe will also write up a mitigation that can help offset some performance impact.
To summarize what we discussed in the Zulip thread:
The current Pony runtime has a correctness bug due to what is usually a valid optimization, but in this case is not.
Specifically, the Pony runtime traces immutable (val
) objects shallowly - that is, it skips tracing of fields within such objects. This saves time by reducing how much tracing has to happen, and it is described as a safe optimization in the ORCA paper, because the "outer" val
object acts as an upper bound on the lifetime of the "inner" objects referred to by its fields.
While that optimization is safe within the limited scope of what was considered in the ORCA paper, the reasoning ignores the counting of actor references (which was outside the scope of the ORCA paper).
If a val
object has references (either directly as its fields, or transitively as fields of its fields) to any actors, those actors need to be traced. Hence, for such an object we cannot keep this optimization in place.
But for val
objects which are known via static analysis to not possibly refer to any actors, this optimization is safe and we'd like to keep it in place if possible, to keep the part of the benefit of this optimization for some workloads.
As such, we want to add a new kind of static analysis to the compiler that can classify any given data type as "definitely contains no actor references" or "may possibly contain an actor reference". If we can mark an type with the internal designation contains_no_actors
, then it is valid for that type to participate in the above mentioned optimization, and the compiler should generate a trace function for that type which uses the optimized path when immutable. Otherwise, it would need to take a new pessimistic path for the sake of correctness, tracing it at runtime so that any actors it may contain are traced.
To determine if a type should be marked as contains_no_actors
:
-
If any field type is an
actor
type, or a composite type (tuple, union, intersection) referring to anactor
type => returnfalse
. -
If any field type refers (possibly within a composite type) to a type which is not marked
contains_no_actors
=> returnfalse
.- This implies we need to be able to do this analysis recursively, and it also implies we need a mechanism to break self-recursion.
- We can likely use a similar mechanism to that used in subtype checking; that is, we can have an "assumptions list" for things that we are temporarily assuming to be marked
contains_no_actors
, and every time we recurse into a type we push it onto that list, such that we will surely terminate and no type will mark itself ascontains_no_actors
without some cause which is not itself.
-
If the type under consideration is an abstract type (such as an
interface
ortrait
), and reachability analysis shows that the abstract type has in the reachable program subsumed any type which is not markedcontains_no_actors
=> returnfalse
.- This also implies recursive analysis - this time recursing into subsumed types instead of into fields.
-
Otherwise, the type has been shown to not possibly contain any actors => return
true
.