Add escape / lifetime analysis to allocate objects in different heaps / arenas based on their lifetime

Open plajjan opened this issue 1 year ago • 1 comments

We should add escape / lifetime analysis so that we can place each variable in one of multiple categories of lifetime scopes which in turn allows us to allocate this where it makes the most sense.

The idea is that based on such lifetimes we could place allocations in a memory arena. I suppose it's possible to implement arenas in different ways but I'm thinking something like:

arenas have a malloc that allocate within the arena
arenas are freed in one go
arena.free() is called at some relatively deterministic time, which puts an upper bound to the lifetime of the entire arena
arenas are variable sized, i.e. if they "run out" of memory the will in turn allocate from a backing allocator, either normal malloc or perhaps from OS via mmap
arenas could be very compact by only allocating small amounts of memory from backing allocator but they can probably become more efficient by allocating in somewhat larger chunks
we do not GC within an arena, all resources are just freed when the arena is freed
arenas are really neat in that this kind of use case is bound to a worker thread, so both allocation and freeing happens locally in a thread, thus no locks

Some lifetimes and how we could allocate objects:

object lifetime: attributes of an object can be grouped under the lifetime of the object
- probably requires the object to be immutable or that the values are word sized?
- like it makes sense to have an unboxed i64 as just part of the object class struct rather than allocate it separately and we'd pass the value by value and not as a pointer reference!?
- for an immutable object, it's a given that all attributes share the lifetime of the object
- GC speedup is essentially x times how many attributes we have on our objects, so the more attributes the more speedup
function lifetime: each stack frame / nested function is a lifetime
- for this lifetime we have a choice of placing things on the stack, so for the return value from a called function, we can place the result variable on the stack of the parent function
- but large values or variable sized values probably shouldn't go on stack
- so we can have an arena where a child
continuation lifetime: we can handle lifetimes of all stacked functions like this, trivially, up to the level of actor method continuation
- since a actor method continuation runs in one go by a worker thread, until it has completed, this works
- for things that have a lifetime longer than a continuation, like an actor variable attribute, we cannot use arenas as per above, since there would be no "deterministic" time to call arena.free(). Only when an actor is dead / unreachable would we be able to free() and that's just too late, we would need to run GC in between since actor lifetimes can be really long
method lifetime: there is one more, slightly more intricate, lifetime, for variables that do not leave a method but where the method has been chopped up in multiple continuations
- if we use an arena for this, it would survive longer than a continuation and thus be sort of resting between worker threads executing code
- unlike the other arenas, which we could just keep references to from the worker threads worker context or simiar, an arena for the method lifetime would need to be stored somewhere else, related to the Actor / Msg and reachable from readyQ or so
- arena can be cleared when last continuation in current method is done
- could this be long-lived enough that we'd need GC?
actor lifetime: actors private state
- not shared with other actors, so no references to keep track of
- needs to be GCed though, not like short lived arenas
- GC can be per actor, no locks nor anything... weee!
global heap
- for things sent between actors
- reference counting is probably the way here for the long term
- Pony's ORCA GC uses the actor message subsystem to send GC reference / dereference messages - something we could use as well, which would naturally scale out well for actors both in a local system as well as in a distributed environment

Analyze the scope of variables to find variables that we can place in lifetime limited arenas. The simplest example is that of a variable that is temporary and only used within the scope of one function. The next step is moving return values to be placed on the stack/arena of the outer / calling function.

What triggered me to look into this is the bad performance of the GC and trying to lessen the load on the GC-heap but using arenas is no temporary hack. Even with a better GC, I think we want multiple approaches to memory management, and doing static lifetime analysis is probably the best foundation for that either way. The way we need to do GC for inter-actor messages across a distributed system fundamentally has different requirements than a GC for a shorter lifetime. Like even if we say there are problems with arenas so that we want a GC for say the continuation lifetime, we could then have a GC that works differently than the inter-actor GC, doing less things (because no thread synchronization required!) and thus be faster for that type of work.

I don't see a world in where lifetime analysis would be bad, it is always a good foundation!

I think it's also really encouraging that we can implement lifetime analysis and then based on knowing some of the shorter lifetimes, we can implement arenas and take on this work in smaller pieces. Our current GC remains the bigger hindrance to performance, so anything we do to lessen the load on it, is a very good thing!

Jan 25 '24 08:01 plajjan

@nordlander @sydow I updated this a bit based on my most recent thoughts. Feel free to read & comment.

I suppose I could also mention that the sort of reason why this popped up now is that we recently got the PRW exporter in Telemetrify and I wrote a small program to just generate mock data and export that to a time series database. It's doing like 20% of the total amount of work that Telemetrify would normally do, so I know Telemetrify will run ever slower. My example program is more like the upper end of the performance but it doesn't run nearly as fast as I want. I guess my expectations were quite low already even though my hopes were higher and long term I need to be muuuuch faster. I haven't fully benchmarked things here but I believe the GC, as always, is the main source of poor performance and so I began thinking of how to improve it.. and voila, this issue :)

May 08 '24 08:05 plajjan