The Brilirs memory extension implementation is very slow
As shown in https://github.com/sampsyo/bril/pull/189#issuecomment-1091979145, the current memory extension implementation for brilirs is surprisingly slow.
Currently, there are no benchmarks that are memory-intensive enough, which makes it difficult to know what to optimize for. The closest program seems to be test/mem/alloc_large.bril, which only allocates and frees a same-sized chunk of memory in a loop.
The current implementation of `Heap`, which supports the memory extension in brilirs, is ported from brili and could be greatly improved. A couple of possible pain points:
- The `Heap` is implemented as a map from the "base" of the pointer to a `Vec<Value>`. This is nice in that it grows and shrinks with the amount of memory in use at any given time. However, the use of maps like `HashMap<usize, Vec<Value>>` has historically been a performance hit in brilirs due to needing to hash the key and perform the lookup.
  - Ideally we would compress the `Heap` into either a `Vec<Vec<Value>>` or a `Vec<Value>`, with the "base" of the pointer just indexing into the start of its allocation in the array. In either case we would want the ability to reuse indexes as allocations are freed (see the sketch after this list).
    - The implementation using `Vec<Vec<Value>>` is probably slower than the alternative because it requires two array accesses to read or write a value. However, it is probably easier to implement.
    - `Vec<Value>` is more along the lines of implementing something like malloc, with all of the performance benefits and fragmentation complexities that entails.
      - Unlike malloc, however, one could implement a form of garbage collection and reassign the indexes of pointers, since this implementation detail is not exposed to the user. It's not clear to me how this would affect performance.
  - Regardless of the implementation, the `Heap` still needs to enforce errors on buffer over/underflows, use after free, use of uninitialized memory, and memory leaks.
- The current implementation allocates a new `Vec<Value>` on every call to `alloc`. Ideally, brilirs should be able to reuse previously freed allocations. This would especially target `alloc_large.bril` and could potentially provide a significant improvement over brili.
  - Two possible solutions are the previously mentioned `Heap` as just one large `Vec<Value>` (also illustrated in the sketch below), or something like a slab allocator, delegating the reuse to an external library.
- Unlike stack allocation, where it is statically known how many variables are in scope, memory allocation is dynamic, which limits the interpreter's ability to pre-allocate capacity for the `Heap` to use during the course of interpretation.
  - brilirs already employs some static analysis for type checking, and it would be interesting if it could use a data flow analysis to get a very rough estimate of whether it should start out with a small or large `Heap` size. Some programs don't use the memory extension, or use it sparingly, which is the default case; on the other hand, we also want to optimize for programs that make intense use of memory.
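To make the flat-`Vec<Value>` idea concrete, here is a minimal sketch. This is not brilirs's actual code: `Value`, `Pointer`, and the string errors are simplified stand-ins for its real types. All allocations share one backing `Vec<Value>`, freed blocks are reused first-fit, and per-allocation metadata is kept so the interpreter can still report use-after-free, out-of-bounds accesses, reads of uninitialized memory, and leaks:

```rust
#[derive(Clone, Copy)]
enum Value {
    Uninit, // lets the heap detect reads of uninitialized memory
    Int(i64),
    Bool(bool),
}

/// A Bril pointer: an allocation id plus an offset into that allocation.
#[derive(Clone, Copy)]
struct Pointer {
    alloc_id: usize,
    offset: usize,
}

struct Allocation {
    base: usize, // index of the allocation's first slot in `data`
    len: usize,
    live: bool, // false once freed, so stale pointers are caught
}

#[derive(Default)]
struct Heap {
    data: Vec<Value>,                 // one flat backing store
    allocs: Vec<Allocation>,          // indexed by `alloc_id`
    free_ids: Vec<usize>,             // allocation ids available for reuse
    free_blocks: Vec<(usize, usize)>, // (base, len) of freed regions
}

impl Heap {
    fn alloc(&mut self, len: usize) -> Pointer {
        // First-fit reuse of a freed block; otherwise grow the backing Vec.
        let base = match self.free_blocks.iter().position(|&(_, l)| l >= len) {
            Some(i) => {
                let (base, block_len) = self.free_blocks.swap_remove(i);
                if block_len > len {
                    // Return the unused tail to the free list.
                    self.free_blocks.push((base + len, block_len - len));
                }
                // Clear stale values so a previous allocation's contents
                // can't be observed through the new pointer.
                self.data[base..base + len].fill(Value::Uninit);
                base
            }
            None => {
                let base = self.data.len();
                self.data.resize(base + len, Value::Uninit);
                base
            }
        };
        let alloc = Allocation { base, len, live: true };
        let alloc_id = match self.free_ids.pop() {
            Some(id) => {
                self.allocs[id] = alloc;
                id
            }
            None => {
                self.allocs.push(alloc);
                self.allocs.len() - 1
            }
        };
        Pointer { alloc_id, offset: 0 }
    }

    fn free(&mut self, ptr: Pointer) -> Result<(), String> {
        let a = &mut self.allocs[ptr.alloc_id];
        if !a.live {
            return Err("double free".into());
        }
        if ptr.offset != 0 {
            return Err("cannot free a non-base pointer".into());
        }
        a.live = false;
        self.free_blocks.push((a.base, a.len));
        self.free_ids.push(ptr.alloc_id);
        Ok(())
    }

    // `write` would mirror this with a mutable slot instead of a read.
    fn read(&self, ptr: Pointer) -> Result<Value, String> {
        let a = &self.allocs[ptr.alloc_id];
        if !a.live {
            return Err("use after free".into());
        }
        if ptr.offset >= a.len {
            return Err("out-of-bounds access".into());
        }
        match self.data[a.base + ptr.offset] {
            Value::Uninit => Err("read of uninitialized memory".into()),
            v => Ok(v),
        }
    }

    // Called once at program exit to enforce the no-leaks rule.
    fn all_freed(&self) -> bool {
        self.allocs.iter().all(|a| !a.live)
    }
}
```

Fragmentation handling here is deliberately naive (freed blocks are never coalesced); a real version might merge adjacent free blocks, or delegate the reuse to an external slab allocator as suggested above.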
Indeed, using a flat vector of values could really help.
But I also agree with your initial premise: at the moment, we don't have many benchmarks that do anything very intense with memory, mostly because it's hard to write these by hand and we haven't compiled any realistic applications from higher-level languages. Doing that would be a whole project unto itself, but it would also really help stress the GC task for 6120 (https://github.com/sampsyo/cs6120/discussions/297)…
I am working on (and mostly have) an interpreter whose memory is natively allocated using malloc/free, and it is showing a 97-98% speedup over brili and about an 84% improvement over brilirs!
Hey, I assume you're referencing https://github.com/sampsyo/cs6120/issues/304? It's super cool that another "fast" interpreter is being implemented, and I'm pretty interested in its comparison against brilirs.
I took a look at your code, and it makes sense that your interpreter will be faster (it seems to elide the checks that brilirs performs in order to return errors, which might make it a good comparison of the excess overhead in brilirs). For the memory extension, you seem to have each Bril pointer value be a pointer to the memory directly. It might be interesting to see whether brilirs could adopt something similar (perhaps using reference counting, or one could use unsafe Rust).
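For what it's worth, here is a rough sketch of the reference-counted (safe Rust) variant of that idea; `Value` and `BrilPtr` are hypothetical names, not brilirs's actual types. Each pointer holds a shared handle to its own backing vector plus an offset, so loads and stores need no map lookup at all:

```rust
use std::cell::RefCell;
use std::rc::Rc;

#[derive(Clone, Copy)]
enum Value {
    Int(i64),
    Bool(bool),
}

// A Bril pointer that references its allocation directly.
#[derive(Clone)]
struct BrilPtr {
    alloc: Rc<RefCell<Vec<Value>>>,
    offset: usize,
}

impl BrilPtr {
    fn load(&self) -> Result<Value, String> {
        self.alloc
            .borrow()
            .get(self.offset)
            .copied()
            .ok_or_else(|| "out-of-bounds load".to_string())
    }

    fn store(&self, v: Value) -> Result<(), String> {
        match self.alloc.borrow_mut().get_mut(self.offset) {
            Some(slot) => {
                *slot = v;
                Ok(())
            }
            None => Err("out-of-bounds store".to_string()),
        }
    }
}
```

One wrinkle: with `Rc`, the allocation is dropped when the last pointer to it goes away, so explicit `free` semantics (and the use-after-free and leak errors the memory extension requires) would need extra bookkeeping, such as a `live` flag stored next to the vector.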
One observation about this particular test: Rust (for safety) writes a default value into each element of the array (unless you use https://doc.rust-lang.org/stable/std/mem/union.MaybeUninit.html, which maybe brilirs should), which means that the memory brilirs allocates in this test gets written to even though the program never touches it. Having previously read https://lemire.me/blog/2021/10/27/in-c-how-do-you-know-if-the-dynamic-allocation-succeeded/, I wonder if your interpreter is optimized for this case by not actually allocating the memory until it gets used in the program. Some quick profiling of brilirs seems to suggest this is the case, but I'm curious whether you looked at this in your interpreter?
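For illustration, a minimal sketch of the `MaybeUninit` approach (with a hypothetical `Value` stand-in, not brilirs's real type): it reserves space without writing a default into every slot, so untouched pages need never be dirtied.

```rust
use std::mem::MaybeUninit;

#[derive(Clone, Copy)]
enum Value {
    Int(i64),
    Bool(bool),
}

// Reserve space for `len` values without initializing them. Because no
// element is written, the OS can defer backing these pages with physical
// memory until the program first touches them.
fn alloc_uninit(len: usize) -> Vec<MaybeUninit<Value>> {
    let mut buf: Vec<MaybeUninit<Value>> = Vec::with_capacity(len);
    // SAFETY: `MaybeUninit<T>` is valid in an uninitialized state, so
    // extending the length without writing the elements is sound.
    unsafe { buf.set_len(len) };
    buf
}
```

The catch is that every read then has to go through `assume_init`, and the interpreter would still need to track which slots have been written, since the memory extension must report reads of uninitialized memory as errors.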
I mean, this is down to the OS/implementation of malloc, right? I'm not zeroing out the memory, so I'd assume that yes, this optimization will happen.