The Brilirs memory extension implementation is very slow
As shown in https://github.com/sampsyo/bril/pull/189#issuecomment-1091979145, the current memory extension implementation for brilirs is surprisingly slow.
Currently, there are no benchmarks that are memory-intensive enough, which makes it difficult to know what to optimize for. The closest program seems to be test/mem/alloc_large.bril, which only allocates and frees a same-sized chunk of memory in a loop.
The current implementation of `Heap`, which supports the memory extension in brilirs, is ported from brili and could be greatly improved. A couple of possible pain points:
- The `Heap` is implemented as a map from the "base" of the pointer to a `Vec<Value>`. This is nice in that it grows and shrinks with the amount of memory in use at any given time. However, the use of maps like `HashMap<usize, Vec<Value>>` has historically been a performance hit in brilirs due to needing to hash the key and perform the lookup.
  - Ideally we would compress the `Heap` into either a `Vec<Vec<Value>>` or a `Vec<Value>`, with the "base" of the pointer just indexing into the start of its allocation in the array. In either case we would want the ability to reuse indexes as allocations are freed (see the sketch after this list).
    - The implementation using `Vec<Vec<Value>>` is probably slower than the alternative because it requires two array accesses to read or write a value. However, it is probably easier to implement.
    - `Vec<Value>` is more along the lines of implementing something like malloc, with all of the performance benefits and fragmentation complexities that entails.
      - Unlike malloc, however, one could implement a form of garbage collection and reassign the indexes of pointers, since this implementation detail is not exposed to the user. It's not clear to me how this would affect performance.
  - Regardless of the implementation, the `Heap` still needs to enforce errors on buffer over/underflows, use after free, use of uninitialized memory, and memory leaks.
- The current implementation allocates a new `Vec<Value>` on every call to `alloc`. Ideally, brilirs should be able to reuse previously freed allocations. This would especially target `alloc_large.bril` and could potentially provide a significant improvement over brili.
  - Two possible solutions are the previously mentioned `Heap` as just one large `Vec<Value>` (also illustrated in the sketch below), or something like a slab allocator, delegating the reuse to an external library.
- Unlike stack allocation, where it is statically known how many variables are in scope, memory allocation is dynamic, which limits the interpreter's ability to pre-allocate capacity for the `Heap` to use during the course of interpretation.
  - brilirs already employs some static analysis for type checking, and it would be interesting if it could use a data flow analysis to get a very rough estimate of whether it should start out with a small or large `Heap` size. Some programs don't use the memory extension, or use it sparingly, which is the default case; on the other hand, we also want to optimize for programs that make intense use of memory.
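To make the flat-`Vec<Value>` idea concrete, here is a minimal sketch. This is not brilirs's actual code: `Value`, `Pointer`, and the string errors are simplified stand-ins for its real types. All allocations share one backing `Vec<Value>`, freed blocks are reused first-fit, and per-allocation metadata is kept so the interpreter can still report use-after-free, out-of-bounds accesses, reads of uninitialized memory, and leaks:

```rust
#[derive(Clone, Copy)]
enum Value {
    Uninit, // lets the heap detect reads of uninitialized memory
    Int(i64),
    Bool(bool),
}

/// A Bril pointer: an allocation id plus an offset into that allocation.
#[derive(Clone, Copy)]
struct Pointer {
    alloc_id: usize,
    offset: usize,
}

struct Allocation {
    base: usize, // index of the allocation's first slot in `data`
    len: usize,
    live: bool, // false once freed, so stale pointers are caught
}

#[derive(Default)]
struct Heap {
    data: Vec<Value>,                 // one flat backing store
    allocs: Vec<Allocation>,          // indexed by `alloc_id`
    free_ids: Vec<usize>,             // allocation ids available for reuse
    free_blocks: Vec<(usize, usize)>, // (base, len) of freed regions
}

impl Heap {
    fn alloc(&mut self, len: usize) -> Pointer {
        // First-fit reuse of a freed block; otherwise grow the backing Vec.
        let base = match self.free_blocks.iter().position(|&(_, l)| l >= len) {
            Some(i) => {
                let (base, block_len) = self.free_blocks.swap_remove(i);
                if block_len > len {
                    // Return the unused tail to the free list.
                    self.free_blocks.push((base + len, block_len - len));
                }
                // Clear stale values so a previous allocation's contents
                // can't be observed through the new pointer.
                self.data[base..base + len].fill(Value::Uninit);
                base
            }
            None => {
                let base = self.data.len();
                self.data.resize(base + len, Value::Uninit);
                base
            }
        };
        let alloc = Allocation { base, len, live: true };
        let alloc_id = match self.free_ids.pop() {
            Some(id) => {
                self.allocs[id] = alloc;
                id
            }
            None => {
                self.allocs.push(alloc);
                self.allocs.len() - 1
            }
        };
        Pointer { alloc_id, offset: 0 }
    }

    fn free(&mut self, ptr: Pointer) -> Result<(), String> {
        let a = &mut self.allocs[ptr.alloc_id];
        if !a.live {
            return Err("double free".into());
        }
        if ptr.offset != 0 {
            return Err("cannot free a non-base pointer".into());
        }
        a.live = false;
        self.free_blocks.push((a.base, a.len));
        self.free_ids.push(ptr.alloc_id);
        Ok(())
    }

    // `write` would mirror this with a mutable slot instead of a read.
    fn read(&self, ptr: Pointer) -> Result<Value, String> {
        let a = &self.allocs[ptr.alloc_id];
        if !a.live {
            return Err("use after free".into());
        }
        if ptr.offset >= a.len {
            return Err("out-of-bounds access".into());
        }
        match self.data[a.base + ptr.offset] {
            Value::Uninit => Err("read of uninitialized memory".into()),
            v => Ok(v),
        }
    }

    // Called once at program exit to enforce the no-leaks rule.
    fn all_freed(&self) -> bool {
        self.allocs.iter().all(|a| !a.live)
    }
}
```

Fragmentation handling here is deliberately naive (freed blocks are never coalesced); a real version might merge adjacent free blocks, or delegate the reuse to an external slab allocator as suggested above.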
Indeed, using a flat vector of values could really help.
But I also agree with your initial premise: at the moment, we don't have many benchmarks that do anything very intense with memory, mostly because it's hard to write these by hand and we haven't compiled any realistic applications from higher-level languages. Doing that would be a whole project unto itself, but it would also really help stress the GC task for 6120 (https://github.com/sampsyo/cs6120/discussions/297)…
I am working on (and mostly have) an interpreter whose memory is natively allocated using malloc/free, and it is showing a 97-98% speedup over brili and about an 84% improvement over brilirs!
Hey, I assume you're referencing https://github.com/sampsyo/cs6120/issues/304? It's super cool that another "fast" interpreter is being implemented, and I'm pretty interested in its comparison against brilirs.
I took a look at your code, and it makes sense that your interpreter will be faster (it seems to elide the checks that brilirs performs in order to return errors, which might make it a good comparison of the excess overhead in brilirs). For the memory extension, you seem to have each Bril pointer value be a pointer to the memory directly. It might be interesting to see whether brilirs could adopt something similar (perhaps using reference counting, or one could use unsafe Rust).
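For what it's worth, here is a rough sketch of the reference-counted (safe Rust) variant of that idea; `Value` and `BrilPtr` are hypothetical names, not brilirs's actual types. Each pointer holds a shared handle to its own backing vector plus an offset, so loads and stores need no map lookup at all:

```rust
use std::cell::RefCell;
use std::rc::Rc;

#[derive(Clone, Copy)]
enum Value {
    Int(i64),
    Bool(bool),
}

// A Bril pointer that references its allocation directly.
#[derive(Clone)]
struct BrilPtr {
    alloc: Rc<RefCell<Vec<Value>>>,
    offset: usize,
}

impl BrilPtr {
    fn load(&self) -> Result<Value, String> {
        self.alloc
            .borrow()
            .get(self.offset)
            .copied()
            .ok_or_else(|| "out-of-bounds load".to_string())
    }

    fn store(&self, v: Value) -> Result<(), String> {
        match self.alloc.borrow_mut().get_mut(self.offset) {
            Some(slot) => {
                *slot = v;
                Ok(())
            }
            None => Err("out-of-bounds store".to_string()),
        }
    }
}
```

One wrinkle: with `Rc`, the allocation is dropped when the last pointer to it goes away, so explicit `free` semantics (and the use-after-free and leak errors the memory extension requires) would need extra bookkeeping, such as a `live` flag stored next to the vector.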
One observation about this particular test: Rust (for safety) writes a default value into each element of the array (unless you use https://doc.rust-lang.org/stable/std/mem/union.MaybeUninit.html, which maybe brilirs should), which means that the memory brilirs allocates in this test gets written to even though the program never touches it. Having previously read https://lemire.me/blog/2021/10/27/in-c-how-do-you-know-if-the-dynamic-allocation-succeeded/, I wonder if your interpreter is optimized for this case by not actually allocating the memory until it gets used in the program. Some quick profiling of brilirs seems to suggest this is the case, but I'm curious whether you looked at this in your interpreter?
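For illustration, a minimal sketch of the `MaybeUninit` approach (with a hypothetical `Value` stand-in, not brilirs's real type): it reserves space without writing a default into every slot, so untouched pages need never be dirtied.

```rust
use std::mem::MaybeUninit;

#[derive(Clone, Copy)]
enum Value {
    Int(i64),
    Bool(bool),
}

// Reserve space for `len` values without initializing them. Because no
// element is written, the OS can defer backing these pages with physical
// memory until the program first touches them.
fn alloc_uninit(len: usize) -> Vec<MaybeUninit<Value>> {
    let mut buf: Vec<MaybeUninit<Value>> = Vec::with_capacity(len);
    // SAFETY: `MaybeUninit<T>` is valid in an uninitialized state, so
    // extending the length without writing the elements is sound.
    unsafe { buf.set_len(len) };
    buf
}
```

The catch is that every read then has to go through `assume_init`, and the interpreter would still need to track which slots have been written, since the memory extension must report reads of uninitialized memory as errors.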
I mean, this is down to the OS/implementation of malloc, right? I'm not zeroing out the memory, so I'd assume that yes, this optimization will happen.