Better multi-thread concurrency
Right now there's One Big Lock, which will slow things down for large-scale computation.
Here is an algorithm that should work, with a Python sketch.

EDIT: Not sure it works; thinking about it some more, in particular what happens when allocations switch threads.

Basic idea is that the peak/callstack-id data is buffered per thread at 1KB resolution. Worst case, the profiling results are off by 1KB × number of concurrent threads.
```python
def add_allocation(address, size):
    callstack_id = get_callstack_id()
    # 1. Track allocation so free() has the right info.
    #    Alternatively, dashmap or some other concurrent map might be even better.
    with all_allocs_lock:
        glob.all_allocs[address] = (callstack_id, size)
    # 2. Add allocation to thread-local cache:
    thread_local.allocated += size
    thread_local.allocations.append((callstack_id, size))
    # 3. If necessary, update global data structures by flushing the
    #    thread-local buffer:
    if thread_local.allocated > 1024 or len(thread_local.allocations) > 100:
        process_allocations(thread_local.allocations)
        thread_local.allocated = 0
        thread_local.allocations = []


def process_allocations(allocations):
    # glob is the shared global state; in real code these updates need a lock.
    for callstack_id, size in allocations:
        glob.allocated += size
        glob.callstack_allocations[callstack_id] += size
        if glob.allocated > glob.peak_allocated:
            glob.peak_allocated = glob.allocated
            glob.callstack_peak_allocations = glob.callstack_allocations.copy()
```
Another thought: assuming a global map for the address → (callstack id, size) mapping, you don't actually have to delete entries when freeing memory, just read them. This would cut the number of writes to the map by at most half, replacing them with reads, so if the map uses read/write lock(s) this would improve concurrency at the cost of higher memory usage.
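A minimal sketch of the read-instead-of-delete idea. All names here are hypothetical, and a plain `Lock` stands in for the read side of the read/write lock the text assumes:

```python
import threading

# Stand-in for the read side of an RW lock; a real implementation would
# use a read/write lock or a dashmap-style concurrent map.
all_allocs_lock = threading.Lock()
all_allocs = {}  # address -> (callstack_id, size)

def remove_allocation(address):
    # Only *read* the entry, never delete it: stale entries are tolerated
    # and the map never shrinks, trading memory for fewer writes.
    with all_allocs_lock:
        entry = all_allocs.get(address)
    if entry is None:
        return 0  # untracked allocation, so untracked free
    callstack_id, size = entry
    return size
```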
A different approach for multi-threading: a single thread manages all the data structures, and all the others push commands to it, e.g. using desync.

So far that's slow. Might have to return to the original idea.
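For illustration, the single-manager-thread design can be sketched in Python (the real version would be Rust with desync; names and the command shapes here are made up):

```python
import queue
import threading

# Worker threads push (command, size) tuples; one manager thread owns the
# stats, so the data structures themselves need no locking.
commands = queue.Queue()
stats = {"allocated": 0}

def manager():
    while True:
        cmd, size = commands.get()
        if cmd == "stop":
            break
        if cmd == "alloc":
            stats["allocated"] += size
        elif cmd == "free":
            stats["allocated"] -= size

t = threading.Thread(target=manager)
t.start()
commands.put(("alloc", 100))
commands.put(("alloc", 50))
commands.put(("free", 30))
commands.put(("stop", 0))
t.join()
```

The cost is that every allocation pays for a channel send, which may explain why this is slow in practice.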
Couple of things to try with the desync-y solution:

- rwlock on the interner, instead of just a lock
- replace box-of-closure with a custom enum-of-all-supported-commands in the channel, to get rid of the Box.
Didn't work.
New idea, a simpler version of the first idea:

- Each thread keeps a queue of allocations and frees. If the net total hits abs(512) bytes (since untracked allocs might result in untracked frees, doubling the impact) or N items are queued, dump them all in one write transaction using the current logic.
- Likely requires an RW lock at least on the main hashmap so free()s can figure out how much they're freeing, or perhaps malloc_usable_size() (needs benchmarking; on Linux this would be using jemalloc's... https://lemire.me/blog/2017/09/15/how-fast-are-malloc_size-and-malloc_usable_size-in-c/). Would need #93 or some other way to get rid of that lock.
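A sketch of this buffered-delta variant, assuming signed deltas (+size for alloc, −size for free) and ignoring callstack ids for brevity; thresholds and names are illustrative, not from the real code:

```python
import threading

FLUSH_BYTES = 512   # abs() threshold from the note above
FLUSH_ITEMS = 100   # the "N items queued" cap; value is a guess

global_lock = threading.Lock()
global_stats = {"allocated": 0, "peak": 0}

local = threading.local()

def record(delta):
    # Queue a signed delta in the thread-local buffer.
    if not hasattr(local, "pending"):
        local.pending, local.total = [], 0
    local.pending.append(delta)
    local.total += delta
    # abs(): a burst of frees should trigger a flush just like allocs do.
    if abs(local.total) >= FLUSH_BYTES or len(local.pending) >= FLUSH_ITEMS:
        flush()

def flush():
    # One write transaction for the whole batch, doing the current logic.
    with global_lock:
        for delta in local.pending:
            global_stats["allocated"] += delta
            if global_stats["allocated"] > global_stats["peak"]:
                global_stats["peak"] = global_stats["allocated"]
    local.pending, local.total = [], 0
```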
Maybe the current code is slow due to bad cache line behavior? Worth checking.

After further thought, seems unlikely.