Better multi-thread concurrency
Right now there's One Big Lock, which will slow things down for large-scale computation.
Here is an algorithm that should work, with a Python sketch.

EDIT: Not sure it works; thinking about it some more, in particular what happens when allocations switch threads.

Basic idea is that the peak/callstack-id data is buffered per thread at 1KB resolution. Worst case, the profiling results are off by 1KB × number of concurrent threads.
```python
def add_allocation(address, size):
    callstack_id = get_callstack_id()
    # 1. Track allocation so free() has the right info.
    #    Alternatively, dashmap or some other concurrent map might be even better.
    with all_allocs_lock:
        glob.all_allocs[address] = (callstack_id, size)
    # 2. Add allocation to thread-local cache:
    thread_local.allocated += size
    thread_local.allocations.append((callstack_id, size))
    # 3. If necessary, update global data structures by flushing the
    #    thread-local buffer:
    if thread_local.allocated > 1024 or len(thread_local.allocations) > 100:
        process_allocations(thread_local.allocations)
        thread_local.allocated = 0
        thread_local.allocations = []


def process_allocations(allocations):
    # glob is the shared global state; in real code these updates need a lock.
    for callstack_id, size in allocations:
        glob.allocated += size
        glob.callstack_allocations[callstack_id] += size
        if glob.allocated > glob.peak_allocated:
            glob.peak_allocated = glob.allocated
            glob.callstack_peak_allocations = glob.callstack_allocations.copy()
```
Another thought: assuming a global map for the address → (callstack id, size) mapping, you don't actually have to delete entries when freeing memory, just read them. This would cut the number of writes to the map by at most half, replacing them with reads, so if the map uses read/write lock(s) this would improve concurrency at the cost of higher memory usage.
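A minimal sketch of the read-instead-of-delete idea. All names here are hypothetical, and a plain `Lock` stands in for the read side of the read/write lock the text assumes:

```python
import threading

# Stand-in for the read side of an RW lock; a real implementation would
# use a read/write lock or a dashmap-style concurrent map.
all_allocs_lock = threading.Lock()
all_allocs = {}  # address -> (callstack_id, size)

def remove_allocation(address):
    # Only *read* the entry, never delete it: stale entries are tolerated
    # and the map never shrinks, trading memory for fewer writes.
    with all_allocs_lock:
        entry = all_allocs.get(address)
    if entry is None:
        return 0  # untracked allocation, so untracked free
    callstack_id, size = entry
    return size
```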
A different approach for multi-threading: a single thread manages all the data structures, and all the others push commands to it, e.g. using desync.

So far that's slow. Might have to return to the original idea.
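For illustration, the single-manager-thread design can be sketched in Python (the real version would be Rust with desync; names and the command shapes here are made up):

```python
import queue
import threading

# Worker threads push (command, size) tuples; one manager thread owns the
# stats, so the data structures themselves need no locking.
commands = queue.Queue()
stats = {"allocated": 0}

def manager():
    while True:
        cmd, size = commands.get()
        if cmd == "stop":
            break
        if cmd == "alloc":
            stats["allocated"] += size
        elif cmd == "free":
            stats["allocated"] -= size

t = threading.Thread(target=manager)
t.start()
commands.put(("alloc", 100))
commands.put(("alloc", 50))
commands.put(("free", 30))
commands.put(("stop", 0))
t.join()
```

The cost is that every allocation pays for a channel send, which may explain why this is slow in practice.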
Couple of things to try with the desync-y solution:

- rwlock on the interner, instead of just a lock
- replace box-of-closure with a custom enum-of-all-supported-commands in the channel, to get rid of the Box.
Didn't work.
New idea, a simpler version of the first idea:

- Each thread keeps a queue of allocations and frees. If the net total hits abs(512) bytes (since untracked allocs might result in untracked frees, doubling the impact) or N items are queued, dump them all in one write transaction using the current logic.
- Likely requires an RW lock at least on the main hashmap so free()s can figure out how much they're freeing, or perhaps malloc_usable_size() (needs benchmarking; on Linux this would be using jemalloc's... https://lemire.me/blog/2017/09/15/how-fast-are-malloc_size-and-malloc_usable_size-in-c/). Would need #93 or some other way to get rid of that lock.
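A sketch of this buffered-delta variant, assuming signed deltas (+size for alloc, −size for free) and ignoring callstack ids for brevity; thresholds and names are illustrative, not from the real code:

```python
import threading

FLUSH_BYTES = 512   # abs() threshold from the note above
FLUSH_ITEMS = 100   # the "N items queued" cap; value is a guess

global_lock = threading.Lock()
global_stats = {"allocated": 0, "peak": 0}

local = threading.local()

def record(delta):
    # Queue a signed delta in the thread-local buffer.
    if not hasattr(local, "pending"):
        local.pending, local.total = [], 0
    local.pending.append(delta)
    local.total += delta
    # abs(): a burst of frees should trigger a flush just like allocs do.
    if abs(local.total) >= FLUSH_BYTES or len(local.pending) >= FLUSH_ITEMS:
        flush()

def flush():
    # One write transaction for the whole batch, doing the current logic.
    with global_lock:
        for delta in local.pending:
            global_stats["allocated"] += delta
            if global_stats["allocated"] > global_stats["peak"]:
                global_stats["peak"] = global_stats["allocated"]
    local.pending, local.total = [], 0
```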
Maybe the current code is slow due to bad cache line behavior? Worth checking.

After further thought, seems unlikely.