proxylib Free Is Noticeably Slower Than Direct UMF Pool Under Multithreaded Workloads
In proxylib, every free operation must check whether the pointer being freed belongs to the "leak pool." The leak pool is a workaround for recursive allocations, which occur when the malloc function (overridden by proxylib) triggers another call to malloc (often through libraries like hwloc).
This check is performed under a lock, causing threads to synchronize on every free. This results in significant overhead under multithreaded loads. Although #1072 increases the size of the pool to reduce the time spent under this lock, the goal should be to remove the lock entirely.
Two approaches are under consideration:
- Use Atomic Operations Instead of a Mutex: The leak pool consists of multiple smaller pools linked together. When they are all full, a new pool is created. Instead of relying on a lock, we can manage this pool list with atomic compare-and-swap operations.
- Use a Single Large Pool: Rather than maintaining multiple pools, reserve one large anonymous mmap region (with PROT_NONE). If more space is needed, simply change the protection flags for new pages. Because the region's base address and size never change, checking whether a pointer belongs to the pool becomes a plain range comparison that needs no lock. On Windows, VirtualAlloc can be used similarly to reserve and commit pages on demand.
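To make the second option concrete, here is a minimal POSIX-only sketch, not proxylib's actual code. All names (`leak_pool_init`, `leak_pool_owns`, `leak_pool_alloc`) and both size constants are hypothetical, chosen only for illustration. It reserves address space with PROT_NONE, hands out memory with a compare-and-swap bump pointer, and commits pages lazily with mprotect. The ownership check that free would call takes no lock.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std=c11 */
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical sizes: 64 MiB of reserved address space, committed 64 KiB at a time. */
#define LEAK_POOL_RESERVE ((size_t)64 << 20)
#define LEAK_POOL_CHUNK ((size_t)1 << 16)

static char *leak_base;               /* written once in init, read-only afterwards */
static _Atomic size_t leak_used;      /* bump offset of the next free byte */
static _Atomic size_t leak_committed; /* prefix already switched to read/write */

/* Reserve address space only; PROT_NONE pages consume no physical memory yet. */
int leak_pool_init(void) {
    void *p = mmap(NULL, LEAK_POOL_RESERVE, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        return -1;
    }
    leak_base = p;
    return 0;
}

/* Lock-free ownership check: one subtraction and two comparisons. */
int leak_pool_owns(const void *ptr) {
    uintptr_t p = (uintptr_t)ptr;
    uintptr_t b = (uintptr_t)leak_base;
    return p >= b && p - b < LEAK_POOL_RESERVE;
}

void *leak_pool_alloc(size_t size) {
    assert(leak_base != NULL); /* leak_pool_init() must have succeeded */
    if (size > LEAK_POOL_RESERVE) {
        return NULL;
    }
    size = (size + 15) & ~(size_t)15; /* keep 16-byte alignment */

    /* Claim [off, off + size) with compare-and-swap instead of a mutex. */
    size_t off = atomic_load(&leak_used);
    do {
        if (size > LEAK_POOL_RESERVE - off) {
            return NULL; /* reservation exhausted */
        }
    } while (!atomic_compare_exchange_weak(&leak_used, &off, off + size));

    /* Commit pages on demand. Racing threads may mprotect overlapping
     * ranges, which is harmless because the call is idempotent. */
    size_t end = off + size;
    size_t committed = atomic_load(&leak_committed);
    if (end > committed) {
        size_t new_end = (end + LEAK_POOL_CHUNK - 1) & ~(LEAK_POOL_CHUNK - 1);
        if (mprotect(leak_base + committed, new_end - committed,
                     PROT_READ | PROT_WRITE) != 0) {
            return NULL;
        }
        /* Publish the larger committed size; never move it backwards. */
        while (committed < new_end &&
               !atomic_compare_exchange_weak(&leak_committed, &committed,
                                             new_end)) {
        }
    }
    return leak_base + off;
}
```

In this sketch the free path would call `leak_pool_owns(ptr)` first and return early on a match, since leak-pool memory is never released. A Windows variant would presumably replace mmap/mprotect with VirtualAlloc using MEM_RESERVE and MEM_COMMIT, but that is not shown here.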
Below is a flame graph illustrating performance after #1072.
BTW, I believe the tbbmalloc_proxy already solves the same issue.
We should look at how they are dealing with the issue.
This is exactly why critnib was made to have wait-free reads.
I also think that this is a solved problem, and it should be easy to fix. We just need to select the best option, as there are multiple possible fixes to choose from.