Avoid false-sharing
False sharing is notoriously bad for the performance of lock-free algorithms, and there are many published results. Two examples:
- Intel's optimization cookbook claims an 80% reduction in wall-clock time on a linear-regression workload.
- In the case of multicore OCaml, @anmolsahoo25 found a 23% reduction in wall-clock time on a benchmark with just two threads.

We've had some discussion about this on lockfree. There are at least two acceptable solutions, and the rest of this issue describes both. Let me know what you think; I'm happy to open a PR with either.
## Expose memory-aligned allocation
A somewhat principled, C-style solution is to call posix_memalign or aligned_alloc (C11) to obtain a block of memory that occupies entire cache line(s), which makes it impossible for those lines to be falsely shared with other objects. Aligned blocks are also useful for fast vectorized math (SSE, AVX).
I claim we can do this relatively easily by routing these allocations through large_allocate with a special path that uses aligned_alloc rather than malloc. ~~The current threshold for large alloc is 129, while a cache line is 64 on most systems, so we're pretty close to using this infra anyway.~~
See this approach here.
## Add padding
A nice workaround is to add padding. If a block spans multiple cache lines and only the middle is accessed, then alignment doesn't matter. @polytypic implemented this in multicore-magic. The disadvantage is that we cannot pad the area before the atomic (or it wouldn't be a valid Atomic.t), so we can still get accidental false sharing if some other atomic in the program has not been padded. The upside is that it's very simple.
If that's the blessed way forward, I would propose adding it to Obj and/or creating Atomic.make_padded.
## Other ideas
@sadiqj mentioned making Atomic.t unboxed, which would let us pad records as in Java or Go. It sounds reasonable, but I do not have the expertise to create a proof of concept.
cc @kayceesrk
> Current threshold for large alloc is 129, while cache line is 64 on most systems, so we're pretty close to using this infra anyway.
This isn't quite right: the threshold for large allocations is 129 words, which is 8256 bytes on 64-bit systems.
In fact, I think the slab allocator should already be cache-aligned for allocations of the right size. Perhaps we should make Atomic.t be a 7-word record, so that with the header it consumes 64 bytes on 64-bit systems?
The multicore allocator tries to allocate objects from domain-local pages. So you will get false sharing if you allocate three objects sequentially and send the middle one to another domain while using the other two locally, but in general you should not get false sharing between objects allocated on one domain and objects allocated on another.
In other words, to avoid false sharing it is good to take a copy of small objects on the receiving domain, but I don't think you necessarily need to pad.
> This isn't quite right: the threshold for large allocations is 129 words, which is 8256 bytes on 64-bit systems.
Rip me. Thanks for pointing that out.
> In fact, I think the slab allocator should already be cache-aligned for allocations of the right size.
I assume that "slabs" refers to the pools in the heap code. As far as I can tell, the memory for pools is cache-aligned, but the individual allocations are not. I tried the following:

```c
value var = caml_alloc_shr(7, 0);
assert((var - 8) % 64 == 32); /* succeeds */
```

Could be a red herring, but it coincides with POOL_HEADER_WSIZE = 4.
> Perhaps we should make Atomic.t be a 7-word record, so that with the header it consumes 64 bytes on 64-bit systems?
I assume that an 8-word pool will only ever hold 8-word chunks, even under sweeping or compaction. With pool allocations being aligned, this proposal sounds fine. All the assumptions here seem load-bearing, though, so ideally we would wrap this in a separate call; that would make it easier for users, and easier for the compiler folks to evolve if needed.
However, I'm not sure that making every Atomic.t 7 words is optimal. It would be better than the status quo, but I would not want to lose the current thin atomic. To add to the list of ideas, I can see myself using both padded and thin atomics even within the same data structure: for example, a queue with contention on fetch-and-add-manipulated (FAD) indices, but which also requires its cells to be atomic for correctness.
cc @sadiqj would you have the time to have a look at the issue ? I am temporarily assigning you (with the semantics that assignee people should try to shepherd discussions about issues but not necessarily fix them), but please free to de-assign yourself (or even better to assign someone else).
I'm happy to put up a PR following @stedolan's suggestion in a week or two (on holiday now).
I think that this can be closed following #12212, although of course there are other sources of false sharing than atomic values that could degrade program performance.