snabb
Elaborate ctable inter-process memory sharing story
When the ctable was initially written (as the "PodHashMap"), its load and save routines were careful to allow the ctable to be accessed directly from disk via mmap. It was a private mapping, but if multiple processes loaded the same compiled ctable, they would likely share those pages -- at least until one of them needed to update the table, in which case the locally modified pages would be copied on write.
However, it hasn't worked like this for a long time, because of hugepages. If the table is large enough, we want it in hugepages, but we can't map memory from a file directly into hugepages: MAP_HUGETLB only works when mapping anonymous (not file-backed) memory.
What we currently do is allocate hugetlb memory, then copy the ctable data into that memory. That adds to each process's private dirty memory, preventing sharing. On the other hand, this more or less guarantees that we won't cause disk paging, and the amount of memory involved is manageable. By way of example, at 32 bytes per entry -- entry size depends on the key and value types -- and 40% occupancy, you get 12.5M entries per gigabyte.
If it's important, we could try to do something with hugetlbfs. But I suspect it's simplest to leave things as they are. For the single-process case it doesn't matter at all. For multiple processes, presumably different processes would access different entries, which would pull different cache lines into the LLC, so we wouldn't get the benefit of cache sharing anyway. Separate copies do put pressure on the TLB, however, which we wouldn't have if the pages were shared.
Another downside of sharing ctable entries is that the sharing would degrade over time, as entries are added to or removed from the table. That could be mitigated with MAP_SHARED, but that would introduce concurrent reads and writes, which we avoid by having separate copies. In contrast, copying up front means that future changes to the tables don't qualitatively affect the program's TLB/cache footprint.
I'm opening this issue as a design question. I think I'm resigned to what we have now, but other thoughts are welcome.
If you do decide you want to share huge pages between processes then check out the trick in memory.lua that does this for DMA memory. The basic technique is to use ljsyscall to automatically mount /var/run/snabb/hugetlbfs where we can allocate file-backed huge pages and map them from multiple processes.
Good idea. I do hesitate a bit -- without a concurrency-safe hash table (something we don't want, I think), we'd need MAP_PRIVATE, but in that case a write might cause a new hugepage to be allocated, which might fail and fault!