Status of MMAP blocks in src/translator/translator.h
What's the status of the #if MMAP blocks in src/translator/translator.h? Does that code work, and why are the blocks disabled?
Hi, as the comments say, that was added for diagnostic reasons, to check whether the mmapping mechanisms work. We use pointer-based model loading from a buffer in our MS-internal server but have not exposed this; the mio code was meant to quickly simulate and test that. There is no good reason not to make it available, apart from the fact that it needs to be coded cleanly. The current thing is only a hack and not ready for public consumption IMO.
Thanks. I'll look into this a bit more, then.
Especially if you want to use that in your server, you should totally go for it. Let me know if you need help.
You may also want to look at https://github.com/marian-nmt/marian-dev/blob/c944633dd257a0f8cde04e4a8ae1f8f5334f0f22/src/microsoft/quicksand.cpp#L62 which is exactly how we use the loading from buffers and at https://github.com/marian-nmt/marian-dev/blob/c944633dd257a0f8cde04e4a8ae1f8f5334f0f22/src/microsoft/quicksand.cpp#L117 where we can use an existing buffer as working memory.
So you can use many workers (some or most of them dormant) and, say, 8 working-memory buffers that are handed out whenever a worker actually becomes active.
That currently only works on the CPU, but a similar mechanism could be implemented on GPU too.
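A minimal sketch of that pattern, outside of the actual quicksand API (the names Workspace and WorkspacePool are hypothetical; real code would tie into Marian's allocator): n workers block on a small pool of pre-allocated working-memory buffers, so only as many buffers exist as there are simultaneously active workers.

```cpp
#include <condition_variable>
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Scratch memory for one active translation (hypothetical placeholder).
struct Workspace {
  std::vector<char> memory;
  explicit Workspace(std::size_t bytes) : memory(bytes) {}
};

class WorkspacePool {
public:
  // e.g. count = 8 buffers shared among many workers
  WorkspacePool(std::size_t count, std::size_t bytes) {
    for (std::size_t i = 0; i < count; ++i) {
      owned_.push_back(std::make_unique<Workspace>(bytes));
      free_.push_back(owned_.back().get());
    }
  }
  // Blocks until a buffer is available; dormant workers hold nothing.
  Workspace* acquire() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return !free_.empty(); });
    Workspace* ws = free_.back();
    free_.pop_back();
    return ws;
  }
  void release(Workspace* ws) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      free_.push_back(ws);
    }
    cv_.notify_one();
  }
private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::vector<std::unique_ptr<Workspace>> owned_; // owns the buffers
  std::vector<Workspace*> free_;                  // currently unused buffers
};
```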
If you're thinking about a Bergamot context, keep in mind that we need to run on different SIMD widths. And the weight storage format depends on SIMD width.
If you need to make changes to the *.bin format let us know as soon as possible so we can figure these out together. We currently depend quite heavily on that format.
Just to reiterate what I've said elsewhere: ideally, binary models should be self-explanatory to Marian so that it knows what settings to run with. A binary model should self-advertise its type (float32, intgemm16, intgemm8, intgemm8shifted, intgemm8shiftedAlpha, intgemm8shiftedAlphaAll, intgemm8shiftedAlphaAllAndTheKitchenSink, etc.) via a magic number or a magic line.
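For illustration only, a sketch of what such self-advertising could look like; this is not the current *.bin layout, and all the names here are hypothetical:

```cpp
#include <cstdint>

// How the weights in the file are quantized / laid out.
enum class GemmType : uint32_t {
  Float32 = 0,
  IntGemm16 = 1,
  IntGemm8 = 2,
  IntGemm8Shifted = 3,
  IntGemm8ShiftedAlpha = 4,
  // ... further variants as they are added
};

// Fixed-size header at the start of the binary model file.
struct BinHeader {
  uint64_t magic;   // fixed constant identifying a Marian binary model
  uint32_t version; // header layout version, for future changes
  GemmType type;    // tells the loader which settings to run with
};
```

The loader would verify the magic number and then dispatch on `type` before touching any weights, instead of relying on command-line flags to match the file.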
On the level of individual weight matrices this is exactly the case, no?
@ugermann To expand upon "the weight storage format depends on SIMD width": the in-RAM representation of the weights depends on whether the CPU supports SSSE3, AVX2, or AVX512. Currently this is handled by storing a canonical format on disk and reordering it into the CPU-dependent format at load time. Microsoft handles it in mmap by forcing everything to be AVX2.
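The load-time dispatch looks roughly like this (a sketch; the reorder routines are hypothetical stand-ins, and the actual interleaving logic lives inside intgemm):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical CPU-specific reorder routines; the interleaving itself is
// omitted in this sketch.
void ReorderSSSE3(const int8_t* in, int8_t* out, std::size_t n) { /* ... */ }
void ReorderAVX2(const int8_t* in, int8_t* out, std::size_t n) { /* ... */ }
void ReorderAVX512(const int8_t* in, int8_t* out, std::size_t n) { /* ... */ }

// Convert the canonical on-disk layout into the layout for the widest
// SIMD instruction set the running CPU supports.
void PrepareWeights(const int8_t* canonical, int8_t* out, std::size_t n) {
  if (__builtin_cpu_supports("avx512bw"))   // GCC/Clang builtin
    ReorderAVX512(canonical, out, n);
  else if (__builtin_cpu_supports("avx2"))
    ReorderAVX2(canonical, out, n);
  else
    ReorderSSSE3(canonical, out, n);
}
```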
There are cases where users will change CPU model on a regular basis, for example with home directories on shared filesystems.
In theory one could have a separate mmap file for each format, lazily created on disk. This would require either doubling storage in the normal case (storing both the baseline and the CPU-specific format) or writing code to transform one format into another on the fly, which is doable but annoying.
We could also load with mmap MAP_PRIVATE and then reorder, but then basically nothing will be shared: every page the reordering writes to becomes a private copy-on-write copy in each process.
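Concretely, the MAP_PRIVATE variant would look something like this (POSIX sketch; ReorderInPlace is a hypothetical stand-in for the CPU-specific reorder):

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map the canonical model file copy-on-write and reorder it in place.
void* MapAndReorder(const char* path, std::size_t* lenOut) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return nullptr;
  struct stat st;
  if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
  // PROT_WRITE + MAP_PRIVATE: writes trigger copy-on-write; they never
  // reach the file, and the copied pages are private to this process.
  void* base = mmap(nullptr, st.st_size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE, fd, 0);
  close(fd); // the mapping stays valid after close
  if (base == MAP_FAILED) return nullptr;
  // ReorderInPlace(base, st.st_size); // hypothetical; dirties most pages
  *lenOut = static_cast<std::size_t>(st.st_size);
  return base;
}
```

Since the reorder touches essentially every page of the weights, nearly the whole model ends up in private anonymous memory, defeating the point of mmapping it.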
Thanks for the clarification. My primary goal is to avoid having n copies of the read-only parameters in memory when I have n CPU workers during inference; they should share that memory. This is more of an ELG thing than a Bergamot one, where we'll have only one or two workers anyway. Faster loading is nice but not quite as crucial, so I'm fine with conversion during loading.
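For workers that are threads in one process, that sharing is straightforward regardless of the on-disk format (a sketch; Model and Translate are hypothetical placeholders): convert once at load time, then hand every worker a pointer to the same const instance.

```cpp
#include <memory>
#include <thread>
#include <vector>

struct Model {
  std::vector<float> weights; // read-only after loading/conversion
};

void Translate(const Model& model, int workerId) {
  (void)model; (void)workerId; // real inference would read model.weights
}

int main() {
  // One copy of the parameters in memory, however many workers run.
  auto model = std::make_shared<const Model>();
  std::vector<std::thread> workers;
  for (int i = 0; i < 8; ++i)
    workers.emplace_back([model, i] { Translate(*model, i); });
  for (auto& w : workers) w.join();
}
```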