
Status of MMAP blocks in src/translator/translator.h

Open ugermann opened this issue 5 years ago • 10 comments

What's the status of the #if MMAP blocks in src/translator/translator.h? Is that code working, and why are the blocks disabled?

ugermann avatar Aug 02 '20 20:08 ugermann

Hi, as the comments say, that code was added for diagnostic reasons, to check that the mmapping mechanisms work. We use pointer-based model loading from a buffer in our MS-internal server but have not exposed this; the mio code was meant to quickly simulate and test that. There is no good reason not to make it available, apart from the fact that it needs to be coded cleanly. The current thing is only a hack and not ready for public consumption, IMO.
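At its core, the mmap path just maps the *.bin file read-only and hands the translator a pointer into it. Below is a minimal POSIX sketch of that idea; mio wraps the same mechanism portably, the MappedModel struct and mapModelFile helper are illustrative names, and the Marian call that actually consumes the pointer is omitted.

```cpp
// Minimal POSIX sketch of mapping a model file read-only. Error handling trimmed.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cstddef>
#include <stdexcept>
#include <string>

struct MappedModel {
  const void* data = nullptr;
  size_t size = 0;
};

MappedModel mapModelFile(const std::string& path) {
  int fd = ::open(path.c_str(), O_RDONLY);
  if (fd < 0) throw std::runtime_error("cannot open " + path);

  struct stat st;
  if (::fstat(fd, &st) != 0) {
    ::close(fd);
    throw std::runtime_error("fstat failed for " + path);
  }

  // MAP_SHARED + PROT_READ: every process mapping this file shares one physical copy.
  void* ptr = ::mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
  ::close(fd);  // the mapping stays valid after the descriptor is closed
  if (ptr == MAP_FAILED) throw std::runtime_error("mmap failed for " + path);

  // The resulting pointer/size pair is what a buffer-based loader would consume.
  return MappedModel{ptr, static_cast<size_t>(st.st_size)};
}
```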

emjotde avatar Aug 03 '20 14:08 emjotde

Thanks. I'll look into this a bit more, then.

ugermann avatar Aug 03 '20 15:08 ugermann

Especially if you want to use that in your server, you should totally go for it. Let me know if you need help.

emjotde avatar Aug 03 '20 15:08 emjotde

You may also want to look at https://github.com/marian-nmt/marian-dev/blob/c944633dd257a0f8cde04e4a8ae1f8f5334f0f22/src/microsoft/quicksand.cpp#L62, which shows exactly how we use loading from buffers, and at https://github.com/marian-nmt/marian-dev/blob/c944633dd257a0f8cde04e4a8ae1f8f5334f0f22/src/microsoft/quicksand.cpp#L117, where we use an existing buffer as working memory.

So you can use many workers (some or most of them dormant) and, say, 8 working-memory buffers that are handed out whenever a worker actually becomes active.

That currently only works on the CPU, but a similar mechanism could be implemented on GPU too.
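A generic sketch of that worker/workspace scheme follows; this is not Marian code, and the WorkspacePool class is an illustrative name. The point is that memory scales with the number of buffers, not the number of workers.

```cpp
// Illustrative sketch: a fixed pool of working-memory buffers that active
// workers borrow and return. Many workers can exist, but only poolSize of
// them can hold a workspace at any one time.
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

class WorkspacePool {
public:
  WorkspacePool(size_t numBuffers, size_t bytesPerBuffer)
      : buffers_(numBuffers, std::vector<char>(bytesPerBuffer)) {
    for (auto& b : buffers_) free_.push_back(&b);
  }

  // Blocks until a workspace is available, then hands it to the caller.
  std::vector<char>* acquire() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [&] { return !free_.empty(); });
    auto* buf = free_.back();
    free_.pop_back();
    return buf;
  }

  // Returns a workspace to the pool and wakes one waiting worker.
  void release(std::vector<char>* buf) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      free_.push_back(buf);
    }
    cv_.notify_one();
  }

private:
  std::vector<std::vector<char>> buffers_;
  std::vector<std::vector<char>*> free_;
  std::mutex mutex_;
  std::condition_variable cv_;
};

// Usage: e.g. 64 workers but only 8 buffers; a worker that becomes active
// calls acquire(), runs translation in that workspace, then release()s it.
```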

emjotde avatar Aug 03 '20 15:08 emjotde

If you're thinking about a Bergamot context, keep in mind that we need to run on different SIMD widths. And the weight storage format depends on SIMD width.

kpu avatar Aug 03 '20 22:08 kpu

If you need to make changes to the *.bin format let us know as soon as possible so we can figure these out together. We currently depend quite heavily on that format.

emjotde avatar Aug 04 '20 01:08 emjotde

Just to reiterate what I've said elsewhere: ideally, binary models should be self-explanatory to Marian so that it knows what settings to run with. A binary model should self-advertise its type (float32, intgemm16, intgemm8, intgemm8shifted, intgemm8shiftedAlpha, intgemm8shiftedAlphaAll, intgemm8shiftedAlphaAllAndTheKitchenSink, etc.) via a magic number or a magic line.
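A hypothetical sketch of what such a self-describing header could look like is below; the magic string, the GemmType values, and the readHeader helper are all made up for illustration and are not the current *.bin layout.

```cpp
// Hypothetical self-describing header -- illustrative only, not the actual
// *.bin format. The loader reads it and dispatches on the stored type.
#include <cstdint>
#include <cstring>
#include <fstream>
#include <stdexcept>
#include <string>

enum class GemmType : uint32_t {
  Float32 = 0,
  IntGemm16 = 1,
  IntGemm8 = 2,
  IntGemm8Shifted = 3,
  IntGemm8ShiftedAlpha = 4,
};

struct ModelHeader {
  char magic[8];      // e.g. "MARIANBN"
  uint32_t version;   // header/format version
  GemmType gemmType;  // how the weights are stored
};

ModelHeader readHeader(const std::string& path) {
  std::ifstream in(path, std::ios::binary);
  ModelHeader h{};
  in.read(reinterpret_cast<char*>(&h), sizeof(h));
  if (!in || std::strncmp(h.magic, "MARIANBN", 8) != 0)
    throw std::runtime_error(path + " is not a self-describing model file");
  return h;  // caller switches on h.gemmType to pick the right backend
}
```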

ugermann avatar Aug 04 '20 01:08 ugermann

On the level of individual weight matrices this is exactly the case, no?

emjotde avatar Aug 04 '20 01:08 emjotde

@ugermann To expand on "the weight storage format depends on SIMD width": the representation of the weights in RAM depends on whether the CPU supports SSSE3, AVX2, or AVX512. Currently this is handled by storing the weights in a canonical format on disk and reordering them into a CPU-dependent format at load time. Microsoft handles it in mmap by forcing everything to be AVX2.
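The load-time choice boils down to a runtime CPU-feature check roughly like the following sketch; __builtin_cpu_supports is a GCC/Clang builtin, and the WeightLayout enum is an illustrative name.

```cpp
// Sketch of runtime SIMD detection used to choose the in-memory weight layout.
// __builtin_cpu_supports is GCC/Clang-specific; MSVC would need __cpuidex instead.
enum class WeightLayout { AVX512, AVX2, SSSE3, Fallback };

inline WeightLayout detectLayout() {
  if (__builtin_cpu_supports("avx512bw")) return WeightLayout::AVX512;
  if (__builtin_cpu_supports("avx2"))     return WeightLayout::AVX2;
  if (__builtin_cpu_supports("ssse3"))    return WeightLayout::SSSE3;
  return WeightLayout::Fallback;
}
```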

There are cases where users will change CPU model on a regular basis, for example with home directories on shared filesystems.

In theory one could have separate mmap files for each format lazily created on disk. This would require either doubling storage in the normal case (storing the baseline and the CPU-specific format) or writing code to transform from one format to another on the fly, which is doable but annoying.

We could also load with mmap MAP_PRIVATE and then reorder, but then basically nothing would be shared.
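Roughly, that MAP_PRIVATE variant would look like the sketch below; mapPrivateAndReorder and reorderForThisCpu are placeholder names, and the point is that the in-place reorder dirties the copy-on-write pages, so the reordered weights end up private to each process rather than shared.

```cpp
// Sketch of the MAP_PRIVATE approach: map copy-on-write, then rewrite the
// weights in place for the local SIMD layout. Every page touched by the
// reorder gets a private copy, so little remains shared across processes.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cstddef>
#include <stdexcept>
#include <string>

// Placeholder for the actual canonical -> CPU-specific transform; no-op here.
void reorderForThisCpu(char* /*weights*/, size_t /*size*/) {}

char* mapPrivateAndReorder(const std::string& path, size_t& sizeOut) {
  int fd = ::open(path.c_str(), O_RDONLY);
  if (fd < 0) throw std::runtime_error("cannot open " + path);
  struct stat st;
  if (::fstat(fd, &st) != 0) {
    ::close(fd);
    throw std::runtime_error("fstat failed for " + path);
  }

  // PROT_WRITE on a MAP_PRIVATE mapping is fine even for an O_RDONLY fd:
  // writes go to private copy-on-write pages and never back to the file.
  void* ptr = ::mmap(nullptr, st.st_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
  ::close(fd);
  if (ptr == MAP_FAILED) throw std::runtime_error("mmap failed for " + path);

  sizeOut = static_cast<size_t>(st.st_size);
  reorderForThisCpu(static_cast<char*>(ptr), sizeOut);  // dirties (privatizes) the pages
  return static_cast<char*>(ptr);
}
```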

kpu avatar Aug 04 '20 08:08 kpu

Thanks for the clarification. My primary goal is to avoid having n copies of the read-only parameters in memory when I have n CPU workers during inference; they should share that memory. This is more of an ELG thing than a Bergamot one, since in Bergamot we'll have only one or two workers anyway. Faster loading is nice but not quite as crucial, so I'm fine with conversion during loading.

ugermann avatar Aug 04 '20 11:08 ugermann