
Very high memory usage of Marian while CPU decoding compared to Amun

Open frzme opened this issue 5 years ago • 12 comments

It seems that Marian uses considerably more memory for CPU decoding than Amun. The comparison is done with an RNN model trained with Nematus, stored as a 515MB npz file. While a marian-server with 4 cpu-threads and a 256M workspace takes ~4.5GB of memory, the same model with 4 cpu-threads in Amun only takes ~800MB. The memory usage of Marian also seems to scale almost linearly with the number of cpu-threads (I think I've seen 96 threads take ~100GB of memory (?)).

Is Marian loading the model once per CPU thread? Is there a way to get Marian's CPU memory consumption to more closely resemble Amun's? On GPUs the memory behavior seems to be about the same.

frzme avatar Feb 24 '20 11:02 frzme

@frzme Currently each thread gets its own workspace and allocates its own copy of the model, which is why memory scales almost linearly with the thread count: roughly one model copy plus one workspace per thread. There are ways to work around that, but it would depend on how you use Marian.

For instance, when you have your own server you can use memory-mapping for the models when running on the CPU, and the workspace can be handled by an outside server and handed in per query. But that requires coding.
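
To make the memory-mapping part concrete, here is a minimal POSIX sketch (not Marian code; the file name is a placeholder, and the hand-off to Marian happens through the entry points linked later in this thread) of mapping a model read-only so that every worker shares one copy of the weights instead of each loading a private copy into the heap:

```cpp
#include <fcntl.h>     // open
#include <sys/mman.h>  // mmap, munmap
#include <sys/stat.h>  // fstat
#include <unistd.h>    // close
#include <cstdio>

int main() {
  // Map the converted *.bin model read-only; the OS shares the pages
  // across all threads (and even processes) instead of duplicating them.
  const char* path = "model.bin";  // placeholder path
  int fd = open(path, O_RDONLY);
  if(fd < 0) { perror("open"); return 1; }

  struct stat st;
  if(fstat(fd, &st) != 0) { perror("fstat"); return 1; }

  void* model = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
  if(model == MAP_FAILED) { perror("mmap"); return 1; }
  close(fd);  // the mapping stays valid after closing the descriptor

  // 'model' can now be handed to every worker as a const void*; see the
  // scorers.h constructor linked further down for where Marian accepts it.

  munmap(model, st.st_size);
  return 0;
}
```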

I would need to know more about your specific scenario to give advice.

emjotde avatar Apr 20 '20 16:04 emjotde

Currently we are just running the provided marian-server as is and talking to it from a different process. With Amun we used a Python-based server built on the included Python server example, which required much less memory.

frzme avatar Apr 20 '20 16:04 frzme

Right, marian-server is essentially a demo. There is work on an actual server going on, but I am not sure about the performance and resource handling. @ugermann or @kpu how is that going along?

emjotde avatar Apr 20 '20 16:04 emjotde

Thank you, that sounds great! I believe that marian-decoder has the same memory characteristics, however, and also requires more RAM than Amun. I can verify tomorrow in case that is useful for you.

frzme avatar Apr 20 '20 17:04 frzme

Oh yes, it's the same code path. In Amun it was easy because everything was essentially static and models could just be shared. In Marian this is much more complicated, as inference during translation is the same as the forward step during training (but with freeing of used nodes). We might think about having a different memory allocator that uses new and does not pre-allocate memory, but that might be sub-optimal in multi-threaded settings as well.
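
For readers unfamiliar with that trade-off, the sketch below (an illustration, not Marian's actual allocator) contrasts the two strategies: a pre-allocated bump allocator, which is roughly what a per-thread workspace amounts to, versus plain operator new per tensor. The bump allocator pays its worst-case size up front per thread; per-tensor heap allocation only pays for live tensors but adds per-allocation cost and heap contention when many threads decode at once:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Pre-allocated workspace: one big reservation per thread, trivial to
// allocate from, freed wholesale. Memory cost: the worst case, always.
class BumpAllocator {
public:
  explicit BumpAllocator(size_t bytes) : buffer_(bytes), offset_(0) {}

  void* allocate(size_t bytes) {
    if(offset_ + bytes > buffer_.size())
      return nullptr;  // workspace exhausted
    void* p = buffer_.data() + offset_;
    offset_ += bytes;  // no alignment handling; real allocators align
    return p;
  }

  void reset() { offset_ = 0; }  // "free" everything between queries

private:
  std::vector<uint8_t> buffer_;
  size_t offset_;
};

// Alternative: allocate each tensor on the shared heap. Memory cost tracks
// live tensors only, but every allocation hits the global heap, which can
// serialize threads under heavy multi-threaded decoding.
void* heapAllocate(size_t bytes) { return ::operator new(bytes); }
void heapFree(void* p) { ::operator delete(p); }
```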

At MS we have our own non-public server where we integrated Marian, but since there is currently no proper server in the repo, the improvements for query-based resource handling are not used anywhere (but they are actually in the code).

emjotde avatar Apr 20 '20 18:04 emjotde

@emjotde where are they in the code?

ugermann avatar Apr 22 '20 22:04 ugermann

@ugermann This constructor can take a memory-mapped model for CPU decoding and reuse it across multiple scorers or workers:

https://github.com/marian-nmt/marian/blob/3c7a88f4e974d90a91c11a0ff804f7a494b81937/src/translator/scorers.h#L88

Used here in the MS-internal wrapper: https://github.com/marian-nmt/marian/blob/3c7a88f4e974d90a91c11a0ff804f7a494b81937/src/microsoft/quicksand.cpp#L62

Also here, from a pointer and from a memory-mapped file: https://github.com/marian-nmt/marian/blob/3c7a88f4e974d90a91c11a0ff804f7a494b81937/src/translator/scorers.h#L157

Then here you can pass a pre-allocated buffer as a workspace: https://github.com/marian-nmt/marian/blob/3c7a88f4e974d90a91c11a0ff804f7a494b81937/src/microsoft/quicksand.cpp#L117 using cpu::WrappedDevice from here: https://github.com/marian-nmt/marian/blob/3c7a88f4e974d90a91c11a0ff804f7a494b81937/src/tensors/device.h#L60
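
Pulling the linked pieces together, a rough sketch of the intended wiring (the function name buildWorker is hypothetical, and the exact signatures are approximated from the linked lines and from src/microsoft/quicksand.cpp, so treat them as assumptions rather than the verbatim API):

```cpp
#include "marian.h"              // Marian headers, assumed on the include path
#include "translator/scorers.h"  // createScorers(...) overload taking raw pointers
#include "tensors/device.h"      // cpu::WrappedDevice

using namespace marian;

// Sketch of a worker that owns no large resources itself: the model bytes
// (e.g. an mmap'ed *.bin file) and the workspace buffer are handed in.
void buildWorker(Ptr<Options> options,
                 const void* modelMemory,  // shared, memory-mapped model
                 uint8_t* workspace,       // pre-allocated by the server
                 size_t workspaceBytes) {
  // scorers.h (around L157 in the linked revision) has a createScorers
  // overload that takes raw model pointers instead of file names, so all
  // workers can share one mapped copy of the weights.
  std::vector<const void*> modelPtrs = {modelMemory};
  auto scorers = createScorers(options, modelPtrs);

  // device.h (around L60) provides cpu::WrappedDevice, a Device that wraps
  // externally owned memory instead of allocating its own; quicksand.cpp
  // (around L117) uses it to hand a per-query workspace to the graph.
  auto device = New<cpu::WrappedDevice>(DeviceId{0, DeviceType::cpu});
  device->set(workspace, workspaceBytes);

  // ... build the expression graph on top of 'device' and run translation,
  // as the quicksand wrapper does ...
}
```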

emjotde avatar Apr 22 '20 22:04 emjotde

The code inside src/microsoft is used to wrap Marian into C# code, so it's a good reference. It's essentially a class for a single worker that gets its resources from the outside in the form of pre-allocated raw memory or memory-mapped files where possible. I'll be happy to walk you through it if needed.

emjotde avatar Apr 22 '20 22:04 emjotde

@emjotde: Thanks, that's very useful!

ugermann avatar Apr 23 '20 00:04 ugermann

I know! :)

emjotde avatar Apr 23 '20 00:04 emjotde

Ah, another thing: to memory-map a model or consume it from a memory buffer you need to use the *.bin format, which can be produced with marian-convert. Any of the CPU-side model types will work.
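
For anyone following along: in the Marian repo the converter binary is built as marian-conv, and the conversion is a one-liner along the lines of marian-conv -f model.npz -t model.bin (the -f/-t from/to flags are from my recollection of the tool; check its --help output against your build).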

emjotde avatar Apr 23 '20 00:04 emjotde

@frzme: We're building a Marian REST server for a couple of EU projects that we are involved in. The best documentation on that is here: https://github.com/ugermann/marian-docker/. The REST server is currently not in the Marian master branch and will likely be moved to a separate repo, so that we have our own issue tracker and versioning.

ugermann avatar Apr 23 '20 00:04 ugermann