Estimate memory requirements for graph
This is similar in spirit to #214, but a bit more general.
It would be useful to be able to estimate the total context memory requirement given a computation graph or a list of tensor descriptions. This would make implementing newer models that much easier, since the implementer wouldn't need to estimate all the memory usage by hand.
For computation graphs this wouldn't add any overhead, as long as the graph size stays constant between invocations: in that case the context's memory buffer can be re-used (I've successfully done this for GPT2 in https://github.com/smspillaz/ggml-gobject).
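Roughly, the reuse pattern looks like this (sketch only; the buffer and its size are placeholders that would come from the estimate and be allocated once up front):

```c
// Sketch: reuse one caller-owned buffer across invocations when the graph
// size is constant. ggml_init() then does no allocation of its own.
#include <stdint.h>
#include "ggml.h"

static uint8_t * compute_buf      = NULL;  // allocated once, reused every call
static size_t    compute_buf_size = 0;     // e.g. determined by a measure pass

struct ggml_context * new_eval_ctx(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ compute_buf_size,
        /*.mem_buffer =*/ compute_buf,   // caller-owned, reused on every call
        /*.no_alloc   =*/ false,
    };
    return ggml_init(params);
}
```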
I think that in order to implement this, you could have a flag on ggml_context such that when new tensors are created in that context, they don't actually allocate any memory for the data (the object overhead could either go into its own memory pool or onto the stack/heap). Writing to the tensors would be a no-op, as would ggml_graph_compute. Once the computation graph has been created, the library consumer could query the context's estimated memory usage, e.g. by walking all the objects in the ggml_object list and tallying up their sizes.
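A minimal sketch of how that measure pass might look, assuming a no_alloc-style flag on ggml_init_params and a placeholder build_graph() helper; it also accesses the cgraph fields directly, and the sum over-counts views and in-place ops since it ignores shared buffers:

```c
// Sketch of a "measure" pass: create tensors without data buffers, then tally
// what a real context would need. build_graph() is a hypothetical helper.
#include <stdbool.h>
#include "ggml.h"

struct ggml_cgraph * build_graph(struct ggml_context * ctx);  // placeholder

size_t estimate_graph_mem(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,  // only needs to hold tensor/graph metadata
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,          // tensors get no data buffers
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_cgraph * gf = build_graph(ctx);

    // object + tensor header overhead actually consumed by the context
    size_t total = ggml_used_mem(ctx);

    // plus the data each tensor in the graph would need
    // (views and in-place ops make this an over-estimate)
    for (int i = 0; i < gf->n_leafs; ++i) total += ggml_nbytes(gf->leafs[i]);
    for (int i = 0; i < gf->n_nodes; ++i) total += ggml_nbytes(gf->nodes[i]);

    ggml_free(ctx);
    return total;
}
```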
I haven't looked very closely at the details - maybe data allocations are needed in order to build the graph somehow, which would make this infeasible. But if not, I could try doing this myself and submitting a pull request, if it belongs in the library.
this is mostly possible if you don't mind reading the implementation of every function to figure out exactly what it does:
https://github.com/saharNooby/rwkv.cpp/blob/6b26e0db28b26f0fb2c73c5aa6ff490818fb1456/rwkv.cpp#L942-L958
https://github.com/saharNooby/rwkv.cpp/blob/6b26e0db28b26f0fb2c73c5aa6ff490818fb1456/rwkv.cpp#L505-L519
Yes, it is currently annoying that you have to pre-compute the necessary size. I'm thinking about ways to solve this. The proposed solution is one way to do it. Will try to prioritize this feature soon.
the latest version of GGML trashed the approach linked above so severely (WHY do ggml_views allocate ANOTHER extra tensor now??) that I'm going to have to redo the entire system, so that's fun
Was just wondering if there was any update on this - I can also start looking into this myself
There is an implementation in llama.cpp that does this, among other things. It is not entirely automated as you are suggesting here; you have to avoid writing to the tensors while creating a dummy graph for measuring the memory requirements. https://github.com/ggerganov/llama.cpp/pull/2411
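For reference, the pattern from that PR looks roughly like this; it is only a sketch, using the ggml-alloc API names from around that time, with build_graph() as a placeholder:

```c
// Sketch of the ggml-alloc "measure" pattern. The graph is built in a
// no_alloc context and its tensors must not be written to during this pass.
#include "ggml.h"
#include "ggml-alloc.h"

struct ggml_cgraph * build_graph(struct ggml_context * ctx);  // placeholder

size_t measure_compute_buffer(struct ggml_context * ctx) {
    // a dummy allocator that only tracks offsets, without a real buffer
    struct ggml_allocr * alloc = ggml_allocr_new_measure(/*alignment =*/ 32);

    // walks the graph and returns the buffer size a real allocator would need
    size_t mem_size = ggml_allocr_alloc_graph(alloc, build_graph(ctx));

    ggml_allocr_free(alloc);

    // afterwards: allocate mem_size bytes, create the real allocator with
    // ggml_allocr_new(buf, mem_size, 32), rebuild the graph, and allocate again
    return mem_size;
}
```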
well, rwkv.cpp has a new implementation if you're interested that uses "future tensors": basically predicting the number of objects and the amount of memory that each tensor operation will use. The prediction functions get quite a bit nicer; the links below show the relevant parts, and a rough sketch of the general idea follows after them:
https://github.com/saharNooby/rwkv.cpp/blob/84f34c548b4d24981a0a6f2ee5c4030686f26ced/rwkv.cpp#L481-L612
https://github.com/saharNooby/rwkv.cpp/blob/84f34c548b4d24981a0a6f2ee5c4030686f26ced/rwkv.cpp#L770-L790
https://github.com/saharNooby/rwkv.cpp/blob/84f34c548b4d24981a0a6f2ee5c4030686f26ced/rwkv.cpp#L825-L880
https://github.com/saharNooby/rwkv.cpp/blob/84f34c548b4d24981a0a6f2ee5c4030686f26ced/rwkv.cpp#L958-L1021
https://github.com/saharNooby/rwkv.cpp/blob/84f34c548b4d24981a0a6f2ee5c4030686f26ced/rwkv.cpp#L1066-L1127C2
https://github.com/saharNooby/rwkv.cpp/blob/84f34c548b4d24981a0a6f2ee5c4030686f26ced/rwkv.cpp#L1179-L1251
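Loosely, the idea behind those prediction functions (this is an illustrative sketch, not the actual rwkv.cpp code) is to mirror each graph-building call with a function that only counts objects and bytes:

```c
// Illustrative sketch only: each predicted operation bumps an object count and
// a byte count instead of creating any tensors in a ggml context.
#include <stddef.h>
#include <stdint.h>

struct future_ctx {
    size_t n_objects;  // ggml_object/ggml_tensor headers that will be created
    size_t n_bytes;    // tensor data bytes that will be needed
};

// e.g. an op producing one new [n, m] F32 result adds one tensor of n*m floats
static void future_new_f32_2d(struct future_ctx * fc, int64_t n, int64_t m) {
    fc->n_objects += 1;
    fc->n_bytes   += (size_t) n * (size_t) m * sizeof(float);
}

// total context size = per-object header overhead + data (plus some slack)
static size_t future_ctx_size(const struct future_ctx * fc, size_t object_overhead) {
    return fc->n_objects * object_overhead + fc->n_bytes;
}
```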
other than that, I have nothing :/
OK, perhaps I can try to backport ggerganov/llama.cpp#2411 to here?
Created #433. I want to try to implement this in ggml-gobject as well, just to test that it works correctly (no reason why it shouldn't, since the allocator parts are relatively standalone).
@ggerganov occasionally syncs the ggml code between ggml/whisper.cpp/llama.cpp; I suppose you just have to poke him and he will do it... whenever he has time :)