Estimate memory requirements for graph
This is similar in spirit to #214, but a bit more general.
It would be useful to be able to estimate the total context memory requirement given a computation graph or a list of tensor descriptions. This would make implementing newer models that much easier, since the implementer wouldn't need to estimate all the memory usage by hand.
For computation graphs this wouldn't add any overhead, as long as the graph size stays constant between invocations: in that case the context's memory buffer can be re-used (I've successfully done this for GPT2 in https://github.com/smspillaz/ggml-gobject).
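Roughly, the reuse pattern looks like this (sketch only; the buffer and its size are placeholders that would come from the estimate and be allocated once up front):

```c
// Sketch: reuse one caller-owned buffer across invocations when the graph
// size is constant. ggml_init() then does no allocation of its own.
#include <stdint.h>
#include "ggml.h"

static uint8_t * compute_buf      = NULL;  // allocated once, reused every call
static size_t    compute_buf_size = 0;     // e.g. determined by a measure pass

struct ggml_context * new_eval_ctx(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ compute_buf_size,
        /*.mem_buffer =*/ compute_buf,   // caller-owned, reused on every call
        /*.no_alloc   =*/ false,
    };
    return ggml_init(params);
}
```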
I think that in order to implement this, you could have a flag on ggml_context such that when new tensors are created in that context, they don't actually allocate any memory for the data (the object overhead could either go into its own memory pool or onto the stack/heap). Writing to the tensors would be a no-op, as would ggml_graph_compute. Once the computation graph has been created, the library consumer could query the context's estimated memory usage, e.g. by walking all the objects in the ggml_object list and tallying up their sizes.
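A minimal sketch of how that measure pass might look, assuming a no_alloc-style flag on ggml_init_params and a placeholder build_graph() helper; it also accesses the cgraph fields directly, and the sum over-counts views and in-place ops since it ignores shared buffers:

```c
// Sketch of a "measure" pass: create tensors without data buffers, then tally
// what a real context would need. build_graph() is a hypothetical helper.
#include <stdbool.h>
#include "ggml.h"

struct ggml_cgraph * build_graph(struct ggml_context * ctx);  // placeholder

size_t estimate_graph_mem(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,  // only needs to hold tensor/graph metadata
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,          // tensors get no data buffers
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_cgraph * gf = build_graph(ctx);

    // object + tensor header overhead actually consumed by the context
    size_t total = ggml_used_mem(ctx);

    // plus the data each tensor in the graph would need
    // (views and in-place ops make this an over-estimate)
    for (int i = 0; i < gf->n_leafs; ++i) total += ggml_nbytes(gf->leafs[i]);
    for (int i = 0; i < gf->n_nodes; ++i) total += ggml_nbytes(gf->nodes[i]);

    ggml_free(ctx);
    return total;
}
```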
I haven't looked very closely at the details - maybe data allocations are needed in order to build the graph somehow, which would make this infeasible. But if not, I could try doing this myself and submitting a pull request, if it belongs in the library.
this is mostly possible if you don't mind reading the implementation of every function to figure out exactly what it does:
https://github.com/saharNooby/rwkv.cpp/blob/6b26e0db28b26f0fb2c73c5aa6ff490818fb1456/rwkv.cpp#L942-L958
https://github.com/saharNooby/rwkv.cpp/blob/6b26e0db28b26f0fb2c73c5aa6ff490818fb1456/rwkv.cpp#L505-L519
Yes, it is currently annoying that you have to pre-compute the necessary size. I'm thinking about ways to solve this. The proposed solution is one way to do it. Will try to prioritize this feature soon.
the latest version of GGML trashed the approach linked above so severely (WHY do ggml_views allocate ANOTHER extra tensor now??) that I'm going to have to redo the entire system, so that's fun
Was just wondering if there was any update on this - I can also start looking into this myself
There is an implementation in llama.cpp that does this, among other things. It is not entirely automated as you are suggesting here; you have to avoid writing to the tensors while creating a dummy graph for measuring the memory requirements. https://github.com/ggerganov/llama.cpp/pull/2411
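For reference, the pattern from that PR looks roughly like this; it is only a sketch, using the ggml-alloc API names from around that time, with build_graph() as a placeholder:

```c
// Sketch of the ggml-alloc "measure" pattern. The graph is built in a
// no_alloc context and its tensors must not be written to during this pass.
#include "ggml.h"
#include "ggml-alloc.h"

struct ggml_cgraph * build_graph(struct ggml_context * ctx);  // placeholder

size_t measure_compute_buffer(struct ggml_context * ctx) {
    // a dummy allocator that only tracks offsets, without a real buffer
    struct ggml_allocr * alloc = ggml_allocr_new_measure(/*alignment =*/ 32);

    // walks the graph and returns the buffer size a real allocator would need
    size_t mem_size = ggml_allocr_alloc_graph(alloc, build_graph(ctx));

    ggml_allocr_free(alloc);

    // afterwards: allocate mem_size bytes, create the real allocator with
    // ggml_allocr_new(buf, mem_size, 32), rebuild the graph, and allocate again
    return mem_size;
}
```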
well, rwkv.cpp has a new implementation if you're interested that uses "future tensors": basically predicting the number of objects and the amount of memory that each tensor operation will use. The prediction functions get quite a bit nicer; the links below show the relevant parts, and a rough sketch of the general idea follows after them:
https://github.com/saharNooby/rwkv.cpp/blob/84f34c548b4d24981a0a6f2ee5c4030686f26ced/rwkv.cpp#L481-L612
https://github.com/saharNooby/rwkv.cpp/blob/84f34c548b4d24981a0a6f2ee5c4030686f26ced/rwkv.cpp#L770-L790
https://github.com/saharNooby/rwkv.cpp/blob/84f34c548b4d24981a0a6f2ee5c4030686f26ced/rwkv.cpp#L825-L880
https://github.com/saharNooby/rwkv.cpp/blob/84f34c548b4d24981a0a6f2ee5c4030686f26ced/rwkv.cpp#L958-L1021
https://github.com/saharNooby/rwkv.cpp/blob/84f34c548b4d24981a0a6f2ee5c4030686f26ced/rwkv.cpp#L1066-L1127C2
https://github.com/saharNooby/rwkv.cpp/blob/84f34c548b4d24981a0a6f2ee5c4030686f26ced/rwkv.cpp#L1179-L1251
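Loosely, the idea behind those prediction functions (this is an illustrative sketch, not the actual rwkv.cpp code) is to mirror each graph-building call with a function that only counts objects and bytes:

```c
// Illustrative sketch only: each predicted operation bumps an object count and
// a byte count instead of creating any tensors in a ggml context.
#include <stddef.h>
#include <stdint.h>

struct future_ctx {
    size_t n_objects;  // ggml_object/ggml_tensor headers that will be created
    size_t n_bytes;    // tensor data bytes that will be needed
};

// e.g. an op producing one new [n, m] F32 result adds one tensor of n*m floats
static void future_new_f32_2d(struct future_ctx * fc, int64_t n, int64_t m) {
    fc->n_objects += 1;
    fc->n_bytes   += (size_t) n * (size_t) m * sizeof(float);
}

// total context size = per-object header overhead + data (plus some slack)
static size_t future_ctx_size(const struct future_ctx * fc, size_t object_overhead) {
    return fc->n_objects * object_overhead + fc->n_bytes;
}
```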
other than that, I have nothing :/
OK, perhaps I can try to backport ggerganov/llama.cpp#2411 to here?
Created #433. I want to try to implement this in ggml-gobject as well, just to test that it works correctly (no reason why it shouldn't, since the allocator parts are relatively standalone).
@ggerganov occasionally syncs the ggml code between ggml/whisper.cpp/llama.cpp; I suppose you just have to poke him and he will do it... whenever he has time :)