Dynamic batch size estimation
Specific Demand
When using batches, it can be difficult to find the right batch size, especially with the wide variety of hardware found in end-user applications.
Implementation Suggestion
Instead of accepting a fixed batch size, we could try to estimate the maximum batch size the client can handle:
- Query the available memory on the device
- Run with a batch size of one and measure roughly how much memory was allocated
- Use that to estimate the maximum batch size the client can handle
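The steps above could be sketched roughly like this. Everything here is an assumption for illustration: the memory-probing functions (`available_device_memory`, `memory_allocated_by`) are hypothetical placeholders that a real implementation would replace with queries against the actual device backend, and the numbers are stand-ins.

```rust
/// Hypothetical: total free memory on the device, in bytes.
/// A real implementation would query the GPU/CPU backend.
fn available_device_memory() -> u64 {
    8 * 1024 * 1024 * 1024 // assume 8 GiB free, for illustration only
}

/// Hypothetical: run the model once with `batch_size` items and report
/// how much memory the run allocated, in bytes. Stand-in numbers: a
/// fixed model overhead plus a linear per-item cost.
fn memory_allocated_by(batch_size: usize) -> u64 {
    let model_overhead: u64 = 2 * 1024 * 1024 * 1024;
    let per_item: u64 = 64 * 1024 * 1024;
    model_overhead + batch_size as u64 * per_item
}

/// Estimate the largest batch size that fits in device memory, capped at
/// 512 since larger batches may not help throughput anyway.
fn estimate_max_batch_size() -> usize {
    const BATCH_CAP: usize = 512;
    const HEADROOM: f64 = 0.8; // leave 20% of free memory untouched

    let free = available_device_memory() as f64 * HEADROOM;

    // Probe with two small batch sizes to separate the fixed model
    // overhead from the per-item memory cost.
    let with_one = memory_allocated_by(1) as f64;
    let with_two = memory_allocated_by(2) as f64;
    let per_item = (with_two - with_one).max(1.0);
    let overhead = with_one - per_item;

    let fit = ((free - overhead) / per_item).floor().max(1.0) as usize;
    fit.min(BATCH_CAP)
}

fn main() {
    println!("estimated max batch size: {}", estimate_max_batch_size());
}
```

With these stand-in numbers the estimate comes out well under the cap, which is the common case on consumer hardware; the cap only matters on machines with a lot of free memory.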
Is this equivalent to the `llama.cpp` batch size setting? If so, I recall seeing various `llama-bench` results where the benefits of larger batching waned after a size of 512. So how you assess the "max" size that a client can handle might want to take that into account too?
Yes, it is similar to batches in `llama.cpp`. In kalosm, the batches could also be for embedding models like BERT. I have noticed slower performance when large batch sizes are used to load prompts, which is definitely something to look into more when implementing batch size estimation.