Dynamic batch size estimation
Specific Demand
When using batches, it can be difficult to find the right batch size, especially with the wide variety of hardware found in end-user applications.
Implementation Suggestion
Instead of accepting a fixed batch size, we could try to estimate the maximum batch size the client can handle:
- Query the available memory on the device
- Run with a batch size of one and measure roughly how much memory was allocated
- Use that to estimate the maximum batch size the client can handle
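The steps above could be sketched roughly like this. Everything here is an assumption for illustration: the memory-probing functions (`available_device_memory`, `memory_allocated_by`) are hypothetical placeholders that a real implementation would replace with queries against the actual device backend, and the numbers are stand-ins.

```rust
/// Hypothetical: total free memory on the device, in bytes.
/// A real implementation would query the GPU/CPU backend.
fn available_device_memory() -> u64 {
    8 * 1024 * 1024 * 1024 // assume 8 GiB free, for illustration only
}

/// Hypothetical: run the model once with `batch_size` items and report
/// how much memory the run allocated, in bytes. Stand-in numbers: a
/// fixed model overhead plus a linear per-item cost.
fn memory_allocated_by(batch_size: usize) -> u64 {
    let model_overhead: u64 = 2 * 1024 * 1024 * 1024;
    let per_item: u64 = 64 * 1024 * 1024;
    model_overhead + batch_size as u64 * per_item
}

/// Estimate the largest batch size that fits in device memory, capped at
/// 512 since larger batches may not help throughput anyway.
fn estimate_max_batch_size() -> usize {
    const BATCH_CAP: usize = 512;
    const HEADROOM: f64 = 0.8; // leave 20% of free memory untouched

    let free = available_device_memory() as f64 * HEADROOM;

    // Probe with two small batch sizes to separate the fixed model
    // overhead from the per-item memory cost.
    let with_one = memory_allocated_by(1) as f64;
    let with_two = memory_allocated_by(2) as f64;
    let per_item = (with_two - with_one).max(1.0);
    let overhead = with_one - per_item;

    let fit = ((free - overhead) / per_item).floor().max(1.0) as usize;
    fit.min(BATCH_CAP)
}

fn main() {
    println!("estimated max batch size: {}", estimate_max_batch_size());
}
```

With these stand-in numbers the estimate comes out well under the cap, which is the common case on consumer hardware; the cap only matters on machines with a lot of free memory.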
Is this equivalent to the `llama.cpp` batch size setting? If so, I recall seeing various `llama-bench` results where the benefits of larger batching waned after a size of 512. So how you assess the "max" size that a client can handle might want to take that into account too?
Yes, it is similar to batches in `llama.cpp`. In kalosm, the batches could also be for embedding models like BERT. I have noticed slower performance when large batch sizes are used to load prompts, which is definitely something to look into more when implementing batch size estimation.