
Dynamic batch size estimation

Open ealmloff opened this issue 1 year ago • 2 comments

Specific Demand

When using batches, it can be difficult to find the right batch size, especially for end-user applications that run on a wide variety of hardware.

Implement Suggestion

Instead of accepting a fixed batch size, we could try to estimate the maximum batch size the client can handle:

  1. Get the memory usage of the device with something like this
  2. Run with a batch size of one and measure roughly how much memory was allocated
  3. From that, find the max batch size the client can handle

ealmloff avatar May 27 '24 02:05 ealmloff

Is this equivalent to llama.cpp batch size setting? If so I recall seeing various llama-bench results where the benefits of larger batching waned after 512 size. So how you assess "max" size that a client can handle might want to take that into account too?

polarathene avatar Jun 08 '24 07:06 polarathene

> Is this equivalent to llama.cpp batch size setting? If so I recall seeing various llama-bench results where the benefits of larger batching waned after 512 size. So how you assess "max" size that a client can handle might want to take that into account too?

Yes, it is similar to batches in llama.cpp. In kalosm, batches can also be used for embedding models like BERT. I have noticed slower performance when large batch sizes are used to load prompts, which is definitely something to look into more when implementing batch size estimation.
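
One way to fold the "gains wane past a point" observation into the estimate is to grow the batch size only while measured throughput keeps improving. This is a hypothetical sketch, not kalosm code: `measure` stands in for a real benchmark closure, and the 40% threshold and synthetic curve in `main` are made up for illustration:

```rust
/// Hypothetical sketch: double the batch size while each doubling still
/// improves measured throughput by at least `min_gain` (e.g. 0.4 = 40%).
fn find_throughput_knee(
    mut measure: impl FnMut(u32) -> f64, // items/sec at a given batch size
    max_batch: u32,
    min_gain: f64,
) -> u32 {
    let mut best = 1;
    let mut best_throughput = measure(1);
    let mut candidate = 2;
    while candidate <= max_batch {
        let throughput = measure(candidate);
        if throughput < best_throughput * (1.0 + min_gain) {
            break; // gains waned; stop growing the batch
        }
        best = candidate;
        best_throughput = throughput;
        candidate *= 2;
    }
    best
}

fn main() {
    // Synthetic throughput curve that saturates around batch size 512.
    let curve = |b: u32| (b as f64) / (1.0 + b as f64 / 512.0);
    let knee = find_throughput_knee(curve, 4096, 0.4);
    println!("chosen batch size: {}", knee); // prints 512
}
```

In a real implementation the result of this search would be capped by the memory-based estimate above it, since a batch that benchmarks well can still OOM on a smaller device.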

ealmloff avatar Jun 09 '24 00:06 ealmloff