When will buffer queues be enabled?
I noticed the documentation states that requests can queue when all llama.cpp instances are busy. Is the queuing done per llama.cpp server or per slot? I am currently trying to scale up from one llama.cpp server to several, and the paddler_requests_buffered metric is always 0.
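For context, here is the minimal sketch I'm using to watch that counter while saturating the slots with requests. The management address (127.0.0.1:8085) and the /metrics path are just my local assumptions, so adjust them to however your balancer is configured:

```python
# Poll the balancer's metrics endpoint while sending load, to see whether
# paddler_requests_buffered ever rises above 0.
# NOTE: the URL below is an assumption about my local setup, not a
# documented default -- check your own Paddler configuration.
import time
import urllib.request

METRICS_URL = "http://127.0.0.1:8085/metrics"  # assumed management address


def read_buffered_count() -> int | None:
    """Fetch the metrics page and return the paddler_requests_buffered value, if present."""
    with urllib.request.urlopen(METRICS_URL, timeout=5) as resp:
        for line in resp.read().decode().splitlines():
            # Prometheus text format: "# HELP ..." / "# TYPE ..." lines are
            # comments, so matching on the bare metric name skips them.
            if line.startswith("paddler_requests_buffered"):
                return int(float(line.rsplit(" ", 1)[-1]))
    return None


if __name__ == "__main__":
    # Poll once a second while a load test keeps every slot busy.
    for _ in range(30):
        print("paddler_requests_buffered =", read_buffered_count())
        time.sleep(1)
```

Even with all slots busy, this prints 0 for me the whole time.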
@bodybreaker I will check if those metrics work correctly and get back to you.
@bodybreaker I have just released a new stable version of Paddler (v1.0.0) and changed the CLI framework; overall, the project underwent a total rewrite.
I think your issue should be solved now; if it still persists, feel free to reopen (please check the README first, though, as some flag names have changed).