Option for pre-loading specific models into memory
Not sure if this feature is possible, but I'd like the ability to specify (preferably in my .env file) models to keep pre-loaded in memory. It shouldn't be the default, but it would let bandwidth-constrained servers run faster, as well as reduce overall latency when running as an API.
Thanks for making this, and I look forward to seeing your plans for the API refactor! :smiley:
Here are some thoughts:
But maybe a simpler way would be to declare a path as a tmpfs mount in the docker-compose file, and have the API code copy the model files into that tmpfs location at startup (see the sketch below).
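To illustrate, here is a minimal docker-compose sketch of that idea; the service name, image, target path, and size are assumptions, not the project's actual configuration:

```yaml
# Hypothetical sketch: serve model weights from a RAM-backed tmpfs mount.
# The API would copy the model files into the target path at startup.
services:
  api:
    image: example/api:latest        # placeholder image
    volumes:
      - type: tmpfs
        target: /usr/src/app/weights # assumed model directory inside the container
        tmpfs:
          size: 17179869184          # 16 GiB in bytes; size it for the models you pre-load
```

Since tmpfs lives in RAM, the copy at startup pays the disk read once, and subsequent model loads come straight from memory.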
Another thought: using https://github.com/hyperonym/basaran
I'm curious what other ideas people will come up with on this matter.
Hi, on my side I already tested this by mounting the /var/lib/docker directory and the repository directory as tmpfs (manually, at the Unix level), and it's hardly any faster if you already have an NVMe drive; in any case, I didn't notice much of a difference.
On the other hand, I wonder whether it would be worthwhile to raise the priority of the process at startup, when it is generating a response, e.g. `chrt -f 90 llama` (see the sketch below). And above all, wouldn't it be more efficient to run the process directly outside of Docker?
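For reference, a small shell sketch of that idea; the binary name and arguments are assumptions:

```sh
# Launch under the SCHED_FIFO real-time policy with priority 90
# (requires root or CAP_SYS_NICE).
chrt -f 90 ./llama -m models/7B/model.bin   # hypothetical binary and arguments

# Or raise the priority of an already-running process by PID:
chrt -f -p 90 "$(pgrep -f llama)"

# Verify the policy and priority that were applied:
chrt -p "$(pgrep -f llama)"
```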
Just clone and build the branch with mmap allocation: https://github.com/ggerganov/llama.cpp/tree/mmap. And yes, it is faster than the main branch; it's almost instant once the process has already run. I tried some modifications to the memory allocation in main.cpp (a sketch of that kind of allocation follows below).
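For context, here is a minimal, self-contained sketch of what mmap-based model loading looks like; this is not the actual code from that branch, and the file path is a placeholder:

```cpp
// Minimal sketch of loading model weights via mmap instead of read()-ing
// them into a malloc'd buffer. Because the mapping is backed by the OS page
// cache, a second run of the process finds the pages already resident and
// "loads" almost instantly.
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    const char * path = "models/7B/ggml-model-q4_0.bin"; // placeholder path

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return 1; }

    // Map the whole file read-only. MAP_POPULATE (Linux-specific) pre-faults
    // the pages so the first tokens are not delayed by page faults.
    void * data = mmap(NULL, st.st_size, PROT_READ,
                       MAP_SHARED | MAP_POPULATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    // Optionally pin the mapping in RAM so it cannot be swapped out
    // (needs a sufficient RLIMIT_MEMLOCK; failure is non-fatal here).
    if (mlock(data, st.st_size) != 0) { perror("mlock"); }

    printf("mapped %lld bytes at %p\n", (long long) st.st_size, data);

    // ... tensor pointers would be set up directly into the mapping here ...

    munmap(data, st.st_size);
    close(fd);
    return 0;
}
```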
Here is the main version:

*(screenshot)*
And the custom version with mmap allocation (I added some code to re-enable AVX-512 in this version):

*(screenshot)*
Sorry, here is the latest main version; the previous screenshot was of an older main version, from about two days earlier, I think:

*(screenshot)*
Closed via #129