Option for pre-loading specific models into memory
Not sure if this feature is possible, but I'd like the ability to specify (preferably in my .env file) models to keep pre-loaded in memory. It shouldn't be the default, but it would let bandwidth-constrained servers run faster, as well as reduce overall latency when running as an API.
Thanks for making this, and I look forward to seeing your plans for the API refactor! :smiley:
Here are some thoughts:
But maybe a simpler way would be to declare a path as a tmpfs mount in the docker-compose file, and have the API code copy the model files into that tmpfs location at startup (see the sketch below).
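To illustrate, here is a minimal docker-compose sketch of that idea; the service name, image, target path, and size are assumptions, not the project's actual configuration:

```yaml
# Hypothetical sketch: serve model weights from a RAM-backed tmpfs mount.
# The API would copy the model files into the target path at startup.
services:
  api:
    image: example/api:latest        # placeholder image
    volumes:
      - type: tmpfs
        target: /usr/src/app/weights # assumed model directory inside the container
        tmpfs:
          size: 17179869184          # 16 GiB in bytes; size it for the models you pre-load
```

Since tmpfs lives in RAM, the copy at startup pays the disk read once, and subsequent model loads come straight from memory.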
Another thought: using https://github.com/hyperonym/basaran
I'm curious what other ideas people will come up with on this matter.
Hi, on my side I already tested this by mounting the /var/lib/docker directory and the repository directory as tmpfs (manually, at the Unix level), and it's hardly any faster if you already have an NVMe drive; in any case, I didn't notice much of a difference.
On the other hand, I wonder whether it would be worthwhile to raise the priority of the process at startup, when it is generating a response, e.g. `chrt -f 90 llama` (see the sketch below). And above all, wouldn't it be more efficient to run the process directly outside of Docker?
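For reference, a small shell sketch of that idea; the binary name and arguments are assumptions:

```sh
# Launch under the SCHED_FIFO real-time policy with priority 90
# (requires root or CAP_SYS_NICE).
chrt -f 90 ./llama -m models/7B/model.bin   # hypothetical binary and arguments

# Or raise the priority of an already-running process by PID:
chrt -f -p 90 "$(pgrep -f llama)"

# Verify the policy and priority that were applied:
chrt -p "$(pgrep -f llama)"
```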
Just clone and build the branch with mmap allocation: https://github.com/ggerganov/llama.cpp/tree/mmap. And yes, it is faster than the main branch; it's almost instant once the process has already run. I tried some modifications to the memory allocation in main.cpp (a sketch of that kind of allocation follows below).
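For context, here is a minimal, self-contained sketch of what mmap-based model loading looks like; this is not the actual code from that branch, and the file path is a placeholder:

```cpp
// Minimal sketch of loading model weights via mmap instead of read()-ing
// them into a malloc'd buffer. Because the mapping is backed by the OS page
// cache, a second run of the process finds the pages already resident and
// "loads" almost instantly.
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    const char * path = "models/7B/ggml-model-q4_0.bin"; // placeholder path

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return 1; }

    // Map the whole file read-only. MAP_POPULATE (Linux-specific) pre-faults
    // the pages so the first tokens are not delayed by page faults.
    void * data = mmap(NULL, st.st_size, PROT_READ,
                       MAP_SHARED | MAP_POPULATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    // Optionally pin the mapping in RAM so it cannot be swapped out
    // (needs a sufficient RLIMIT_MEMLOCK; failure is non-fatal here).
    if (mlock(data, st.st_size) != 0) { perror("mlock"); }

    printf("mapped %lld bytes at %p\n", (long long) st.st_size, data);

    // ... tensor pointers would be set up directly into the mapping here ...

    munmap(data, st.st_size);
    close(fd);
    return 0;
}
```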
Here is the main version:

*(screenshot)*
And the custom version with mmap allocation (I added some code to re-enable AVX-512 in this version):

*(screenshot)*
Sorry, here is the latest main version; the previous screenshot was of an older main version, from about two days earlier, I think:

*(screenshot)*
Closed via #129