Model loading more than 60x slower compared to Serge (llama backend)
Hello,
I was testing some models using Serge and noticed that the same models load much more slowly with the LocalAI backend. For starters, I could not even load the mixtral-8x7b models without getting a timeout error (after more than 30 minutes), while in Serge the same model loads and even starts streaming results in less than 30 seconds.
I also tested the Mistral model available in Serge with the LocalAI framework: 3 seconds loading time in Serge versus 3-6 minutes in LocalAI (again, regardless of whether I use the CPU or GPU version).
LocalAI version:
v2.15.0 (f69de3be0d274a676f1d1cd302dc4699f1b5aaf0)

I'm running LocalAI with:

```bash
docker run --rm -ti --gpus all -p 8080:8080 -e DEBUG=true -v $PWD/models:/models --name local-ai localai/localai:latest-aio-gpu-nvidia-cuda-12 --models-path /models --context-size 1000 --threads 14
```
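For reference, a per-model definition in the mounted `/models` directory typically looks something like the sketch below. This is a hypothetical example, not the exact file used here: the config name, GGUF file name, and `gpu_layers` value are placeholders, and the fields follow LocalAI's YAML model-config format (exact options can vary between versions).

```bash
# Hypothetical model definition dropped into the mounted models directory.
# "mistral" and the .gguf file name are placeholders; gpu_layers controls how many
# layers the llama-cpp backend offloads to the GPU (0 means CPU-only).
cat > models/mistral.yaml <<'EOF'
name: mistral
backend: llama-cpp
parameters:
  model: mistral-7b-instruct-v0.2.Q4_K_M.gguf
context_size: 1000
threads: 14
gpu_layers: 35
EOF
```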
Environment, CPU architecture, OS, and Version:
13th Gen Intel(R) Core(TM) i9-13900H @ 2.60 GHz on Windows 11, running in Docker for Windows.
Describe the bug
Extremely long model loading times with LocalAI's llama-cpp backend compared to the Serge framework when using the same models and hardware. Loading times are the same regardless of whether the CPU or GPU version of LocalAI is used.
| Model | LocalAI loading time | Serge loading time |
|---|---|---|
| Mistral | 5 minutes on average | 3 seconds |
| Mixtral 8x7B | times out (after more than 30 minutes) | 30 seconds |
To Reproduce
1. Start LocalAI with the Docker command above and place the Mistral (or Mixtral 8x7B) model in the mounted `models` directory.
2. Send a chat completion request to trigger model loading (e.g. with the timing command below) and measure the time until a response arrives.
3. Load the same model in Serge on the same machine and compare the loading times.
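A minimal way to time the first request (which includes model loading) against LocalAI's OpenAI-compatible API; the model name `mistral` is a placeholder for whatever the model or config in `/models` is called:

```bash
# The first request after startup forces the backend to load the model, so the
# measured time is roughly the load time plus a few generated tokens.
# "mistral" is a placeholder model name; adjust it to the local model/config name.
time curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistral",
        "messages": [{"role": "user", "content": "Say hello"}],
        "max_tokens": 8
      }'
```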
Expected behavior
Loading speeds similar to those provided by Serge should be possible with LocalAI, since both use a llama.cpp-based backend.
Logs
For comparison, I also added the Serge log.
Additional context