Model loading more than 60x slower compared to Serge (llama backend)
Hello,
I was testing some models using Serge and noticed that the same models load much more slowly with the LocalAI backend. For starters, I could not even load the mixtral-8x7b models without getting a timeout error (after more than 30 minutes), while in Serge the same model loads and even starts streaming results in less than 30 seconds.
I also tested the Mistral model available in Serge with the LocalAI framework: 3 seconds loading time in Serge versus 3-6 minutes in LocalAI (again, regardless of whether I use the CPU or GPU version).
LocalAI version:
v2.15.0 (f69de3be0d274a676f1d1cd302dc4699f1b5aaf0)

I'm running LocalAI with:

```bash
docker run --rm -ti --gpus all -p 8080:8080 -e DEBUG=true -v $PWD/models:/models --name local-ai localai/localai:latest-aio-gpu-nvidia-cuda-12 --models-path /models --context-size 1000 --threads 14
```
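For reference, a per-model definition in the mounted `/models` directory typically looks something like the sketch below. This is a hypothetical example, not the exact file used here: the config name, GGUF file name, and `gpu_layers` value are placeholders, and the fields follow LocalAI's YAML model-config format (exact options can vary between versions).

```bash
# Hypothetical model definition dropped into the mounted models directory.
# "mistral" and the .gguf file name are placeholders; gpu_layers controls how many
# layers the llama-cpp backend offloads to the GPU (0 means CPU-only).
cat > models/mistral.yaml <<'EOF'
name: mistral
backend: llama-cpp
parameters:
  model: mistral-7b-instruct-v0.2.Q4_K_M.gguf
context_size: 1000
threads: 14
gpu_layers: 35
EOF
```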
Environment, CPU architecture, OS, and Version:
13th Gen Intel(R) Core(TM) i9-13900H @ 2.60 GHz on Windows 11, running in Docker for Windows.
Describe the bug
Extremely long model loading times with LocalAI's llama-cpp backend compared to the Serge framework when using the same models and hardware. Loading times are the same regardless of whether the CPU or GPU version of LocalAI is used.
| Model | LocalAI loading time | Serge loading time |
|---|---|---|
| Mistral | 5 minutes on average | 3 seconds |
| Mixtral 8x7B | times out (after more than 30 minutes) | 30 seconds |
To Reproduce
1. Start LocalAI with the Docker command above and place the Mistral (or Mixtral 8x7B) model in the mounted `models` directory.
2. Send a chat completion request to trigger model loading (e.g. with the timing command below) and measure the time until a response arrives.
3. Load the same model in Serge on the same machine and compare the loading times.
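A minimal way to time the first request (which includes model loading) against LocalAI's OpenAI-compatible API; the model name `mistral` is a placeholder for whatever the model or config in `/models` is called:

```bash
# The first request after startup forces the backend to load the model, so the
# measured time is roughly the load time plus a few generated tokens.
# "mistral" is a placeholder model name; adjust it to the local model/config name.
time curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistral",
        "messages": [{"role": "user", "content": "Say hello"}],
        "max_tokens": 8
      }'
```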
Expected behavior
Loading speeds similar to those provided by Serge should be possible with LocalAI, since both use a llama.cpp-based backend.
Logs
For comparison, I also added the Serge log.
Additional context