LocalAI
LLaMA-7B-q4 inference uses only 4 threads
Jun 8, 2023

LocalAI version:

Environment, CPU architecture, OS, and Version:
Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz

Describe the bug:
Running inference with ggml-model-q4_0.bin, started with:
docker-compose up -d --pull always
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "your-model.bin",
"prompt": "A long time ago in a galaxy far, far away",
"temperature": 0.7
}'
Only 4 CPU cores are used, while the machine has 40 cores on a single socket.
@imajiayu did you set the threads here? https://github.com/go-skynet/LocalAI/blob/6bb562272dada1da893f8fb1bfc768b6d819d2de/.env#L3
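For reference, the setting being pointed at is the `THREADS` variable in the repository's `.env` file, which LocalAI reads at startup. A minimal sketch (the exact line number and default value may differ between versions):

```shell
# .env (repository root)
# Set the number of threads used for inference.
# Note: prefer the number of physical cores over logical cores.
THREADS=40
```

After changing `.env`, recreate the containers so the new value is picked up, e.g. `docker-compose up -d`.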
Thanks a lot.