exl2 models don't seem to be working with the exllama2 backend
LocalAI version: v2.12.1
Environment, CPU architecture, OS, and Version: Linux chrispc 5.15.133.1-microsoft-standard-WSL2 #1 SMP Thu Oct 5 21:02:42 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
This is Ubuntu 22.04 on WSL2 with NVIDIA drivers available in the VM.
Describe the bug
Using exllama2 directly, just by cloning the repository and installing it as per its GitHub instructions, I'm able to use an exl2 model. Example:
python ./test_inference.py -m ../Mixtral-8x7B-instruct-exl2 -p "In a land far far away ..."
-- Model: ../Mixtral-8x7B-instruct-exl2
-- Options: []
-- Loading model...
-- Loaded model in 32.8406 seconds
-- Loading tokenizer...
-- Warmup...
-- Generating...
In a land far far away ...
A group of explorers, with the help of a few friendly locals, must navigate through a series of increasingly difficult challenges in order to reach their ultimate goal: find the fabled city of gold.
The game is divided into several "scenes" or areas. Each scene contains a set of tasks and puzzles that must be solved in order to move on to the next scene.
Each scene is unique and requires different skills in order to solve the puzzles. Some scenes may require physical strength, others may require agility, and still others may require cunning and intellect.
The game
-- Response generated in 2.50 seconds, 128 tokens, 51.12 tokens/second (includes prompt eval.)
Using the same model with exllama2 through LocalAI, I get:
curl http://chrispc.zarek.cc:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "Mixtral",
"prompt": "A long time ago in a galaxy far, far away"
}'
{"error":{"code":500,"message":"grpc service not ready","type":""}}
See the Logs section below for the logs from this time.
To Reproduce
I adjusted the file ./backend/python/exllama2/install.sh to use the master branch of exllama2 just in case.
## A bash script installs the required dependencies of VALL-E-X and prepares the environment
-export SHA=c0ddebaaaf8ffd1b3529c2bb654e650bce2f790f
+#export SHA=c0ddebaaaf8ffd1b3529c2bb654e650bce2f790f
+export SHA=master
I'm building with:
sudo docker build --build-arg="BUILD_TYPE=cublas" --build-arg="CUDA_MAJOR_VERSION=12" --build-arg="CUDA_MINOR_VERSION=4" -t localai .
Then running this docker-compose:
version: '3.6'
services:
  api:
    ports:
      - 8080:8080
    env_file:
      - .env
    environment:
      - MODELS_PATH=/models
      - DEBUG=true
      - REBUILD=false
    volumes:
      - ./models:/models:cached
      - ./images/:/tmp/generated/images/
      - ../Mixtral-8x7B-instruct-exl2/:/Mixtral
      - ../Llama2-70B-chat-exl2/:/Llama
      - ../Buttercup-4x7B-V2-laser-exl2/:/Buttercup
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
The Mixtral is from https://huggingface.co/turboderp/Mixtral-8x7B-instruct-exl2. I've tried 3.5bpw and 3.0bpw (this particular run uses 3.0). Both work fine when using the built-in example from exllama2, and both fail in this same way when using LocalAI.
The file mixtral.yaml in the /models folder is:
name: Mixtral
parameters:
  model: /Mixtral
backend: exllama2
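As a quick sanity check that the YAML is picked up, listing the registered models should include Mixtral. A minimal sketch, assuming the API is reachable on localhost:8080 (adjust the host to your setup):
# List the models LocalAI has registered via its OpenAI-compatible /v1/models endpoint.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8080/v1/models") as resp:
    models = json.load(resp)
print(json.dumps(models, indent=2))  # "Mixtral" should appear in the returned list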
Logs
Model name: Buttercup
Model name: Llama
Model name: Mixtral
6:54PM DBG Model: Mixtral (config: {PredictionOptions:{Model:/Mixtral Language: N:0 TopP:0xc0001fc320 TopK:0xc0001fc328 Temperature:0xc0001fc330 Maxtokens:0xc0001fc338 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc0001fc360 TypicalP:0xc0001fc358 Seed:0xc0001fc378 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:Mixtral F16:0xc0001fc318 Threads:0xc0001fc310 Debug:0xc0001fc370 Roles:map[] Embeddings:false Backend:exllama2 TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName: ParallelCalls:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc0001fc350 MirostatTAU:0xc0001fc348 Mirostat:0xc0001fc340 NGPULayers:0xc0001fc368 MMap:0xc0001fc370 MMlock:0xc0001fc371 LowVRAM:0xc0001fc371 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc0001fc308 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[] Description: Usage:}) 6:54PM DBG Model: Buttercup (config: {PredictionOptions:{Model:/Buttercup Language: N:0 TopP:0xc0001fc110 TopK:0xc0001fc118 Temperature:0xc0001fc120 Maxtokens:0xc0001fc128 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc0001fc150 TypicalP:0xc0001fc148 Seed:0xc0001fc168 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:Buttercup F16:0xc0001fc108 Threads:0xc0001fc100 Debug:0xc0001fc160 Roles:map[] Embeddings:false Backend:exllama2 TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName: ParallelCalls:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc0001fc140 MirostatTAU:0xc0001fc138 Mirostat:0xc0001fc130 NGPULayers:0xc0001fc158 MMap:0xc0001fc160 MMlock:0xc0001fc161 LowVRAM:0xc0001fc161 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc0001fc0f8 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 
AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[] Description: Usage:}) 6:54PM DBG Model: Llama (config: {PredictionOptions:{Model:/Llama Language: N:0 TopP:0xc0001fc218 TopK:0xc0001fc220 Temperature:0xc0001fc228 Maxtokens:0xc0001fc230 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc0001fc258 TypicalP:0xc0001fc250 Seed:0xc0001fc270 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:Llama F16:0xc0001fc210 Threads:0xc0001fc208 Debug:0xc0001fc268 Roles:map[] Embeddings:false Backend:exllama2 TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName: ParallelCalls:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc0001fc248 MirostatTAU:0xc0001fc240 Mirostat:0xc0001fc238 NGPULayers:0xc0001fc260 MMap:0xc0001fc268 MMlock:0xc0001fc269 LowVRAM:0xc0001fc269 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc0001fc200 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[] Description: Usage:}) 6:54PM DBG Extracting backend assets files to /tmp/localai/backend_data 6:54PM INF core/startup process completed! 6:54PM DBG No configuration file found at /tmp/localai/upload/uploadedFiles.json 6:54PM DBG No configuration file found at /tmp/localai/config/assistants.json 6:54PM DBG No configuration file found at /tmp/localai/config/assistantsFile.json
┌───────────────────────────────────────────────────┐ │ Fiber v2.52.0 │ │ http://127.0.0.1:8080 │ │ (bound on host 0.0.0.0 and port 8080) │ │ │ │ Handlers ........... 181 Processes ........... 1 │ │ Prefork ....... Disabled PID ................. 1 │ └───────────────────────────────────────────────────┘
6:54PM DBG Request received: {"model":"Mixtral","language":"","n":0,"top_p":null,"top_k":null,"temperature":null,"max_tokens":null,"echo":false,"batch":0,"ignore_eos":false,"repeat_penalty":0,"n_keep":0,"frequency_penalty":0,"presence_penalty":0,"tfz":null,"typical_p":null,"seed":null,"negative_prompt":"","rope_freq_base":0,"rope_freq_scale":0,"negative_prompt_scale":0,"use_fast_tokenizer":false,"clip_skip":0,"tokenizer":"","file":"","response_format":{},"size":"","prompt":"A long time ago in a galaxy far, far away","instruction":"","input":null,"stop":null,"messages":null,"functions":null,"function_call":null,"stream":false,"mode":0,"step":0,"grammar":"","grammar_json_functions":null,"backend":"","model_base_name":""}
6:54PM DBG input: &{PredictionOptions:{Model:Mixtral Language: N:0 TopP:
Additional context
It seems that there is an issue with connecting to the gRPC service after loading the Mixtral model. The error message indicates the connection is refused when trying to reach the gRPC service at 127.0.0.1:46605.
To troubleshoot this issue, you can try the following steps:
1. Verify that the backend process is running properly. You can do this by checking the output of the backend command:
   - For exllama2: `ps aux | grep exllama2`
   - For exllama: `ps aux | grep exllama`
2. Ensure that the firewall is not blocking the GRPC port (`46605` in this case). You may need to open the port in the firewall settings or add an exception.
3. Check if there are any other instances of the backend process running, as this could cause a conflict. You can do this by checking the process list using the `ps` command and looking for any duplicate processes.
4. Make sure that there is no network issue preventing the connection to the GRPC service. Check the network connectivity between the host and 127.0.0.1:46605 (see the connectivity sketch below).
5. Try restarting the backend process and see if the issue persists. You can do this by stopping the current process and starting a new one, for example:
   - For exllama2: `kill -9 <process_id> ; exllama2`
   - For exllama: `kill -9 <process_id> ; exllama`
If the issue still persists after trying these steps, you may need to look into specific configuration settings or seek further assistance from the support channels for the Mixtral model or the backend you are using (exllama2, exllama, etc.).
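For step 4 above, here is a minimal connectivity sketch, assuming the port is 46605 as reported in the error (substitute whatever port your logs show):
# Probe whether the backend's gRPC port accepts TCP connections at all.
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    # Returns True if a TCP connection to host:port succeeds within the timeout.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(port_open("127.0.0.1", 46605))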
I also wanted to give exl2 a shot. The model loads and the gRPC server seems fine, but I get this error on inference:
Error rpc error: code = Unknown desc = Exception iterating responses: 'Result' object is not an iterator
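For what it's worth, that message usually comes from the gRPC side: for a server-streaming RPC, grpc calls next() on whatever the handler returns, so the handler has to return a generator/iterator rather than a single result object. A purely illustrative sketch with hypothetical names, not LocalAI's actual backend code:
class Result:
    # Stand-in for whatever object the generate call returns (hypothetical).
    text = "..."

def broken_predict_stream(request, context):
    # Returning a plain object makes gRPC's next() call fail with
    # "TypeError: 'Result' object is not an iterator".
    return Result()

def fixed_predict_stream(request, context):
    # Yielding turns the handler into a generator that gRPC can iterate.
    yield Result()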