exl2 models don't seem to be working with the exllama2 backend
LocalAI version: v2.12.1
Environment, CPU architecture, OS, and Version: Linux chrispc 5.15.133.1-microsoft-standard-WSL2 #1 SMP Thu Oct 5 21:02:42 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
This is Ubuntu 22.04 on WSL2 with NVIDIA drivers available in the VM.
Describe the bug
Using exllama2 directly, just by cloning the repository and installing it as per its GitHub instructions, I'm able to use an exl2 model. Example:
python ./test_inference.py -m ../Mixtral-8x7B-instruct-exl2 -p "In a land far far away ..."
-- Model: ../Mixtral-8x7B-instruct-exl2
-- Options: []
-- Loading model...
-- Loaded model in 32.8406 seconds
-- Loading tokenizer...
-- Warmup...
-- Generating...
In a land far far away ...
A group of explorers, with the help of a few friendly locals, must navigate through a series of increasingly difficult challenges in order to reach their ultimate goal: find the fabled city of gold.
The game is divided into several "scenes" or areas. Each scene contains a set of tasks and puzzles that must be solved in order to move on to the next scene.
Each scene is unique and requires different skills in order to solve the puzzles. Some scenes may require physical strength, others may require agility, and still others may require cunning and intellect.
The game
-- Response generated in 2.50 seconds, 128 tokens, 51.12 tokens/second (includes prompt eval.)
Using the same model with exllama2 through LocalAI, I get:
curl http://chrispc.zarek.cc:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "Mixtral",
"prompt": "A long time ago in a galaxy far, far away"
}'
{"error":{"code":500,"message":"grpc service not ready","type":""}}
See the Logs section below for the logs from this time.
To Reproduce
I adjusted the file ./backend/python/exllama2/install.sh to use the master branch of exllama2 just in case.
## A bash script installs the required dependencies of VALL-E-X and prepares the environment
-export SHA=c0ddebaaaf8ffd1b3529c2bb654e650bce2f790f
+#export SHA=c0ddebaaaf8ffd1b3529c2bb654e650bce2f790f
+export SHA=master
I'm building with:
sudo docker build --build-arg="BUILD_TYPE=cublas" --build-arg="CUDA_MAJOR_VERSION=12" --build-arg="CUDA_MINOR_VERSION=4" -t localai .
Then running this docker-compose:
version: '3.6'
services:
  api:
    ports:
      - 8080:8080
    env_file:
      - .env
    environment:
      - MODELS_PATH=/models
      - DEBUG=true
      - REBUILD=false
    volumes:
      - ./models:/models:cached
      - ./images/:/tmp/generated/images/
      - ../Mixtral-8x7B-instruct-exl2/:/Mixtral
      - ../Llama2-70B-chat-exl2/:/Llama
      - ../Buttercup-4x7B-V2-laser-exl2/:/Buttercup
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
The Mixtral is from https://huggingface.co/turboderp/Mixtral-8x7B-instruct-exl2. I've tried 3.5bpw and 3.0bpw (this particular run uses 3.0). Both work fine when using the built-in example from exllama2, and both fail in this same way when using LocalAI.
The file mixtral.yaml in the /models folder is:
name: Mixtral
parameters:
  model: /Mixtral
backend: exllama2
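As a quick sanity check that the YAML is picked up, listing the registered models should include Mixtral. A minimal sketch, assuming the API is reachable on localhost:8080 (adjust the host to your setup):
# List the models LocalAI has registered via its OpenAI-compatible /v1/models endpoint.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8080/v1/models") as resp:
    models = json.load(resp)
print(json.dumps(models, indent=2))  # "Mixtral" should appear in the returned list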
Logs
Model name: Buttercup
Model name: Llama
Model name: Mixtral
6:54PM DBG Model: Mixtral (config: {PredictionOptions:{Model:/Mixtral Language: N:0 TopP:0xc0001fc320 TopK:0xc0001fc328 Temperature:0xc0001fc330 Maxtokens:0xc0001fc338 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc0001fc360 TypicalP:0xc0001fc358 Seed:0xc0001fc378 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:Mixtral F16:0xc0001fc318 Threads:0xc0001fc310 Debug:0xc0001fc370 Roles:map[] Embeddings:false Backend:exllama2 TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName: ParallelCalls:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc0001fc350 MirostatTAU:0xc0001fc348 Mirostat:0xc0001fc340 NGPULayers:0xc0001fc368 MMap:0xc0001fc370 MMlock:0xc0001fc371 LowVRAM:0xc0001fc371 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc0001fc308 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[] Description: Usage:}) 6:54PM DBG Model: Buttercup (config: {PredictionOptions:{Model:/Buttercup Language: N:0 TopP:0xc0001fc110 TopK:0xc0001fc118 Temperature:0xc0001fc120 Maxtokens:0xc0001fc128 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc0001fc150 TypicalP:0xc0001fc148 Seed:0xc0001fc168 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:Buttercup F16:0xc0001fc108 Threads:0xc0001fc100 Debug:0xc0001fc160 Roles:map[] Embeddings:false Backend:exllama2 TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName: ParallelCalls:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc0001fc140 MirostatTAU:0xc0001fc138 Mirostat:0xc0001fc130 NGPULayers:0xc0001fc158 MMap:0xc0001fc160 MMlock:0xc0001fc161 LowVRAM:0xc0001fc161 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc0001fc0f8 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 
AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[] Description: Usage:}) 6:54PM DBG Model: Llama (config: {PredictionOptions:{Model:/Llama Language: N:0 TopP:0xc0001fc218 TopK:0xc0001fc220 Temperature:0xc0001fc228 Maxtokens:0xc0001fc230 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc0001fc258 TypicalP:0xc0001fc250 Seed:0xc0001fc270 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:Llama F16:0xc0001fc210 Threads:0xc0001fc208 Debug:0xc0001fc268 Roles:map[] Embeddings:false Backend:exllama2 TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName: ParallelCalls:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc0001fc248 MirostatTAU:0xc0001fc240 Mirostat:0xc0001fc238 NGPULayers:0xc0001fc260 MMap:0xc0001fc268 MMlock:0xc0001fc269 LowVRAM:0xc0001fc269 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc0001fc200 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[] Description: Usage:}) 6:54PM DBG Extracting backend assets files to /tmp/localai/backend_data 6:54PM INF core/startup process completed! 6:54PM DBG No configuration file found at /tmp/localai/upload/uploadedFiles.json 6:54PM DBG No configuration file found at /tmp/localai/config/assistants.json 6:54PM DBG No configuration file found at /tmp/localai/config/assistantsFile.json
┌───────────────────────────────────────────────────┐ │ Fiber v2.52.0 │ │ http://127.0.0.1:8080 │ │ (bound on host 0.0.0.0 and port 8080) │ │ │ │ Handlers ........... 181 Processes ........... 1 │ │ Prefork ....... Disabled PID ................. 1 │ └───────────────────────────────────────────────────┘
6:54PM DBG Request received: {"model":"Mixtral","language":"","n":0,"top_p":null,"top_k":null,"temperature":null,"max_tokens":null,"echo":false,"batch":0,"ignore_eos":false,"repeat_penalty":0,"n_keep":0,"frequency_penalty":0,"presence_penalty":0,"tfz":null,"typical_p":null,"seed":null,"negative_prompt":"","rope_freq_base":0,"rope_freq_scale":0,"negative_prompt_scale":0,"use_fast_tokenizer":false,"clip_skip":0,"tokenizer":"","file":"","response_format":{},"size":"","prompt":"A long time ago in a galaxy far, far away","instruction":"","input":null,"stop":null,"messages":null,"functions":null,"function_call":null,"stream":false,"mode":0,"step":0,"grammar":"","grammar_json_functions":null,"backend":"","model_base_name":""}
6:54PM DBG input: &{PredictionOptions:{Model:Mixtral Language: N:0 TopP:
Additional context
It seems that there is an issue with connecting to the gRPC service after loading the Mixtral model. The error message indicates the connection is refused when trying to reach the gRPC service at 127.0.0.1:46605.
To troubleshoot this issue, you can try the following steps:
1. Verify that the backend process is running properly. You can do this by checking the output of the backend command:
   - For exllama2: `ps aux | grep exllama2`
   - For exllama: `ps aux | grep exllama`
2. Ensure that the firewall is not blocking the GRPC port (`46605` in this case). You may need to open the port in the firewall settings or add an exception.
3. Check if there are any other instances of the backend process running, as this could cause a conflict. You can do this by checking the process list using the `ps` command and looking for any duplicate processes.
4. Make sure that there is no network issue preventing the connection to the GRPC service. Check the network connectivity between the host and 127.0.0.1:46605 (see the connectivity sketch below).
5. Try restarting the backend process and see if the issue persists. You can do this by stopping the current process and starting a new one, for example:
   - For exllama2: `kill -9 <process_id> ; exllama2`
   - For exllama: `kill -9 <process_id> ; exllama`
If the issue still persists after trying these steps, you may need to look into specific configuration settings or seek further assistance from the support channels for the Mixtral model or the backend you are using (exllama2, exllama, etc.).
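For step 4 above, here is a minimal connectivity sketch, assuming the port is 46605 as reported in the error (substitute whatever port your logs show):
# Probe whether the backend's gRPC port accepts TCP connections at all.
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    # Returns True if a TCP connection to host:port succeeds within the timeout.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(port_open("127.0.0.1", 46605))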
I also wanted to give exl2 a shot. The model loads and the gRPC server seems fine, but I get this error on inference:
Error rpc error: code = Unknown desc = Exception iterating responses: 'Result' object is not an iterator
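For what it's worth, that message usually comes from the gRPC side: for a server-streaming RPC, grpc calls next() on whatever the handler returns, so the handler has to return a generator/iterator rather than a single result object. A purely illustrative sketch with hypothetical names, not LocalAI's actual backend code:
class Result:
    # Stand-in for whatever object the generate call returns (hypothetical).
    text = "..."

def broken_predict_stream(request, context):
    # Returning a plain object makes gRPC's next() call fail with
    # "TypeError: 'Result' object is not an iterator".
    return Result()

def fixed_predict_stream(request, context):
    # Yielding turns the handler into a generator that gRPC can iterate.
    yield Result()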