
Using the "falcon-ggml" / "falcon" backend for a Falcon model leads to a falcon_model_load: invalid model file (bad magic) error

Open · netandreus opened this issue 2 years ago · 1 comment

Error 500

LocalAI version:

commit 8034ed3473fb1c8c6f5e3864933c442b377be52e (HEAD -> master, origin/master, origin/HEAD)
Author: Jesús Espino <[email protected]>
Date:   Sun Oct 15 09:17:41 2023 +0200

Environment, CPU architecture, OS, and Version:

  • macOS Ventura 13.5.2 (22G91)
  • Apple Silicon M2

Describe the bug

500 error when trying to load the model.

11:41AM DBG GRPC(gpt-3.5-turbo-127.0.0.1:51272): stderr falcon_model_load: invalid model file '/Users/andrey/sandbox/local_ai/current/models/gpt-3.5-turbo' (bad magic)
11:41AM DBG GRPC(gpt-3.5-turbo-127.0.0.1:51272): stderr falcon_bootstrap: failed to load model from '/Users/andrey/sandbox/local_ai/current/models/gpt-3.5-turbo'
[127.0.0.1]:51271 500 - POST /v1/chat/completions

To Reproduce

  • Download the model: https://huggingface.co/hadongz/falcon-7b-instruct-gguf/blob/main/falcon-7b-instruct-q4_0.gguf
  • Save it to ./models/gpt-3.5-turbo (the name is just an example, because I use the MacMind client)
  • Add file ./gpt-3.5-turbo.tmpl with this content:
You are an intelligent chatbot. Help the following question with brilliant answers.
Question: {{.Input}}
Answer:
  • Add file gpt-3.5-turbo.yaml with this content:
backend: falcon-ggml
context_size: 2000
f16: true
gpu_layers: 1
name: gpt-3.5-turbo
parameters:
  model: gpt-3.5-turbo
  temperature: 0.9
  top_k: 40
  top_p: 0.65
  • Build using the official LocalAI docs for Apple Silicon
  • Start LocalAI with this command:
./local-ai --debug
  • Run request with curl:
(base) andrey@m2 current % curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "gpt-3.5-turbo",
     "messages": [{"role": "user", "content": "What is Abu-Dhabi?"}],
     "temperature": 0.9
   }'
{"created":1697527790,"object":"chat.completion","id":"9587206d-0939-4b40-8f5c-1a0695db9a5c","model":"gpt-3.5-turbo","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":" As an intelligent chatbot, I don't have a physical location, but Abu Dhabi is a city in the United Arab Emirates known for its luxurious lifestyle, beautiful beaches, and modern architecture.\u003c|endoftext|\u003e"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}%
  • Got errors:
11:41AM DBG GRPC(gpt-3.5-turbo-127.0.0.1:51272): stderr falcon_model_load: invalid model file '/Users/andrey/sandbox/local_ai/current/models/gpt-3.5-turbo' (bad magic)
11:41AM DBG GRPC(gpt-3.5-turbo-127.0.0.1:51272): stderr falcon_bootstrap: failed to load model from '/Users/andrey/sandbox/local_ai/current/models/gpt-3.5-turbo'
[127.0.0.1]:51271 500 - POST /v1/chat/completions
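The "bad magic" error is consistent with a container-format mismatch: the downloaded file is GGUF, while the falcon-ggml backend expects the older GGML container. A minimal sketch to check which format a model file is in (the function name is hypothetical, and the legacy magic byte sequences are an assumption based on the ggml file layouts; GGUF files are documented to start with the ASCII bytes "GGUF"):

```python
def detect_model_format(path: str) -> str:
    """Report the model container format by inspecting the 4-byte magic."""
    with open(path, "rb") as f:
        magic = f.read(4)
    if magic == b"GGUF":
        return "gguf"
    # Assumed little-endian byte forms of the legacy ggml/ggjt magics.
    if magic in (b"lmgg", b"tjgg"):
        return "ggml-legacy"
    return "unknown"
```

Run against the downloaded falcon-7b-instruct-q4_0.gguf, this should report gguf, which a GGML-era loader would reject as bad magic.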

Expected behavior

  • A chat completion response instead of a 500 error.

Logs

(base) andrey@m2 current % ./local-ai --debug
11:41AM DBG no galleries to load
11:41AM INF Starting LocalAI using 4 threads, with models path: /Users/andrey/sandbox/local_ai/current/models
11:41AM INF LocalAI version: v1.30.0-28-g8034ed3 (8034ed3473fb1c8c6f5e3864933c442b377be52e)
11:41AM DBG Model: gpt-3.5-turbo (config: {PredictionOptions:{Model:gpt-3.5-turbo Language: N:0 TopP:0.65 TopK:40 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:gpt-3.5-turbo F16:true Threads:0 Debug:false Roles:map[] Embeddings:false Backend:falcon-ggml TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:1 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false DraftModel: NDraft:0 Quantization:} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}})
11:41AM DBG Extracting backend assets files to /tmp/localai/backend_data

 ┌───────────────────────────────────────────────────┐
 │                   Fiber v2.49.2                   │
 │               http://127.0.0.1:8080               │
 │       (bound on host 0.0.0.0 and port 8080)       │
 │                                                   │
 │ Handlers ............ 71  Processes ........... 1 │
 │ Prefork ....... Disabled  PID .............. 2836 │
 └───────────────────────────────────────────────────┘

11:41AM DBG Request received:
11:41AM DBG Configuration read: &{PredictionOptions:{Model:gpt-3.5-turbo Language: N:0 TopP:0.65 TopK:40 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:gpt-3.5-turbo F16:true Threads:4 Debug:true Roles:map[] Embeddings:false Backend:falcon-ggml TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:1 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false DraftModel: NDraft:0 Quantization:} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
11:41AM DBG Parameters: &{PredictionOptions:{Model:gpt-3.5-turbo Language: N:0 TopP:0.65 TopK:40 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:gpt-3.5-turbo F16:true Threads:4 Debug:true Roles:map[] Embeddings:false Backend:falcon-ggml TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:1 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false DraftModel: NDraft:0 Quantization:} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
11:41AM DBG Prompt (before templating): What is Abu-Dhabi?
11:41AM DBG Template found, input modified to: You are an intelligent chatbot "Esenia". Help the following question with brilliant answers.
Question: What is Abu-Dhabi?
Answer:
11:41AM DBG Prompt (after templating): You are an intelligent chatbot "Esenia". Help the following question with brilliant answers.
Question: What is Abu-Dhabi?
Answer:
11:41AM DBG Loading model falcon-ggml from gpt-3.5-turbo
11:41AM DBG Loading model in memory from file: /Users/andrey/sandbox/local_ai/current/models/gpt-3.5-turbo
11:41AM DBG Loading GRPC Model falcon-ggml: {backendString:falcon-ggml model:gpt-3.5-turbo threads:4 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0x140001029c0 externalBackends:map[] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false}
11:41AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/falcon-ggml
11:41AM DBG GRPC Service for gpt-3.5-turbo will be running at: '127.0.0.1:51272'
11:41AM DBG GRPC Service state dir: /var/folders/f9/1b1jz83s4ysfn9zfncbsb8y40000gn/T/go-processmanager2128385065
11:41AM DBG GRPC Service Started
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:51272: connect: connection refused"
11:41AM DBG GRPC(gpt-3.5-turbo-127.0.0.1:51272): stderr 2023/10/17 11:41:38 gRPC Server listening at 127.0.0.1:51272
11:41AM DBG GRPC Service Ready
11:41AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:gpt-3.5-turbo ContextSize:2000 Seed:0 NBatch:512 F16Memory:true MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:1 MainGPU: TensorSplit: Threads:4 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/Users/andrey/sandbox/local_ai/current/models/gpt-3.5-turbo Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 Tokenizer: LoraBase: LoraAdapter: NoMulMatQ:false DraftModel: AudioPath: Quantization:}
11:41AM DBG GRPC(gpt-3.5-turbo-127.0.0.1:51272): stderr falcon_model_load: invalid model file '/Users/andrey/sandbox/local_ai/current/models/gpt-3.5-turbo' (bad magic)
11:41AM DBG GRPC(gpt-3.5-turbo-127.0.0.1:51272): stderr falcon_bootstrap: failed to load model from '/Users/andrey/sandbox/local_ai/current/models/gpt-3.5-turbo'
[127.0.0.1]:51271 500 - POST /v1/chat/completions
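The "Prompt (before/after templating)" log lines above show the .tmpl file being applied before inference: the user message is substituted where {{.Input}} appears, via Go's text/template. A rough Python equivalent of that substitution step (illustrative only, not LocalAI's actual implementation):

```python
# The template text mirrors the gpt-3.5-turbo.tmpl file from the repro
# steps; {input} stands in for Go's {{.Input}} placeholder.
TEMPLATE = """You are an intelligent chatbot. Help the following question with brilliant answers.
Question: {input}
Answer:"""

def render_prompt(user_message: str) -> str:
    """Substitute the user message into the prompt template."""
    return TEMPLATE.format(input=user_message)
```

Rendering "What is Abu-Dhabi?" through this sketch yields the same prompt shape as the "Prompt (after templating)" log line.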

netandreus · Oct 17 '23 07:10

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] · Nov 13 '25 02:11

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] · Nov 20 '25 02:11