
MLX Backend raises 401 client error when loading models

Open · johndev168 opened this issue 4 months ago · 1 comment

LocalAI version: 3.5.0

Environment, CPU architecture, OS, and Version: Mac (Apple T8132), macOS 15.5 (24F74), no VM

Describe the bug: Loading a model with the MLX backend does not work. It refuses to load the model and instead throws a 401 client error:

Internal error: failed to load model with internal loader: could not load model (no success): Error loading MLX model: 401 Client Error. (Request ID: Root=***redacted***)

Repository Not Found for url: https://huggingface.co/api/models/gpt-oss-20b-mxfp4.gguf/revision/main.

Please make sure you specified the correct repo_id and repo_type.

If you are trying to access a private or gated repo, make sure you are authenticated. For more details, see https://huggingface.co/docs/huggingface_hub/authentication

Invalid username or password.
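
For context, this trace is what huggingface_hub raises when a bare filename is passed as a repo id. Below is a minimal sketch reproducing the underlying failure, assuming the MLX backend forwards the configured model string to the Hub unchanged (the traceback in the logs suggests it does):

from huggingface_hub import snapshot_download
from huggingface_hub.errors import RepositoryNotFoundError

try:
    # "gpt-oss-20b-mxfp4.gguf" is a local GGUF filename, not a
    # "namespace/name" repo id, so the Hub API answers 401 and
    # huggingface_hub raises RepositoryNotFoundError.
    snapshot_download(repo_id="gpt-oss-20b-mxfp4.gguf")
except RepositoryNotFoundError as err:
    print(type(err).__name__, err)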

This error occurs no matter which model I try to load. If I change the backend to llama-cpp, or even the Metal llama-cpp variant, it works just fine.

To Reproduce: Install the MLX backend and set the backend of any available model to mlx. Then go to the chat tab and ask something. The model will not load and instead throws the error above (an equivalent API call is sketched below).
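
For reference, the same failure can be triggered without the UI via the OpenAI-compatible API. A sketch assuming LocalAI on its default port 8080 and the model id taken from the logs below:

from openai import OpenAI

# LocalAI does not require an API key by default, but the openai
# client insists on one being set.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-local")

# With the mlx backend configured, this request ends in the
# 401/RepositoryNotFoundError shown above instead of a completion.
resp = client.chat.completions.create(
    model="gpt-oss-20b",  # model id as shown in the logs below
    messages=[{"role": "user", "content": "test"}],
)
print(resp.choices[0].message.content)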

Expected behavior: The model should load and be ready for use.

Logs

9:02AM DBG context local model name not found, setting to the first model first model name=smoothie-qwen3-8b
9:02AM DBG Chat endpoint configuration read: &{PredictionOptions:{BasicModelRequest:{Model:gpt-oss-20b-mxfp4.gguf} Language: Translate:false N:0 TopP:0x140005e3280 TopK:0x140005e32a0 Temperature:0x140005e32a8 Maxtokens:0x140005e32c8 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 RepeatLastN:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0x140005e32d0 TypicalP:0x140005e32d8 Seed:0x140005e32e0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 ClipSkip:0 Tokenizer:} Name:gpt-oss-20b F16:0x140005e32f0 Threads:0x140005e32f8 Debug:0x140004166d8 Roles:map[] Embeddings:0x140005e3301 Backend:mlx TemplateConfig:{Chat:<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: {{ now | date "Mon Jan 2 15:04:05 MST 2006" }}

Reasoning: {{if eq .ReasoningEffort ""}}medium{{else}}{{.ReasoningEffort}}{{end}}

# {{with .Metadata}}{{ if ne .system_prompt "" }}{{ .system_prompt }}{{ end }}{{else}}You are a friendly and helpful assistant.{{ end }}<|end|>{{- .Input -}}<|start|>assistant ChatMessage:<|start|>{{ if .FunctionCall -}}functions.{{ .FunctionCall.Name }} to=assistant{{ else if eq .RoleName "assistant"}}assistant<|channel|>final<|message|>{{else}}{{ .RoleName }}{{end}}<|message|>
{{- if .Content -}}
{{- .Content -}}
{{- end -}}
{{- if .FunctionCall -}}
{{- toJson .FunctionCall -}}
{{- end -}}<|end|> Completion:{{.Input}}
 Edit: Functions:<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: {{ now | date "Mon Jan 2 15:04:05 MST 2006" }}

Reasoning: {{if eq .ReasoningEffort ""}}medium{{else}}{{.ReasoningEffort}}{{end}}

# {{with .Metadata}}{{ if ne .system_prompt "" }}{{ .system_prompt }}{{ end }}{{else}}You are a friendly and helpful assistant.{{ end }}<|end|>{{- .Input -}}<|start|>assistant

# Tools

## functions

namespace functions {
{{-range .Functions}}
{{if .Description }}
// {{ .Description }}
{{- end }}
{{- if and .Parameters.Properties (gt (len .Parameters.Properties) 0) }}
type {{ .Name }} = (_: {
{{- range $name, $prop := .Parameters.Properties }}
{{- if $prop.Description }}
  // {{ $prop.Description }}
{{- end }}
  {{ $name }}: {{ if gt (len $prop.Type) 1 }}{{ range $i, $t := $prop.Type }}{{ if $i }} | {{ end }}{{ $t }}{{ end }}{{ else }}{{ index $prop.Type 0 }}{{ end }},
{{- end }}
}) => any;
{{- else }}
type {{ .Function.Name }} = () => any;
{{- end }}
{{- end }}{{/* end of range .Functions */}}
} // namespace functions

# Instructions

<|end|>{{.Input -}}<|start|>assistant UseTokenizerTemplate:false JoinChatMessagesByCharacter:<nil> Multimodal: JinjaTemplate:false ReplyPrefix:} KnownUsecaseStrings:[FLAG_ANY FLAG_COMPLETION FLAG_CHAT] KnownUsecases:0x140005e3350 Pipeline:{TTS: LLM: Transcription: VAD:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: ResponseFormat: ResponseFormatMap:map[] FunctionsConfig:{DisableNoAction:false GrammarConfig:{ParallelCalls:false DisableParallelNewLines:false MixedMode:false NoMixedFreeString:false NoGrammar:false Prefix: ExpectStringsAfterJSON:false PropOrder: SchemaType: GrammarTriggers:[]} NoActionFunctionName: NoActionDescriptionName: ResponseRegex:[] JSONRegexMatch:[] ArgumentRegex:[] ArgumentRegexKey: ArgumentRegexValue: ReplaceFunctionResults:[] ReplaceLLMResult:[] CaptureLLMResult:[] FunctionNameKey: FunctionArgumentsKey:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0x140005e3308 MirostatTAU:0x140005e3328 Mirostat:0x140005e3330 NGPULayers:<nil> MMap:0x140005e3338 MMlock:0x140005e3339 LowVRAM:0x140005e333a Reranking:0x140005e333b Grammar: StopWords:[<|im_end|> <dummy32000> </s> <|endoftext|> <|return|>] Cutstrings:[] ExtractRegex:[] TrimSpace:[] TrimSuffix:[] ContextSize:0x140005e3340 NUMA:false LoraAdapter: LoraBase: LoraAdapters:[] LoraScales:[] LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: LoadFormat: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 DisableLogStatus:false DType: LimitMMPerPrompt:{LimitImagePerPrompt:0 LimitVideoPerPrompt:0 LimitAudioPerPrompt:0} MMProj: FlashAttention:<nil> NoKVOffloading:false CacheTypeK: CacheTypeV: RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 CFGScale:0} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} TTSConfig:{Voice: AudioPath:} CUDA:false DownloadFiles:[] Description: Usage: Options:[] Overrides:[]}
9:02AM DBG Parameters: &{PredictionOptions:{BasicModelRequest:{Model:gpt-oss-20b-mxfp4.gguf} Language: Translate:false N:0 TopP:0x140005e3280 TopK:0x140005e32a0 Temperature:0x140005e32a8 Maxtokens:0x140005e32c8 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 RepeatLastN:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0x140005e32d0 TypicalP:0x140005e32d8 Seed:0x140005e32e0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 ClipSkip:0 Tokenizer:} Name:gpt-oss-20b F16:0x140005e32f0 Threads:0x140005e32f8 Debug:0x140004166d8 Roles:map[] Embeddings:0x140005e3301 Backend:mlx TemplateConfig:{Chat:<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: {{ now | date "Mon Jan 2 15:04:05 MST 2006" }}

Reasoning: {{if eq .ReasoningEffort ""}}medium{{else}}{{.ReasoningEffort}}{{end}}

# {{with .Metadata}}{{ if ne .system_prompt "" }}{{ .system_prompt }}{{ end }}{{else}}You are a friendly and helpful assistant.{{ end }}<|end|>{{- .Input -}}<|start|>assistant ChatMessage:<|start|>{{ if .FunctionCall -}}functions.{{ .FunctionCall.Name }} to=assistant{{ else if eq .RoleName "assistant"}}assistant<|channel|>final<|message|>{{else}}{{ .RoleName }}{{end}}<|message|>
{{- if .Content -}}
{{- .Content -}}
{{- end -}}
{{- if .FunctionCall -}}
{{- toJson .FunctionCall -}}
{{- end -}}<|end|> Completion:{{.Input}}
 Edit: Functions:<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: {{ now | date "Mon Jan 2 15:04:05 MST 2006" }}

Reasoning: {{if eq .ReasoningEffort ""}}medium{{else}}{{.ReasoningEffort}}{{end}}

# {{with .Metadata}}{{ if ne .system_prompt "" }}{{ .system_prompt }}{{ end }}{{else}}You are a friendly and helpful assistant.{{ end }}<|end|>{{- .Input -}}<|start|>assistant

# Tools

## functions

namespace functions {
{{-range .Functions}}
{{if .Description }}
// {{ .Description }}
{{- end }}
{{- if and .Parameters.Properties (gt (len .Parameters.Properties) 0) }}
type {{ .Name }} = (_: {
{{- range $name, $prop := .Parameters.Properties }}
{{- if $prop.Description }}
  // {{ $prop.Description }}
{{- end }}
  {{ $name }}: {{ if gt (len $prop.Type) 1 }}{{ range $i, $t := $prop.Type }}{{ if $i }} | {{ end }}{{ $t }}{{ end }}{{ else }}{{ index $prop.Type 0 }}{{ end }},
{{- end }}
}) => any;
{{- else }}
type {{ .Function.Name }} = () => any;
{{- end }}
{{- end }}{{/* end of range .Functions */}}
} // namespace functions

# Instructions

<|end|>{{.Input -}}<|start|>assistant UseTokenizerTemplate:false JoinChatMessagesByCharacter:<nil> Multimodal: JinjaTemplate:false ReplyPrefix:} KnownUsecaseStrings:[FLAG_ANY FLAG_COMPLETION FLAG_CHAT] KnownUsecases:0x140005e3350 Pipeline:{TTS: LLM: Transcription: VAD:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: ResponseFormat: ResponseFormatMap:map[] FunctionsConfig:{DisableNoAction:false GrammarConfig:{ParallelCalls:false DisableParallelNewLines:false MixedMode:false NoMixedFreeString:false NoGrammar:false Prefix: ExpectStringsAfterJSON:false PropOrder: SchemaType: GrammarTriggers:[]} NoActionFunctionName: NoActionDescriptionName: ResponseRegex:[] JSONRegexMatch:[] ArgumentRegex:[] ArgumentRegexKey: ArgumentRegexValue: ReplaceFunctionResults:[] ReplaceLLMResult:[] CaptureLLMResult:[] FunctionNameKey: FunctionArgumentsKey:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0x140005e3308 MirostatTAU:0x140005e3328 Mirostat:0x140005e3330 NGPULayers:<nil> MMap:0x140005e3338 MMlock:0x140005e3339 LowVRAM:0x140005e333a Reranking:0x140005e333b Grammar: StopWords:[<|im_end|> <dummy32000> </s> <|endoftext|> <|return|>] Cutstrings:[] ExtractRegex:[] TrimSpace:[] TrimSuffix:[] ContextSize:0x140005e3340 NUMA:false LoraAdapter: LoraBase: LoraAdapters:[] LoraScales:[] LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: LoadFormat: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 DisableLogStatus:false DType: LimitMMPerPrompt:{LimitImagePerPrompt:0 LimitVideoPerPrompt:0 LimitAudioPerPrompt:0} MMProj: FlashAttention:<nil> NoKVOffloading:false CacheTypeK: CacheTypeV: RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 CFGScale:0} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} TTSConfig:{Voice: AudioPath:} CUDA:false DownloadFiles:[] Description: Usage: Options:[] Overrides:[]}
9:02AM DBG templated message for chat: <|start|>user<|message|>test<|end|>
9:02AM DBG Prompt (before templating): <|start|>user<|message|>test<|end|>
9:02AM DBG Template found, input modified to: <|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: Tue Sep 9 09:02:58 CEST 2025

Reasoning: medium

# You are a friendly and helpful assistant.<|end|><|start|>user<|message|>test<|end|><|start|>assistant
9:02AM DBG Prompt (after templating): <|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: Tue Sep 9 09:02:58 CEST 2025

Reasoning: medium

# You are a friendly and helpful assistant.<|end|><|start|>user<|message|>test<|end|><|start|>assistant
9:02AM DBG Stream request received
9:02AM INF Success ip=10.20.111.103 latency=591.279083ms method=POST status=200 url=/v1/chat/completions
9:02AM DBG Sending chunk: {"created":1757401378,"object":"chat.completion.chunk","id":"0d9c24d9-2e09-4b77-a0d8-1c2bdf79afb6","model":"gpt-oss-20b","choices":[{"index":0,"finish_reason":"","delta":{"role":"assistant","content":""}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}

9:02AM INF BackendLoader starting backend=mlx modelID=gpt-oss-20b o.model=gpt-oss-20b-mxfp4.gguf
9:02AM DBG Loading model in memory from file: /Users/REDACTED/models/gpt-oss-20b-mxfp4.gguf
9:02AM DBG Loading Model gpt-oss-20b with gRPC (file: /Users/REDACTED/models/gpt-oss-20b-mxfp4.gguf) (backend: mlx): {backendString:mlx model:gpt-oss-20b-mxfp4.gguf modelID:gpt-oss-20b context:{emptyCtx:{}} gRPCOptions:0x1400065af08 externalBackends:map[] grpcAttempts:20 grpcAttemptsDelay:2 parallelRequests:false}
9:02AM DBG Loading external backend: /Users/REDACTED/localai/backends/mlx/run.sh
9:02AM DBG external backend is file: &{name:run.sh size:193 mode:493 modTime:{wall:0 ext:63892527909 loc:0x104c02440} sys:{Dev:16777232 Mode:33261 Nlink:1 Ino:5618613 Uid:501 Gid:20 Rdev:0 Pad_cgo_0:[0 0 0 0] Atimespec:{Sec:1757399946 Nsec:600364638} Mtimespec:{Sec:1756931109 Nsec:0} Ctimespec:{Sec:1757399667 Nsec:939316153} Birthtimespec:{Sec:1756931109 Nsec:0} Size:193 Blocks:8 Blksize:4096 Flags:0 Gen:0 Lspare:0 Qspare:[0 0]}}
9:02AM DBG Loading GRPC Process: /Users/REDACTED/localai/backends/mlx/run.sh
9:02AM DBG GRPC Service for gpt-oss-20b will be running at: '127.0.0.1:59816'
9:02AM DBG GRPC Service state dir: /var/folders/lq/trm5_hc94p17w24ypntmkjpm0000gn/T/go-processmanager3769562646
9:02AM DBG GRPC Service Started
9:02AM DBG Wait for the service to start up
9:02AM DBG Options: ContextSize:8192  Seed:806729975  NBatch:512  F16Memory:true  MMap:true  NGPULayers:9999999  Threads:10  FlashAttention:"auto"
9:02AM DBG GRPC(gpt-oss-20b-127.0.0.1:59816): stdout Initializing libbackend for mlx
9:02AM DBG GRPC(gpt-oss-20b-127.0.0.1:59816): stdout Using portable Python
9:02AM DBG GRPC(gpt-oss-20b-127.0.0.1:59816): stderr /Users/REDACTED/LocalAI/backends/mlx/venv/lib/python3.10/site-packages/transformers/utils/hub.py:111: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
9:02AM DBG GRPC(gpt-oss-20b-127.0.0.1:59816): stderr   warnings.warn(
9:02AM DBG GRPC(gpt-oss-20b-127.0.0.1:59816): stderr Server started. Listening on: 127.0.0.1:59816
9:03AM DBG GRPC Service Ready
9:03AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:0x140007d7958} sizeCache:0 unknownFields:[] Model:gpt-oss-20b-mxfp4.gguf ContextSize:8192 Seed:806729975 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:9999999 MainGPU: TensorSplit: Threads:10 RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/Users/REDACTED/models/gpt-oss-20b-mxfp4.gguf PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 LoadFormat: DisableLogStatus:false DType: LimitImagePerPrompt:0 LimitVideoPerPrompt:0 LimitAudioPerPrompt:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type: FlashAttention:auto NoKVOffload:false ModelPath:/Users/REDACTED/models LoraAdapters:[] LoraScales:[] Options:[] CacheTypeKey: CacheTypeValue: GrammarTriggers:[] Reranking:false Overrides:[]}
9:03AM DBG GRPC(gpt-oss-20b-127.0.0.1:59816): stderr Loading MLX model: gpt-oss-20b-mxfp4.gguf
9:03AM DBG GRPC(gpt-oss-20b-127.0.0.1:59816): stderr Request: Model: "gpt-oss-20b-mxfp4.gguf"
9:03AM DBG GRPC(gpt-oss-20b-127.0.0.1:59816): stderr ContextSize: 8192
9:03AM DBG GRPC(gpt-oss-20b-127.0.0.1:59816): stderr Seed: 806729975
9:03AM DBG GRPC(gpt-oss-20b-127.0.0.1:59816): stderr NBatch: 512
9:03AM DBG GRPC(gpt-oss-20b-127.0.0.1:59816): stderr F16Memory: true
9:03AM DBG GRPC(gpt-oss-20b-127.0.0.1:59816): stderr MMap: true
9:03AM DBG GRPC(gpt-oss-20b-127.0.0.1:59816): stderr NGPULayers: 9999999
9:03AM DBG GRPC(gpt-oss-20b-127.0.0.1:59816): stderr Threads: 10
9:03AM DBG GRPC(gpt-oss-20b-127.0.0.1:59816): stderr ModelFile: "/Users/REDACTED/models/gpt-oss-20b-mxfp4.gguf"
9:03AM DBG GRPC(gpt-oss-20b-127.0.0.1:59816): stderr FlashAttention: "auto"
9:03AM DBG GRPC(gpt-oss-20b-127.0.0.1:59816): stderr ModelPath: "/Users/REDACTED/models"
9:03AM DBG GRPC(gpt-oss-20b-127.0.0.1:59816): stderr
9:03AM DBG GRPC(gpt-oss-20b-127.0.0.1:59816): stderr Options: {}
9:03AM DBG GRPC(gpt-oss-20b-127.0.0.1:59816): stderr Error loading MLX model err=RepositoryNotFoundError('401 Client Error. (Request ID: Root=REDACTED)\n\nRepository Not Found for url: https://huggingface.co/api/models/gpt-oss-20b-mxfp4.gguf/revision/main.\nPlease make sure you specified the correct `repo_id` and `repo_type`.\nIf you are trying to access a private or gated repo, make sure you are authenticated. For more details, see https://huggingface.co/docs/huggingface_hub/authentication\nInvalid username or password.'), type(err)=<class 'huggingface_hub.errors.RepositoryNotFoundError'>
9:03AM ERR Failed to load model gpt-oss-20b with backend mlx error="failed to load model with internal loader: could not load model (no success): Error loading MLX model: 401 Client Error. (Request ID: Root=REDACTED)\n\nRepository Not Found for url: https://huggingface.co/api/models/gpt-oss-20b-mxfp4.gguf/revision/main.\nPlease make sure you specified the correct `repo_id` and `repo_type`.\nIf you are trying to access a private or gated repo, make sure you are authenticated. For more details, see https://huggingface.co/docs/huggingface_hub/authentication\nInvalid username or password." modelID=gpt-oss-20b
9:03AM DBG No choices in the response, skipping
9:03AM ERR Stream ended with error: failed to load model with internal loader: could not load model (no success): Error loading MLX model: 401 Client Error. (Request ID: Root=REDACTED)

Repository Not Found for url: https://huggingface.co/api/models/gpt-oss-20b-mxfp4.gguf/revision/main.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated. For more details, see https://huggingface.co/docs/huggingface_hub/authentication
Invalid username or password.

Additional context: The model itself is unlikely to be at fault, since switching the backend back to llama-cpp works without raising any error.

johndev168 · Sep 9, 2025

It doesn't work that way; see the docs in the release notes: https://github.com/mudler/LocalAI/releases/tag/v3.5.0

You need to add a model with a configuration similar to this (the example here uses Gemma; replace it with the model you want to try out):

name: mlx-gemma
backend: mlx-vlm
parameters:
  # a Hugging Face repo id, not a local file path
  model: "mlx-community/gemma-3n-E2B-it-4bit"
template:
  use_tokenizer_template: true
known_usecases:
- chat

Note that this example uses the mlx-vlm backend because Gemma is multimodal. For text-only models like gpt-oss, set the backend to mlx instead.
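
Applied to the model from this report, the configuration would look roughly as follows. The repo id here is a placeholder: point it at an actual MLX-format conversion on the Hub (for example from the mlx-community organization), not at a local .gguf file, since the mlx backend resolves the model parameter as a Hugging Face repo id.

name: gpt-oss-20b
backend: mlx
parameters:
  # placeholder: substitute a real MLX conversion repo id
  model: "mlx-community/<some-gpt-oss-20b-mlx-conversion>"
template:
  use_tokenizer_template: true
known_usecases:
- chat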

mudler · Sep 10, 2025

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] · Dec 10, 2025

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] · Dec 15, 2025