LocalAI
"error":{"code":500,"message":"rpc error: code = Unknown desc = unimplemented","type":""}}
LocalAI version: Latest
Environment, CPU architecture, OS, and Version: AWS EC2
Describe the bug I get a gRPC connection error when running with the cuda12 image, but it works fine with the vanilla/CPU image. I'm using docker-compose to start the server.
To Reproduce curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{"model": "luna-ai-llama2", "prompt": "A long time ago in a galaxy far, far away","temperature": 0.7}'
Expected behavior I need to run the LLM on a GPU for inference. I tried all the available images, but the same error persists.
Logs
12:08PM INF Trying to load the model 'luna-ai-llama2' with all the available backends: llama-cpp, llama-ggml, gpt4all, bert-embeddings, rwkv, whisper, stablediffusion, tinydream, piper, /build/backend/python/diffusers/run.sh, /build/backend/python/autogptq/run.sh, /build/backend/python/mamba/run.sh, /build/backend/python/vllm/run.sh, /build/backend/python/petals/run.sh, /build/backend/python/transformers/run.sh, /build/backend/python/exllama/run.sh, /build/backend/python/transformers-musicgen/run.sh, /build/backend/python/sentencetransformers/run.sh, /build/backend/python/coqui/run.sh, /build/backend/python/sentencetransformers/run.sh, /build/backend/python/exllama2/run.sh, /build/backend/python/bark/run.sh, /build/backend/python/vall-e-x/run.sh
12:08PM INF [llama-cpp] Attempting to load
12:08PM INF Loading model 'luna-ai-llama2' with backend llama-cpp
12:09PM ERR Failed starting/connecting to the gRPC service: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:37313: connect: connection refused"
12:09PM INF [llama-cpp] Fails: grpc service not ready
12:09PM INF [llama-ggml] Attempting to load
12:09PM INF Loading model 'luna-ai-llama2' with backend llama-ggml
12:09PM INF [llama-ggml] Fails: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
12:09PM INF [gpt4all] Attempting to load
12:09PM INF Loading model 'luna-ai-llama2' with backend gpt4all
12:09PM INF [gpt4all] Fails: could not load model: rpc error: code = Unknown desc = failed loading model
12:09PM INF [bert-embeddings] Attempting to load
12:09PM INF Loading model 'luna-ai-llama2' with backend bert-embeddings
12:09PM INF [bert-embeddings] Fails: could not load model: rpc error: code = Unknown desc = failed loading model
12:09PM INF [rwkv] Attempting to load
12:09PM INF Loading model 'luna-ai-llama2' with backend rwkv
12:09PM INF [rwkv] Fails: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
12:09PM INF [whisper] Attempting to load
12:09PM INF Loading model 'luna-ai-llama2' with backend whisper
12:09PM ERR Failed starting/connecting to the gRPC service: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:35143: connect: connection refused"
12:09PM INF [whisper] Fails: grpc service not ready
12:09PM INF [stablediffusion] Attempting to load
12:09PM INF Loading model 'luna-ai-llama2' with backend
Additional context I think people have faced a similar problem before, but I couldn't find any solution. Kindly let me know if anyone has a workaround!
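For reference, a minimal sketch of the kind of GPU-enabled docker-compose setup described here (the image tag, service name, and paths are assumptions, not the exact file from this report):

```yaml
services:
  api:
    # assumed CUDA 12 image tag; substitute the one you actually pulled
    image: quay.io/go-skynet/local-ai:master-cublas-cuda12-ffmpeg
    ports:
      - "8080:8080"
    environment:
      - MODELS_PATH=/models
    volumes:
      - ./models:/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```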
Hi, I can confirm I'm getting the same issue on master (pulled after the v2.11 cuda cublas12-ffmpeg images became available).
2:46PM DBG Model already loaded in memory: laser-dolphin-mixtral-2x7b-dpo.Q6_K.gguf
2:46PM WRN GRPC Model not responding: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:38737: connect: connection refused"
2:46PM WRN Deleting the process in order to recreate it
2:46PM DBG GRPC Process is not responding: laser-dolphin-mixtral-2x7b-dpo.Q6_K.gguf
2:46PM DBG Stopping all backends except 'laser-dolphin-mixtral-2x7b-dpo.Q6_K.gguf'
2:46PM INF Trying to load the model 'laser-dolphin-mixtral-2x7b-dpo.Q6_K.gguf' with all the available backends: llama-cpp, llama-ggml, gpt4all, bert-embeddings, rwkv, whisper, stablediffusion, tinydream, piper, /build/backend/python/exllama2/run.sh, /build/backend/python/transformers-musicgen/run.sh, /build/backend/python/petals/run.sh, /build/backend/python/coqui/run.sh, /build/backend/python/exllama/run.sh, /build/backend/python/mamba/run.sh, /build/backend/python/vllm/run.sh, /build/backend/python/sentencetransformers/run.sh, /build/backend/python/transformers/run.sh, /build/backend/python/sentencetransformers/run.sh, /build/backend/python/vall-e-x/run.sh, /build/backend/python/autogptq/run.sh, /build/backend/python/bark/run.sh, /build/backend/python/diffusers/run.sh
2:46PM INF [llama-cpp] Attempting to load
2:46PM INF Loading model 'laser-dolphin-mixtral-2x7b-dpo.Q6_K.gguf' with backend llama-cpp
2:46PM DBG Model already loaded in memory: laser-dolphin-mixtral-2x7b-dpo.Q6_K.gguf
2:46PM WRN GRPC Model not responding: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:38737: connect: connection refused"
2:46PM WRN Deleting the process in order to recreate it
2:46PM DBG GRPC Process is not responding: laser-dolphin-mixtral-2x7b-dpo.Q6_K.gguf
I confirm the same issue. It's critical.
Can you please share the logs with DEBUG=true? Also, how are you using the image? With a GPU, I suppose?
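For anyone following along, a sketch of how DEBUG can be enabled through the container environment in docker-compose (the service name is an assumption):

```yaml
services:
  api:
    environment:
      - DEBUG=true
```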
Hello @mudler, I posted some of the logs above; would you like to see more?
@Anto79-ops your log looks incomplete; it seems something failed initially in a way that made the previous calls fail. Can you share the full log from the beginning of the session?
@mudler is it ok if I email/DM you a text file of the logs?
I just pulled the latest master image and the problem is solved (for me, at least).
Thank you!
https://github.com/mudler/LocalAI/issues/1981 is related
You get this error because the llama-cpp backend tries to offload the whole model to the GPU and fails when you don't have enough VRAM.
A workaround might be to offload only part of your model's layers to the GPU.
You need to create a .yaml config file for your model like this:
name: wizard-uncensored-13b
f16: false   # set to true for GPU acceleration
cuda: false  # set to true for GPU acceleration
gpu_layers: 10 # this model has 40 layers max; 15-20 is recommended for a half-load on an NVIDIA 4060 Ti (more layers -- more VRAM required); I guess 0 means no GPU
parameters:
  model: wizard-uncensored-13b.gguf
# backend: diffusers
template:
  chat: &template |
    Instruct: {{.Input}}
    Output:
  # Modify the prompt template above as per your requirements
  completion: *template
You should play around with gpu_layers here and check nvidia-smi.
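For example, you can watch VRAM usage while trying different gpu_layers values (standard NVIDIA tooling, nothing LocalAI-specific):

```sh
# full dashboard, refreshed every second
watch -n 1 nvidia-smi

# or just the memory figures, printed every second
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
```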
I have this error with a custom model NeuralHermes. I have asked for help https://github.com/mudler/LocalAI/discussions/1992
Have you checked that your VRAM is enough to offload all the layers? You can try to split them.
@JackBekket it's running on my preprod server:
NVIDIA L4, 32 cores, 90 GB
The models that come with the distro are running perfectly.
@mudler I have the answer: I had downloaded the raw link file, which is just plain text 🤦 Thanks for your help!
You're welcome! I'm glad you found the issue and managed to resolve it. If you need any further assistance, don't hesitate to reach out. Have a great day!
I'm having a similar issue. Here's the log:
api-1 | 9:50PM DBG Extracting backend assets files to /tmp/localai/backend_data
api-1 | 9:50PM DBG processing api keys runtime update
api-1 | 9:50PM DBG processing external_backends.json
api-1 | 9:50PM DBG external backends loaded from external_backends.json
api-1 | 9:50PM INF core/startup process completed!
api-1 | 9:50PM DBG No configuration file found at /tmp/localai/upload/uploadedFiles.json
api-1 | 9:50PM DBG No configuration file found at /tmp/localai/config/assistants.json
api-1 | 9:50PM DBG No configuration file found at /tmp/localai/config/assistantsFile.json
api-1 | 9:50PM INF LocalAI API is listening! Please connect to the endpoint for API documentation. endpoint=http://0.0.0.0:8080
api-1 | 9:50PM DBG Request received: {"model":"gte-qwen","language":"","translate":false,"n":0,"top_p":null,"top_k":null,"temperature":null,"max_tokens":null,"echo":false,"batch":0,"ignore_eos":false,"repeat_penalty":0,"repeat_last_n":0,"n_keep":0,"frequency_penalty":0,"presence_penalty":0,"tfz":null,"typical_p":null,"seed":null,"negative_prompt":"","rope_freq_base":0,"rope_freq_scale":0,"negative_prompt_scale":0,"use_fast_tokenizer":false,"clip_skip":0,"tokenizer":"","file":"","size":"","prompt":null,"instruction":"","input":"Your text string goes here","stop":null,"messages":null,"functions":null,"function_call":null,"stream":false,"mode":0,"step":0,"grammar":"","grammar_json_functions":null,"grammar_json_name":null,"backend":"","model_base_name":""}
api-1 | 9:50PM DBG guessDefaultsFromFile: not a GGUF file
api-1 | 9:50PM DBG Parameter Config: &{PredictionOptions:{Model:Alibaba-NLP/gte-Qwen2-7B-instruct Language: Translate:false N:0 TopP:0x4000630b90 TopK:0x4000630b68 Temperature:0x4000630a18 Maxtokens:0x4000630fc8 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 RepeatLastN:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0x4000630fc0 TypicalP:0x4000630f08 Seed:0x40006310a0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:gte-qwen F16:0x4000630cb0 Threads:0x4000630cb8 Debug:0x4000585ab0 Roles:map[] Embeddings:0x4000630fe9 Backend:huggingface-embeddings TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions: UseTokenizerTemplate:false JoinChatMessagesByCharacter:<nil>} PromptStrings:[] InputStrings:[Your text string goes here] InputToken:[] functionCallString: functionCallNameString: ResponseFormat: ResponseFormatMap:map[] FunctionsConfig:{DisableNoAction:false GrammarConfig:{ParallelCalls:false DisableParallelNewLines:false MixedMode:false NoMixedFreeString:false NoGrammar:false Prefix: ExpectStringsAfterJSON:false PropOrder:} NoActionFunctionName: NoActionDescriptionName: ResponseRegex:[] JSONRegexMatch:[] ReplaceFunctionResults:[] ReplaceLLMResult:[] CaptureLLMResult:[] FunctionName:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0x4000630f00 MirostatTAU:0x4000630ee8 Mirostat:0x4000630ee0 NGPULayers:0x4000630fe0 MMap:0x4000630a17 MMlock:0x4000630fe9 LowVRAM:0x4000630fe9 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0x4000630c30 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 MMProj: FlashAttention:false NoKVOffloading:false RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} TTSConfig:{Voice: VallE:{AudioPath:}} CUDA:false DownloadFiles:[] Description: Usage:}
api-1 | 9:50PM INF Loading model 'Alibaba-NLP/gte-Qwen2-7B-instruct' with backend huggingface-embeddings
api-1 | 9:50PM DBG Loading model in memory from file: /models/Alibaba-NLP/gte-Qwen2-7B-instruct
api-1 | 9:50PM DBG Loading Model Alibaba-NLP/gte-Qwen2-7B-instruct with gRPC (file: /models/Alibaba-NLP/gte-Qwen2-7B-instruct) (backend: huggingface-embeddings): {backendString:huggingface-embeddings model:Alibaba-NLP/gte-Qwen2-7B-instruct threads:8 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0x4000239b08 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh openvoice:/build/backend/python/openvoice/run.sh parler-tts:/build/backend/python/parler-tts/run.sh petals:/build/backend/python/petals/run.sh rerankers:/build/backend/python/rerankers/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
api-1 | 9:50PM DBG Loading external backend: /build/backend/python/sentencetransformers/run.sh
api-1 | 9:50PM DBG Loading GRPC Process: /build/backend/python/sentencetransformers/run.sh
api-1 | 9:50PM DBG GRPC Service for Alibaba-NLP/gte-Qwen2-7B-instruct will be running at: '127.0.0.1:33329'
api-1 | 9:50PM DBG GRPC Service state dir: /tmp/go-processmanager1272549319
api-1 | 9:50PM DBG GRPC Service Started
api-1 | 9:50PM DBG GRPC(Alibaba-NLP/gte-Qwen2-7B-instruct-127.0.0.1:33329): stdout Initializing libbackend for build
api-1 | 9:50PM DBG GRPC(Alibaba-NLP/gte-Qwen2-7B-instruct-127.0.0.1:33329): stdout virtualenv created
**api-1 | 9:50PM DBG GRPC(Alibaba-NLP/gte-Qwen2-7B-instruct-127.0.0.1:33329): stderr /build/backend/python/sentencetransformers/../common/libbackend.sh: line 78: uv: command not found**
**api-1 | 9:50PM DBG GRPC(Alibaba-NLP/gte-Qwen2-7B-instruct-127.0.0.1:33329): stderr /build/backend/python/sentencetransformers/../common/libbackend.sh: line 83: /build/backend/python/sentencetransformers/venv/bin/activate: No such file or directory**
**api-1 | 9:50PM DBG GRPC(Alibaba-NLP/gte-Qwen2-7B-instruct-127.0.0.1:33329): stderr /build/backend/python/sentencetransformers/../common/libbackend.sh: line 155: exec: python: not found**
api-1 | 9:50PM DBG GRPC(Alibaba-NLP/gte-Qwen2-7B-instruct-127.0.0.1:33329): stdout virtualenv activated
api-1 | 9:50PM DBG GRPC(Alibaba-NLP/gte-Qwen2-7B-instruct-127.0.0.1:33329): stdout activated virtualenv has been ensured
api-1 | 9:51PM ERR failed starting/connecting to the gRPC service error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:33329: connect: connection refused\""
api-1 | 9:51PM DBG GRPC Service NOT ready
api-1 | 9:51PM ERR Server error error="grpc service not ready" ip=192.168.65.1 latency=40.12671406s method=POST status=500 url=/embeddings
I've highlighted the lines that stood out to me. It would be good to have customized model file examples using different backends.
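As a starting point, a minimal model config for the embeddings case in the log above might look like this (a sketch reconstructed from the logged parameters; treat the exact fields as assumptions):

```yaml
name: gte-qwen
backend: huggingface-embeddings  # mapped to backend/python/sentencetransformers/run.sh per the log
embeddings: true
parameters:
  model: Alibaba-NLP/gte-Qwen2-7B-instruct
```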