LocalAI
docker container with CUDA12
LocalAI version:
Environment, CPU architecture, OS, and Version: Linux fedora 6.5.6-300.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Oct 6 19:57:21 UTC 2023 x86_64 GNU/Linux
Describe the bug
I'm trying to follow https://localai.io/howtos/easy-model-import-gallery/
I'd like to use CUDA. I installed the toolkit and rebooted:
nvidia-smi
Mon Oct 16 19:05:10 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 2060 ... Off | 00000000:01:00.0 Off | N/A |
| 0% 50C P8 15W / 175W | 0MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
I followed https://localai.io/howtos/easy-setup-docker-gpu/ and recompiled / rebuilt the container, etc.
I get:
stderr CUDA error 35 at /build/go-llama/llama.cpp/ggml-cuda.cu:5522: CUDA driver version is insufficient for CUDA runtime version
Why is that? I compiled everything on this fresh Fedora box. Where is the mismatch?
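A quick way to check whether the host driver / container runtime pairing works at all, independent of LocalAI (a minimal sketch; the exact CUDA base image tag is an assumption, any CUDA 12.x base image should do):
$ docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
# If this also fails with "driver version is insufficient", the problem is in the
# host driver / container runtime setup and not in the LocalAI build.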
To Reproduce
Expected behavior: a working container using CUDA.
Logs
localai-api-1 | I local-ai build info:
localai-api-1 | I BUILD_TYPE: cublas
localai-api-1 | I GO_TAGS:
localai-api-1 | I LD_FLAGS: -X "github.com/go-skynet/LocalAI/internal.Version=8034ed3" -X "github.com/go-skynet/LocalAI/internal.Commit=8034ed3473fb1c8c6f5e3864933c442b377be52e"
localai-api-1 | CGO_LDFLAGS="-lcublas -lcudart -L/usr/local/cuda/lib64/" go build -ldflags "-X "github.com/go-skynet/LocalAI/internal.Version=8034ed3" -X "github.com/go-skynet/LocalAI/internal.Commit=8034ed3473fb1c8c6f5e3864933c442b377be52e"" -tags "" -o local-ai ./
localai-api-1 | 5:02PM INF Starting LocalAI using 4 threads, with models path: /models
localai-api-1 | 5:02PM INF LocalAI version: 8034ed3 (8034ed3473fb1c8c6f5e3864933c442b377be52e)
localai-api-1 | 5:02PM DBG Model: lunademo (config: {PredictionOptions:{Model:luna-ai-llama2-uncensored.Q4_0.gguf Language: N:0 TopP:0.65 TopK:40 Temperature:0.2 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:lunademo F16:true Threads:0 Debug:false Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:lunademo-chat ChatMessage: Completion:lunademo-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:4 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false DraftModel: NDraft:0 Quantization:} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}})
localai-api-1 | 5:02PM DBG Extracting backend assets files to /tmp/localai/backend_data
localai-api-1 |
localai-api-1 | ┌───────────────────────────────────────────────────┐
localai-api-1 | │ Fiber v2.49.2 │
localai-api-1 | │ http://127.0.0.1:8080 │
localai-api-1 | │ (bound on host 0.0.0.0 and port 8080) │
localai-api-1 | │ │
localai-api-1 | │ Handlers ............ 71 Processes ........... 1 │
localai-api-1 | │ Prefork ....... Disabled PID ............. 10497 │
localai-api-1 | └───────────────────────────────────────────────────┘
localai-api-1 |
localai-api-1 | [172.22.0.1]:34580 405 - GET /v1/chat/completions
localai-api-1 | 5:02PM DBG Request received:
localai-api-1 | 5:02PM DBG Configuration read: &{PredictionOptions:{Model:luna-ai-llama2-uncensored.Q4_0.gguf Language: N:0 TopP:0.65 TopK:40 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:lunademo F16:true Threads:4 Debug:true Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:lunademo-chat ChatMessage: Completion:lunademo-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:4 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false DraftModel: NDraft:0 Quantization:} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
localai-api-1 | 5:02PM DBG Parameters: &{PredictionOptions:{Model:luna-ai-llama2-uncensored.Q4_0.gguf Language: N:0 TopP:0.65 TopK:40 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:lunademo F16:true Threads:4 Debug:true Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:lunademo-chat ChatMessage: Completion:lunademo-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:4 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false DraftModel: NDraft:0 Quantization:} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
localai-api-1 | 5:02PM DBG Prompt (before templating): USER: How are you?
localai-api-1 | 5:02PM DBG Template found, input modified to: USER: How are you?
localai-api-1 |
localai-api-1 | ASSISTANT:
localai-api-1 |
localai-api-1 | 5:02PM DBG Prompt (after templating): USER: How are you?
localai-api-1 |
localai-api-1 | ASSISTANT:
localai-api-1 |
localai-api-1 | 5:02PM DBG Loading model llama from luna-ai-llama2-uncensored.Q4_0.gguf
localai-api-1 | 5:02PM DBG Loading model in memory from file: /models/luna-ai-llama2-uncensored.Q4_0.gguf
localai-api-1 | 5:02PM DBG Loading GRPC Model llama: {backendString:llama model:luna-ai-llama2-uncensored.Q4_0.gguf threads:4 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc000102820 externalBackends:map[autogptq:/build/extra/grpc/autogptq/autogptq.py bark:/build/extra/grpc/bark/ttsbark.py diffusers:/build/extra/grpc/diffusers/backend_diffusers.py exllama:/build/extra/grpc/exllama/exllama.py huggingface-embeddings:/build/extra/grpc/huggingface/huggingface.py vall-e-x:/build/extra/grpc/vall-e-x/ttsvalle.py vllm:/build/extra/grpc/vllm/backend_vllm.py] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false}
localai-api-1 | 5:02PM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama
localai-api-1 | 5:02PM DBG GRPC Service for luna-ai-llama2-uncensored.Q4_0.gguf will be running at: '127.0.0.1:33301'
localai-api-1 | 5:02PM DBG GRPC Service state dir: /tmp/go-processmanager3078149800
localai-api-1 | 5:02PM DBG GRPC Service Started
localai-api-1 | rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:33301: connect: connection refused"
localai-api-1 | 5:02PM DBG GRPC(luna-ai-llama2-uncensored.Q4_0.gguf-127.0.0.1:33301): stderr 2023/10/16 17:02:42 gRPC Server listening at 127.0.0.1:33301
localai-api-1 | 5:02PM DBG GRPC Service Ready
localai-api-1 | 5:02PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:luna-ai-llama2-uncensored.Q4_0.gguf ContextSize:2000 Seed:0 NBatch:512 F16Memory:true MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:4 MainGPU: TensorSplit: Threads:4 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/luna-ai-llama2-uncensored.Q4_0.gguf Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 Tokenizer: LoraBase: LoraAdapter: NoMulMatQ:false DraftModel: AudioPath: Quantization:}
localai-api-1 | 5:02PM DBG GRPC(luna-ai-llama2-uncensored.Q4_0.gguf-127.0.0.1:33301): stderr create_gpt_params_cuda: loading model /models/luna-ai-llama2-uncensored.Q4_0.gguf
localai-api-1 | 5:02PM DBG GRPC(luna-ai-llama2-uncensored.Q4_0.gguf-127.0.0.1:33301): stderr
localai-api-1 | 5:02PM DBG GRPC(luna-ai-llama2-uncensored.Q4_0.gguf-127.0.0.1:33301): stderr CUDA error 35 at /build/go-llama/llama.cpp/ggml-cuda.cu:5522: CUDA driver version is insufficient for CUDA runtime version
localai-api-1 | 5:02PM DBG GRPC(luna-ai-llama2-uncensored.Q4_0.gguf-127.0.0.1:33301): stderr current device: 19566432
localai-api-1 | [172.22.0.1]:48336 500 - POST /v1/chat/completions
localai-api-1 | [127.0.0.1]:46348 200 - GET /readyz
Additional context
@stefangweichinger I had the same error, but in a different context. As the LocalAI docker images are not based on the official CUDA images by NVIDIA, you might need to explicitly set the NVIDIA_VISIBLE_DEVICES environment variable when running the container. (You could just add NVIDIA_VISIBLE_DEVICES=all to the .env file.)
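For example (a minimal sketch; the api service name matches the compose files in this thread, the rest is an assumption):
# .env
NVIDIA_VISIBLE_DEVICES=all

# or, equivalently, under the api service in docker-compose.yaml:
services:
  api:
    environment:
      - NVIDIA_VISIBLE_DEVICES=all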
We already set the NVIDIA env at https://github.com/go-skynet/LocalAI/blob/3f3162e57c35605ce520a75df0bfe7ace2f73cad/Dockerfile#L95-L97 in the Dockerfile.
Thanks to @djmaze and @Aisuko .. yes, that variable is in the Dockerfile. So how do I proceed? I assume there are maybe mismatches between the packages from Fedora and NVIDIA? I installed the NVIDIA packages from here.
The link is for Fedora 37 ... nothing is available for F38 or my F39 beta, so maybe the problem comes from that.
I am a newbie with LocalAI and CUDA, so I am only guessing.
I corrected my docker-compose.yml to the one from here and added the mentioned variable to .env (yes, redundant).
I toggled "REBUILD" (btw: how do I keep my rebuilt image once it's OK? Just toggle the variable back to no/false?) and restarted. The rebuild ran through, and I get this:
localai-api-1 | I local-ai build info:
localai-api-1 | I BUILD_TYPE: cublas
localai-api-1 | I GO_TAGS:
localai-api-1 | I LD_FLAGS: -X "github.com/go-skynet/LocalAI/internal.Version=8034ed3" -X "github.com/go-skynet/LocalAI/internal.Commit=8034ed3473fb1c8c6f5e3864933c442b377be52e"
localai-api-1 | CGO_LDFLAGS="-lcublas -lcudart -L/usr/local/cuda/lib64/" go build -ldflags "-X "github.com/go-skynet/LocalAI/internal.Version=8034ed3" -X "github.com/go-skynet/LocalAI/internal.Commit=8034ed3473fb1c8c6f5e3864933c442b377be52e"" -tags "" -o local-ai ./
localai-api-1 | 6:35AM INF Starting LocalAI using 8 threads, with models path: /models
localai-api-1 | 6:35AM INF LocalAI version: 8034ed3 (8034ed3473fb1c8c6f5e3864933c442b377be52e)
localai-api-1 | 6:35AM DBG Model: lunademo (config: {PredictionOptions:{Model:luna-ai-llama2-uncensored.Q4_0.gguf Language: N:0 TopP:0.65 TopK:40 Temperature:0.2 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:lunademo F16:true Threads:0 Debug:false Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:lunademo-chat ChatMessage: Completion:lunademo-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:4 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false DraftModel: NDraft:0 Quantization:} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}})
localai-api-1 | 6:35AM DBG Extracting backend assets files to /tmp/localai/backend_data
localai-api-1 |
localai-api-1 | ┌───────────────────────────────────────────────────┐
localai-api-1 | │ Fiber v2.49.2 │
localai-api-1 | │ http://127.0.0.1:8080 │
localai-api-1 | │ (bound on host 0.0.0.0 and port 8080) │
localai-api-1 | │ │
localai-api-1 | │ Handlers ............ 71 Processes ........... 1 │
localai-api-1 | │ Prefork ....... Disabled PID ............. 10493 │
localai-api-1 | └───────────────────────────────────────────────────┘
localai-api-1 |
localai-api-1 | [127.0.0.1]:50752 200 - GET /readyz
localai-api-1 | 6:36AM DBG Request received:
localai-api-1 | 6:36AM DBG Configuration read: &{PredictionOptions:{Model:luna-ai-llama2-uncensored.Q4_0.gguf Language: N:0 TopP:0.65 TopK:40 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:lunademo F16:true Threads:8 Debug:true Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:lunademo-chat ChatMessage: Completion:lunademo-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:4 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false DraftModel: NDraft:0 Quantization:} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
localai-api-1 | 6:36AM DBG Parameters: &{PredictionOptions:{Model:luna-ai-llama2-uncensored.Q4_0.gguf Language: N:0 TopP:0.65 TopK:40 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:lunademo F16:true Threads:8 Debug:true Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:lunademo-chat ChatMessage: Completion:lunademo-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:4 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false DraftModel: NDraft:0 Quantization:} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
localai-api-1 | 6:36AM DBG Prompt (before templating): USER: How are you?
localai-api-1 | 6:36AM DBG Template found, input modified to: USER: How are you?
localai-api-1 |
localai-api-1 | ASSISTANT:
localai-api-1 |
localai-api-1 | 6:36AM DBG Prompt (after templating): USER: How are you?
localai-api-1 |
localai-api-1 | ASSISTANT:
localai-api-1 |
localai-api-1 | 6:36AM DBG Loading model llama from luna-ai-llama2-uncensored.Q4_0.gguf
localai-api-1 | 6:36AM DBG Loading model in memory from file: /models/luna-ai-llama2-uncensored.Q4_0.gguf
localai-api-1 | 6:36AM DBG Loading GRPC Model llama: {backendString:llama model:luna-ai-llama2-uncensored.Q4_0.gguf threads:8 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc0005d4680 externalBackends:map[autogptq:/build/extra/grpc/autogptq/autogptq.py bark:/build/extra/grpc/bark/ttsbark.py diffusers:/build/extra/grpc/diffusers/backend_diffusers.py exllama:/build/extra/grpc/exllama/exllama.py huggingface-embeddings:/build/extra/grpc/huggingface/huggingface.py vall-e-x:/build/extra/grpc/vall-e-x/ttsvalle.py vllm:/build/extra/grpc/vllm/backend_vllm.py] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false}
localai-api-1 | 6:36AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama
localai-api-1 | 6:36AM DBG GRPC Service for luna-ai-llama2-uncensored.Q4_0.gguf will be running at: '127.0.0.1:37101'
localai-api-1 | 6:36AM DBG GRPC Service state dir: /tmp/go-processmanager834974386
localai-api-1 | 6:36AM DBG GRPC Service Started
localai-api-1 | rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:37101: connect: connection refused"
localai-api-1 | 6:36AM DBG GRPC(luna-ai-llama2-uncensored.Q4_0.gguf-127.0.0.1:37101): stderr 2023/10/17 06:36:28 gRPC Server listening at 127.0.0.1:37101
localai-api-1 | 6:36AM DBG GRPC Service Ready
localai-api-1 | 6:36AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:luna-ai-llama2-uncensored.Q4_0.gguf ContextSize:2000 Seed:0 NBatch:512 F16Memory:true MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:4 MainGPU: TensorSplit: Threads:8 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/luna-ai-llama2-uncensored.Q4_0.gguf Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 Tokenizer: LoraBase: LoraAdapter: NoMulMatQ:false DraftModel: AudioPath: Quantization:}
localai-api-1 | 6:36AM DBG GRPC(luna-ai-llama2-uncensored.Q4_0.gguf-127.0.0.1:37101): stderr create_gpt_params_cuda: loading model /models/luna-ai-llama2-uncensored.Q4_0.gguf
localai-api-1 | 6:36AM DBG GRPC(luna-ai-llama2-uncensored.Q4_0.gguf-127.0.0.1:37101): stderr
localai-api-1 | 6:36AM DBG GRPC(luna-ai-llama2-uncensored.Q4_0.gguf-127.0.0.1:37101): stderr CUDA error 100 at /build/go-llama/llama.cpp/ggml-cuda.cu:5522: no CUDA-capable device is detected
localai-api-1 | 6:36AM DBG GRPC(luna-ai-llama2-uncensored.Q4_0.gguf-127.0.0.1:37101): stderr current device: 19566432
localai-api-1 | [172.22.0.1]:37050 500 - POST /v1/chat/completions
So the build is with cuBLAS, but the log shows "CUDA:false" and "no CUDA-capable device is detected".
While:
$ nvidia-smi
Tue Oct 17 08:35:08 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 2060 ... Off | 00000000:01:00.0 Off | N/A |
| 0% 48C P8 15W / 175W | 0MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
What am I missing? Thanks for any help here.
I wonder if the docker-compose syntax is OK in my case, especially the deploy: section:
version: '3.6'

services:
  api:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    image: quay.io/go-skynet/local-ai:master-cublas-cuda12
    tty: true # enable colorized logs
    restart: always # should this be on-failure ?
    ports:
      - 8080:8080
    env_file:
      - .env
    volumes:
      - ./models:/models
    command: ["/usr/bin/local-ai"]
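One way to check whether the GPU is actually exposed inside the container is to run nvidia-smi from within the api service (a sketch, assuming the compose service is named api as above and that libnvidia-container injects the nvidia-smi binary):
$ docker compose exec api nvidia-smi
# If this fails or lists no GPU, the device reservation / container runtime setup is
# the problem, not the LocalAI build inside the image.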
EDIT: Or is the lunademo model not working with CUDA? As I said, I am guessing ;-)
I now went on to play with examples/chatbot-ui and tried to get that to work with CUDA. That one uses another model etc.; it works, but it runs on the CPU only.
My edited config:
$ cat docker-compose.yaml
version: '3.6'

services:
  api:
    image: quay.io/go-skynet/local-ai:master-cublas-cuda12
    # As initially LocalAI will download the models defined in PRELOAD_MODELS
    # you might need to tweak the healthcheck values here according to your network connection.
    # Here we give a timespan of 20m to download all the required files.
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/readyz"]
      interval: 1m
      timeout: 20m
      retries: 20
    build:
      context: ../../
      dockerfile: Dockerfile
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ports:
      - 8080:8080
    environment:
      - DEBUG=true
      - MODELS_PATH=/models
      # You can preload different models here as well.
      # See: https://github.com/go-skynet/model-gallery
      - 'PRELOAD_MODELS=[{"url": "github:go-skynet/model-gallery/gpt4all-j.yaml", "name": "gpt-3.5-turbo"}]'
    volumes:
      - ./models:/models:cached,Z
    command: ["/usr/bin/local-ai"]

  chatgpt:
    depends_on:
      api:
        condition: service_healthy
    image: ghcr.io/mckaywrigley/chatbot-ui:main
    ports:
      - 3000:3000
    environment:
      - 'OPENAI_API_KEY=sk-XXXXXXXXXXXXXXXXXXXX'
      - 'OPENAI_API_HOST=http://api:8080'
      - 'NVIDIA_VISIBLE_DEVICES=all'
      - 'CUDA_VISIBLE_DEVICES=all'
      - 'CUDA_DEVICE_POOL_GPU_OVERRIDE=1'
We already set the NVIDIA env at
https://github.com/go-skynet/LocalAI/blob/3f3162e57c35605ce520a75df0bfe7ace2f73cad/Dockerfile#L95-L97
in the Dockerfile.

That is correct, but it is only set in the intermediate builder image, not in the final image. (You can also see the final image contents here.)
One could argue that this is a bug.
Thanks @djmaze. I also noticed that I had set the env variables for the chatbot-ui container and not for the api. Switching that, testing ... hmm, no.
I have now:
$ cat docker-compose.yaml
version: '3.6'

services:
  api:
    image: quay.io/go-skynet/local-ai:master-cublas-cuda12
    # As initially LocalAI will download the models defined in PRELOAD_MODELS
    # you might need to tweak the healthcheck values here according to your network connection.
    # Here we give a timespan of 20m to download all the required files.
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/readyz"]
      interval: 1m
      timeout: 20m
      retries: 20
    build:
      context: ../../
      dockerfile: Dockerfile
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ports:
      - 8080:8080
    environment:
      - DEBUG=true
      - MODELS_PATH=/models
      - 'NVIDIA_VISIBLE_DEVICES=all'
      - 'CUDA_VISIBLE_DEVICES=all'
      - 'CUDA_DEVICE_POOL_GPU_OVERRIDE=1'
      # You can preload different models here as well.
      # See: https://github.com/go-skynet/model-gallery
      - 'PRELOAD_MODELS=[{"url": "github:go-skynet/model-gallery/gpt4all-j.yaml", "name": "gpt-3.5-turbo"},
        {"url": "github:go-skynet/model-gallery/llama2-chat.yaml", "name": "llama2-chat"},
        {"url": "github:go-skynet/model-gallery/stablediffusion.yaml", "name": "stablediffusion"}]'
    volumes:
      - ./models:/models:cached,Z
    command: ["/usr/bin/local-ai"]

  chatgpt:
    depends_on:
      api:
        condition: service_healthy
    image: ghcr.io/mckaywrigley/chatbot-ui:main
    ports:
      - 3000:3000
    environment:
      - 'OPENAI_API_KEY=sk-XXXXXXXXXXXXXXXXXXXX'
      - 'OPENAI_API_HOST=http://api:8080'
What do you think?
I also edited /etc/nvidia-container-runtime/config.toml to fix the permissions.
I run the api container in privileged mode now ... still, CUDA isn't used as far as I understand.
I used different docker images:
- quay.io/go-skynet/local-ai master-cublas-cuda12
- quay.io/go-skynet/local-ai v1.18.0-cublas-cuda12-ffmpeg
- quay.io/go-skynet/local-ai latest # yes, this one does not have CUDA
I rebuilt the images.
Still, I don't see any processes in nvidia-smi when I run my LocalAI stack.
I think it is SElinux:
Okt 17 16:10:39 fedora audit[3085]: AVC avc: denied { getattr } for pid=3085 comm="nvidia-smi" path="/dev/nvidiactl" dev="devtmpfs" ino=796 scontext=system_u:system_r:container_t:s0:c566,c905 tcontext=system_u:object_r:xserver_misc_device_t:s0 tclass=chr_file permissive=1
Okt 17 16:10:39 fedora 5840bb46c0e3[1199]: Failed to initialize NVML: Unknown Error
I have it on permissive already. I will see how to fix that.
In other projects I can use CUDA within Docker just fine. Just telling.
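A common workaround for this kind of AVC denial on Fedora is to let containers access host devices, or to disable SELinux labeling for just this service (a hedged sketch; whether it applies here depends on the installed container-selinux policy):
# allow containers to access host devices such as /dev/nvidiactl
$ sudo setsebool -P container_use_devices on

# or, per service, in docker-compose.yaml:
services:
  api:
    security_opt:
      - label=disable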
Just coming back to this project, I'm surprised that CUDA acceleration support is listed as a feature but fails to work with the latest containers. I have reproduced the issue on both the master branch and v1.30...
I'm on Ubuntu 22 LTS and I get the same error:
2:46PM DBG GRPC Service for open-llama-7b-q4_0.bin will be running at: '127.0.0.1:44099'
2:46PM DBG GRPC Service state dir: /tmp/go-processmanager1917593952
2:46PM DBG GRPC Service Started
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:44099: connect: connection refused"
No matter what I do. If I use any model on the CPU it works; however, using any model with GPU support gives me that error. I've tried disabling UFW and restarting and still get the same.
Has anyone made any progress with this error? Just as an FYI, I manage to run Frigate, Plex, and text-webui-AI and they each reach the GPU fine, so I don't think it's anything with my setup.
Using 22.04, this command runs and uses the GPU for me (NVIDIA RTX 3090):
docker run --gpus all --user 1000:1000 -p 5000:8080 -v /mnt/a/ml/LocalAI/models:/models -ti --rm quay.io/go-skynet/local-ai:master-cublas-cuda11-ffmpeg --models-path=/models --context-size=16384 --threads=16 --f16=true --debug=true --single-active-backend=true
Replace:
- the --user argument with your user id (helps if you use model/apply to download models from the gallery)
- the -v argument with your path to the models dir
- the --context-size argument with your desired value
- --threads with your desired value
I figured my error out. It still says that it can't connect, but you can ignore that. My issue was that not all four files were available for the model I was using: the template, the chat, the completion, and so on.
There was another integration that did that; it has the model output the request in JSON, and it used its own integration to take the JSON request and send it to the Home Assistant API.
I wasn't able to edit it to work, however, as it was looking for the OpenAI integration to modify instead of Extended_OpenAI or Custom_OpenAI, but the blog post and integration are listed here if anyone wants to try that as well:
https://blog.teagantotally.rocks/2023/06/05/openai-home-assistant/
I haven't had much of a chance to mess around this week what with Thanksgiving and all, but I'm glad to see more people getting interested in it!
Using 22.04, this command runs and uses the GPU for me (NVIDIA RTX 3090):
docker run --gpus all --user 1000:1000 -p 5000:8080 -v /mnt/a/ml/LocalAI/models:/models -ti --rm quay.io/go-skynet/local-ai:master-cublas-cuda11-ffmpeg --models-path=/models --context-size=16384 --threads=16 --f16=true --debug=true --single-active-backend=true
Replace:
- the --user argument with your user id (helps if you use model/apply to download models from the gallery)
- the -v argument with your path to the models dir
- the --context-size argument with your desired value
- --threads with your desired value
I am getting the below issue if I add --gpus all:
could not select device driver "" with capabilities: [[gpu]].
@thiner Did you install the Nvidia Container Toolkit? It is required to run Docker Containers with Cuda support.
You are right. The issue was caused by a missing GPU driver. I deployed the LocalAI image to a k8s cluster, but didn't realize that the cluster nodes need the driver installed first. The problem has already been solved. Thanks for your reply.
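For anyone hitting could not select device driver "" with capabilities: [[gpu]] on a plain Docker host, a minimal sketch of the toolkit setup (assuming the NVIDIA container toolkit package repository is already configured):
$ sudo apt-get install -y nvidia-container-toolkit   # or the dnf equivalent on Fedora
$ sudo nvidia-ctk runtime configure --runtime=docker
$ sudo systemctl restart docker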
Hello,
I'm having the same issue. The container detects the GPU, but it uses the CPU all the time. These are the logs that show that it detects the GPU:
11:37AM DBG GRPC Service Ready
11:37AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:em_german_mistral_v01.Q4_0.gguf ContextSize:16384 Seed:0 NBatch:512 F16Memory:true MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:0 MainGPU: TensorSplit: Threads:8 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/em_german_mistral_v01.Q4_0.gguf Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type:}
11:37AM DBG GRPC(em_german_mistral_v01.Q4_0.gguf-127.0.0.1:43937): stderr ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
11:37AM DBG GRPC(em_german_mistral_v01.Q4_0.gguf-127.0.0.1:43937): stderr ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
11:37AM DBG GRPC(em_german_mistral_v01.Q4_0.gguf-127.0.0.1:43937): stderr ggml_init_cublas: found 1 CUDA devices:
11:37AM DBG GRPC(em_german_mistral_v01.Q4_0.gguf-127.0.0.1:43937): stderr Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
Both the NVIDIA drivers and nvidia-container-toolkit are installed.
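Note that the load options above show NGPULayers:0, so no layers are offloaded even though the GPU is detected. A hedged sketch of a model YAML that requests offloading (the field names follow the LocalAI model config docs; the file name and layer count are assumptions that depend on the model and available VRAM):
# models/em_german_mistral_v01.yaml
name: em_german_mistral
backend: llama
f16: true
gpu_layers: 35   # number of layers to offload to the GPU
parameters:
  model: em_german_mistral_v01.Q4_0.gguf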
We already set the NVIDIA env at https://github.com/go-skynet/LocalAI/blob/3f3162e57c35605ce520a75df0bfe7ace2f73cad/Dockerfile#L95-L97 in the Dockerfile.

That is correct, but it is only set in the intermediate builder image, not in the final image. (You can also see the final image contents here.)
One could argue that this is a bug.
I agree. After setting NVIDIA_VISIBLE_DEVICES=all in my .env, I am again utilizing my NVIDIA card. Prior to this, using the CUDA All-In-One (AIO) image was failing with:
localai-api-1 | 1:41PM INF GPU device found but no CUDA backend present
I am on a fresh Ubuntu 22.04 install, and after updating nvidia-smi and various NVIDIA drivers I was able to get side-by-side parity with Windows 10 performance.