LocalAI
docker container with CUDA12
LocalAI version:
Environment, CPU architecture, OS, and Version: Linux fedora 6.5.6-300.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Oct 6 19:57:21 UTC 2023 x86_64 GNU/Linux
Describe the bug
I'm trying to follow https://localai.io/howtos/easy-model-import-gallery/
I'd like to use CUDA. I installed the toolkit and rebooted:
nvidia-smi
Mon Oct 16 19:05:10 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 2060 ... Off | 00000000:01:00.0 Off | N/A |
| 0% 50C P8 15W / 175W | 0MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
I followed https://localai.io/howtos/easy-setup-docker-gpu/ and recompiled / rebuilt the container, etc.
I get:
stderr CUDA error 35 at /build/go-llama/llama.cpp/ggml-cuda.cu:5522: CUDA driver version is insufficient for CUDA runtime version
Why is that? I compiled everything on this fresh Fedora box. Where is the mismatch?
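A quick way to check whether the host driver / container runtime pairing works at all, independent of LocalAI (a minimal sketch; the exact CUDA base image tag is an assumption, any CUDA 12.x base image should do):
$ docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
# If this also fails with "driver version is insufficient", the problem is in the
# host driver / container runtime setup and not in the LocalAI build.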
To Reproduce
Expected behavior: a working container using CUDA.
Logs
localai-api-1 | I local-ai build info:
localai-api-1 | I BUILD_TYPE: cublas
localai-api-1 | I GO_TAGS:
localai-api-1 | I LD_FLAGS: -X "github.com/go-skynet/LocalAI/internal.Version=8034ed3" -X "github.com/go-skynet/LocalAI/internal.Commit=8034ed3473fb1c8c6f5e3864933c442b377be52e"
localai-api-1 | CGO_LDFLAGS="-lcublas -lcudart -L/usr/local/cuda/lib64/" go build -ldflags "-X "github.com/go-skynet/LocalAI/internal.Version=8034ed3" -X "github.com/go-skynet/LocalAI/internal.Commit=8034ed3473fb1c8c6f5e3864933c442b377be52e"" -tags "" -o local-ai ./
localai-api-1 | 5:02PM INF Starting LocalAI using 4 threads, with models path: /models
localai-api-1 | 5:02PM INF LocalAI version: 8034ed3 (8034ed3473fb1c8c6f5e3864933c442b377be52e)
localai-api-1 | 5:02PM DBG Model: lunademo (config: {PredictionOptions:{Model:luna-ai-llama2-uncensored.Q4_0.gguf Language: N:0 TopP:0.65 TopK:40 Temperature:0.2 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:lunademo F16:true Threads:0 Debug:false Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:lunademo-chat ChatMessage: Completion:lunademo-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:4 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false DraftModel: NDraft:0 Quantization:} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}})
localai-api-1 | 5:02PM DBG Extracting backend assets files to /tmp/localai/backend_data
localai-api-1 |
localai-api-1 | ┌───────────────────────────────────────────────────┐
localai-api-1 | │ Fiber v2.49.2 │
localai-api-1 | │ http://127.0.0.1:8080 │
localai-api-1 | │ (bound on host 0.0.0.0 and port 8080) │
localai-api-1 | │ │
localai-api-1 | │ Handlers ............ 71 Processes ........... 1 │
localai-api-1 | │ Prefork ....... Disabled PID ............. 10497 │
localai-api-1 | └───────────────────────────────────────────────────┘
localai-api-1 |
localai-api-1 | [172.22.0.1]:34580 405 - GET /v1/chat/completions
localai-api-1 | 5:02PM DBG Request received:
localai-api-1 | 5:02PM DBG Configuration read: &{PredictionOptions:{Model:luna-ai-llama2-uncensored.Q4_0.gguf Language: N:0 TopP:0.65 TopK:40 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:lunademo F16:true Threads:4 Debug:true Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:lunademo-chat ChatMessage: Completion:lunademo-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:4 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false DraftModel: NDraft:0 Quantization:} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
localai-api-1 | 5:02PM DBG Parameters: &{PredictionOptions:{Model:luna-ai-llama2-uncensored.Q4_0.gguf Language: N:0 TopP:0.65 TopK:40 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:lunademo F16:true Threads:4 Debug:true Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:lunademo-chat ChatMessage: Completion:lunademo-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:4 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false DraftModel: NDraft:0 Quantization:} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
localai-api-1 | 5:02PM DBG Prompt (before templating): USER: How are you?
localai-api-1 | 5:02PM DBG Template found, input modified to: USER: How are you?
localai-api-1 |
localai-api-1 | ASSISTANT:
localai-api-1 |
localai-api-1 | 5:02PM DBG Prompt (after templating): USER: How are you?
localai-api-1 |
localai-api-1 | ASSISTANT:
localai-api-1 |
localai-api-1 | 5:02PM DBG Loading model llama from luna-ai-llama2-uncensored.Q4_0.gguf
localai-api-1 | 5:02PM DBG Loading model in memory from file: /models/luna-ai-llama2-uncensored.Q4_0.gguf
localai-api-1 | 5:02PM DBG Loading GRPC Model llama: {backendString:llama model:luna-ai-llama2-uncensored.Q4_0.gguf threads:4 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc000102820 externalBackends:map[autogptq:/build/extra/grpc/autogptq/autogptq.py bark:/build/extra/grpc/bark/ttsbark.py diffusers:/build/extra/grpc/diffusers/backend_diffusers.py exllama:/build/extra/grpc/exllama/exllama.py huggingface-embeddings:/build/extra/grpc/huggingface/huggingface.py vall-e-x:/build/extra/grpc/vall-e-x/ttsvalle.py vllm:/build/extra/grpc/vllm/backend_vllm.py] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false}
localai-api-1 | 5:02PM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama
localai-api-1 | 5:02PM DBG GRPC Service for luna-ai-llama2-uncensored.Q4_0.gguf will be running at: '127.0.0.1:33301'
localai-api-1 | 5:02PM DBG GRPC Service state dir: /tmp/go-processmanager3078149800
localai-api-1 | 5:02PM DBG GRPC Service Started
localai-api-1 | rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:33301: connect: connection refused"
localai-api-1 | 5:02PM DBG GRPC(luna-ai-llama2-uncensored.Q4_0.gguf-127.0.0.1:33301): stderr 2023/10/16 17:02:42 gRPC Server listening at 127.0.0.1:33301
localai-api-1 | 5:02PM DBG GRPC Service Ready
localai-api-1 | 5:02PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:luna-ai-llama2-uncensored.Q4_0.gguf ContextSize:2000 Seed:0 NBatch:512 F16Memory:true MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:4 MainGPU: TensorSplit: Threads:4 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/luna-ai-llama2-uncensored.Q4_0.gguf Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 Tokenizer: LoraBase: LoraAdapter: NoMulMatQ:false DraftModel: AudioPath: Quantization:}
localai-api-1 | 5:02PM DBG GRPC(luna-ai-llama2-uncensored.Q4_0.gguf-127.0.0.1:33301): stderr create_gpt_params_cuda: loading model /models/luna-ai-llama2-uncensored.Q4_0.gguf
localai-api-1 | 5:02PM DBG GRPC(luna-ai-llama2-uncensored.Q4_0.gguf-127.0.0.1:33301): stderr
localai-api-1 | 5:02PM DBG GRPC(luna-ai-llama2-uncensored.Q4_0.gguf-127.0.0.1:33301): stderr CUDA error 35 at /build/go-llama/llama.cpp/ggml-cuda.cu:5522: CUDA driver version is insufficient for CUDA runtime version
localai-api-1 | 5:02PM DBG GRPC(luna-ai-llama2-uncensored.Q4_0.gguf-127.0.0.1:33301): stderr current device: 19566432
localai-api-1 | [172.22.0.1]:48336 500 - POST /v1/chat/completions
localai-api-1 | [127.0.0.1]:46348 200 - GET /readyz
Additional context
@stefangweichinger I had the same error, but in a different context. As the LocalAI docker images are not based on the official CUDA images by NVIDIA, you might need to explicitly set the NVIDIA_VISIBLE_DEVICES environment variable when running the container. (You could just add NVIDIA_VISIBLE_DEVICES=all to the .env file.)
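For example (a minimal sketch; the api service name matches the compose files in this thread, the rest is an assumption):
# .env
NVIDIA_VISIBLE_DEVICES=all

# or, equivalently, under the api service in docker-compose.yaml:
services:
  api:
    environment:
      - NVIDIA_VISIBLE_DEVICES=all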
We already set the NVIDIA env at https://github.com/go-skynet/LocalAI/blob/3f3162e57c35605ce520a75df0bfe7ace2f73cad/Dockerfile#L95-L97 in the Dockerfile.
Thanks to @djmaze and @Aisuko .. yes, that variable is in the Dockerfile. So how do I proceed? I assume there are maybe mismatches between the packages from Fedora and NVIDIA? I installed the NVIDIA packages from here.
The link is for Fedora 37 ... nothing is available for F38 or my F39 beta, so maybe the problem comes from that.
I am a newbie with LocalAI and CUDA, so I am only guessing.
I corrected my docker-compose.yml to the one from here and added the mentioned variable to .env (yes, redundant).
I toggled "REBUILD" (btw: how do I keep my rebuilt image once it's OK? Just toggle the variable back to no/false?) and restarted. The rebuild ran through, and I get this:
localai-api-1 | I local-ai build info:
localai-api-1 | I BUILD_TYPE: cublas
localai-api-1 | I GO_TAGS:
localai-api-1 | I LD_FLAGS: -X "github.com/go-skynet/LocalAI/internal.Version=8034ed3" -X "github.com/go-skynet/LocalAI/internal.Commit=8034ed3473fb1c8c6f5e3864933c442b377be52e"
localai-api-1 | CGO_LDFLAGS="-lcublas -lcudart -L/usr/local/cuda/lib64/" go build -ldflags "-X "github.com/go-skynet/LocalAI/internal.Version=8034ed3" -X "github.com/go-skynet/LocalAI/internal.Commit=8034ed3473fb1c8c6f5e3864933c442b377be52e"" -tags "" -o local-ai ./
localai-api-1 | 6:35AM INF Starting LocalAI using 8 threads, with models path: /models
localai-api-1 | 6:35AM INF LocalAI version: 8034ed3 (8034ed3473fb1c8c6f5e3864933c442b377be52e)
localai-api-1 | 6:35AM DBG Model: lunademo (config: {PredictionOptions:{Model:luna-ai-llama2-uncensored.Q4_0.gguf Language: N:0 TopP:0.65 TopK:40 Temperature:0.2 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:lunademo F16:true Threads:0 Debug:false Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:lunademo-chat ChatMessage: Completion:lunademo-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:4 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false DraftModel: NDraft:0 Quantization:} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}})
localai-api-1 | 6:35AM DBG Extracting backend assets files to /tmp/localai/backend_data
localai-api-1 |
localai-api-1 | ┌───────────────────────────────────────────────────┐
localai-api-1 | │ Fiber v2.49.2 │
localai-api-1 | │ http://127.0.0.1:8080 │
localai-api-1 | │ (bound on host 0.0.0.0 and port 8080) │
localai-api-1 | │ │
localai-api-1 | │ Handlers ............ 71 Processes ........... 1 │
localai-api-1 | │ Prefork ....... Disabled PID ............. 10493 │
localai-api-1 | └───────────────────────────────────────────────────┘
localai-api-1 |
localai-api-1 | [127.0.0.1]:50752 200 - GET /readyz
localai-api-1 | 6:36AM DBG Request received:
localai-api-1 | 6:36AM DBG Configuration read: &{PredictionOptions:{Model:luna-ai-llama2-uncensored.Q4_0.gguf Language: N:0 TopP:0.65 TopK:40 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:lunademo F16:true Threads:8 Debug:true Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:lunademo-chat ChatMessage: Completion:lunademo-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:4 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false DraftModel: NDraft:0 Quantization:} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
localai-api-1 | 6:36AM DBG Parameters: &{PredictionOptions:{Model:luna-ai-llama2-uncensored.Q4_0.gguf Language: N:0 TopP:0.65 TopK:40 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:lunademo F16:true Threads:8 Debug:true Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:lunademo-chat ChatMessage: Completion:lunademo-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:4 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false DraftModel: NDraft:0 Quantization:} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
localai-api-1 | 6:36AM DBG Prompt (before templating): USER: How are you?
localai-api-1 | 6:36AM DBG Template found, input modified to: USER: How are you?
localai-api-1 |
localai-api-1 | ASSISTANT:
localai-api-1 |
localai-api-1 | 6:36AM DBG Prompt (after templating): USER: How are you?
localai-api-1 |
localai-api-1 | ASSISTANT:
localai-api-1 |
localai-api-1 | 6:36AM DBG Loading model llama from luna-ai-llama2-uncensored.Q4_0.gguf
localai-api-1 | 6:36AM DBG Loading model in memory from file: /models/luna-ai-llama2-uncensored.Q4_0.gguf
localai-api-1 | 6:36AM DBG Loading GRPC Model llama: {backendString:llama model:luna-ai-llama2-uncensored.Q4_0.gguf threads:8 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc0005d4680 externalBackends:map[autogptq:/build/extra/grpc/autogptq/autogptq.py bark:/build/extra/grpc/bark/ttsbark.py diffusers:/build/extra/grpc/diffusers/backend_diffusers.py exllama:/build/extra/grpc/exllama/exllama.py huggingface-embeddings:/build/extra/grpc/huggingface/huggingface.py vall-e-x:/build/extra/grpc/vall-e-x/ttsvalle.py vllm:/build/extra/grpc/vllm/backend_vllm.py] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false}
localai-api-1 | 6:36AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama
localai-api-1 | 6:36AM DBG GRPC Service for luna-ai-llama2-uncensored.Q4_0.gguf will be running at: '127.0.0.1:37101'
localai-api-1 | 6:36AM DBG GRPC Service state dir: /tmp/go-processmanager834974386
localai-api-1 | 6:36AM DBG GRPC Service Started
localai-api-1 | rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:37101: connect: connection refused"
localai-api-1 | 6:36AM DBG GRPC(luna-ai-llama2-uncensored.Q4_0.gguf-127.0.0.1:37101): stderr 2023/10/17 06:36:28 gRPC Server listening at 127.0.0.1:37101
localai-api-1 | 6:36AM DBG GRPC Service Ready
localai-api-1 | 6:36AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:luna-ai-llama2-uncensored.Q4_0.gguf ContextSize:2000 Seed:0 NBatch:512 F16Memory:true MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:4 MainGPU: TensorSplit: Threads:8 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/luna-ai-llama2-uncensored.Q4_0.gguf Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 Tokenizer: LoraBase: LoraAdapter: NoMulMatQ:false DraftModel: AudioPath: Quantization:}
localai-api-1 | 6:36AM DBG GRPC(luna-ai-llama2-uncensored.Q4_0.gguf-127.0.0.1:37101): stderr create_gpt_params_cuda: loading model /models/luna-ai-llama2-uncensored.Q4_0.gguf
localai-api-1 | 6:36AM DBG GRPC(luna-ai-llama2-uncensored.Q4_0.gguf-127.0.0.1:37101): stderr
localai-api-1 | 6:36AM DBG GRPC(luna-ai-llama2-uncensored.Q4_0.gguf-127.0.0.1:37101): stderr CUDA error 100 at /build/go-llama/llama.cpp/ggml-cuda.cu:5522: no CUDA-capable device is detected
localai-api-1 | 6:36AM DBG GRPC(luna-ai-llama2-uncensored.Q4_0.gguf-127.0.0.1:37101): stderr current device: 19566432
localai-api-1 | [172.22.0.1]:37050 500 - POST /v1/chat/completions
So the build is with cuBLAS, but the log shows "CUDA:false" and "no CUDA-capable device is detected".
While:
$ nvidia-smi
Tue Oct 17 08:35:08 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 2060 ... Off | 00000000:01:00.0 Off | N/A |
| 0% 48C P8 15W / 175W | 0MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
What am I missing? Thanks for any help here.
I wonder if the docker-compose syntax is OK in my case, especially the deploy: section:
version: '3.6'

services:
  api:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    image: quay.io/go-skynet/local-ai:master-cublas-cuda12
    tty: true # enable colorized logs
    restart: always # should this be on-failure ?
    ports:
      - 8080:8080
    env_file:
      - .env
    volumes:
      - ./models:/models
    command: ["/usr/bin/local-ai"]
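One way to check whether the GPU is actually exposed inside the container is to run nvidia-smi from within the api service (a sketch, assuming the compose service is named api as above and that libnvidia-container injects the nvidia-smi binary):
$ docker compose exec api nvidia-smi
# If this fails or lists no GPU, the device reservation / container runtime setup is
# the problem, not the LocalAI build inside the image.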
EDIT: Or is the lunademo model not working with CUDA? As I said, I am guessing ;-)
I now went on to play with examples/chatbot-ui and tried to get that to work with CUDA. That one uses another model etc.; it works, but it runs on the CPU only.
My edited config:
$ cat docker-compose.yaml
version: '3.6'

services:
  api:
    image: quay.io/go-skynet/local-ai:master-cublas-cuda12
    # As initially LocalAI will download the models defined in PRELOAD_MODELS
    # you might need to tweak the healthcheck values here according to your network connection.
    # Here we give a timespan of 20m to download all the required files.
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/readyz"]
      interval: 1m
      timeout: 20m
      retries: 20
    build:
      context: ../../
      dockerfile: Dockerfile
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ports:
      - 8080:8080
    environment:
      - DEBUG=true
      - MODELS_PATH=/models
      # You can preload different models here as well.
      # See: https://github.com/go-skynet/model-gallery
      - 'PRELOAD_MODELS=[{"url": "github:go-skynet/model-gallery/gpt4all-j.yaml", "name": "gpt-3.5-turbo"}]'
    volumes:
      - ./models:/models:cached,Z
    command: ["/usr/bin/local-ai"]

  chatgpt:
    depends_on:
      api:
        condition: service_healthy
    image: ghcr.io/mckaywrigley/chatbot-ui:main
    ports:
      - 3000:3000
    environment:
      - 'OPENAI_API_KEY=sk-XXXXXXXXXXXXXXXXXXXX'
      - 'OPENAI_API_HOST=http://api:8080'
      - 'NVIDIA_VISIBLE_DEVICES=all'
      - 'CUDA_VISIBLE_DEVICES=all'
      - 'CUDA_DEVICE_POOL_GPU_OVERRIDE=1'
We already set the NVIDIA env at
https://github.com/go-skynet/LocalAI/blob/3f3162e57c35605ce520a75df0bfe7ace2f73cad/Dockerfile#L95-L97
in the Dockerfile.

That is correct, but it is only set in the intermediate builder image, not in the final image. (You can also see the final image contents here.)
One could argue that this is a bug.
Thanks @djmaze. I also noticed that I had set the env variables for the chatbot-ui container and not for the api. Switching that, testing ... hmm, no.
I have now:
$ cat docker-compose.yaml
version: '3.6'

services:
  api:
    image: quay.io/go-skynet/local-ai:master-cublas-cuda12
    # As initially LocalAI will download the models defined in PRELOAD_MODELS
    # you might need to tweak the healthcheck values here according to your network connection.
    # Here we give a timespan of 20m to download all the required files.
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/readyz"]
      interval: 1m
      timeout: 20m
      retries: 20
    build:
      context: ../../
      dockerfile: Dockerfile
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ports:
      - 8080:8080
    environment:
      - DEBUG=true
      - MODELS_PATH=/models
      - 'NVIDIA_VISIBLE_DEVICES=all'
      - 'CUDA_VISIBLE_DEVICES=all'
      - 'CUDA_DEVICE_POOL_GPU_OVERRIDE=1'
      # You can preload different models here as well.
      # See: https://github.com/go-skynet/model-gallery
      - 'PRELOAD_MODELS=[{"url": "github:go-skynet/model-gallery/gpt4all-j.yaml", "name": "gpt-3.5-turbo"},
        {"url": "github:go-skynet/model-gallery/llama2-chat.yaml", "name": "llama2-chat"},
        {"url": "github:go-skynet/model-gallery/stablediffusion.yaml", "name": "stablediffusion"}]'
    volumes:
      - ./models:/models:cached,Z
    command: ["/usr/bin/local-ai"]

  chatgpt:
    depends_on:
      api:
        condition: service_healthy
    image: ghcr.io/mckaywrigley/chatbot-ui:main
    ports:
      - 3000:3000
    environment:
      - 'OPENAI_API_KEY=sk-XXXXXXXXXXXXXXXXXXXX'
      - 'OPENAI_API_HOST=http://api:8080'
What do you think?
I also edited /etc/nvidia-container-runtime/config.toml to fix the permissions.
I run the api container in privileged mode now ... still, CUDA isn't used as far as I understand.
I used different docker images:
- quay.io/go-skynet/local-ai master-cublas-cuda12
- quay.io/go-skynet/local-ai v1.18.0-cublas-cuda12-ffmpeg
- quay.io/go-skynet/local-ai latest # yes, this one does not have CUDA
I rebuilt the images.
Still, I don't see any processes in nvidia-smi when I run my LocalAI stack.
I think it is SElinux:
Okt 17 16:10:39 fedora audit[3085]: AVC avc: denied { getattr } for pid=3085 comm="nvidia-smi" path="/dev/nvidiactl" dev="devtmpfs" ino=796 scontext=system_u:system_r:container_t:s0:c566,c905 tcontext=system_u:object_r:xserver_misc_device_t:s0 tclass=chr_file permissive=1
Okt 17 16:10:39 fedora 5840bb46c0e3[1199]: Failed to initialize NVML: Unknown Error
I have it on permissive already. I will see how to fix that.
In other projects I can use CUDA within Docker just fine. Just telling.
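A common workaround for this kind of AVC denial on Fedora is to let containers access host devices, or to disable SELinux labeling for just this service (a hedged sketch; whether it applies here depends on the installed container-selinux policy):
# allow containers to access host devices such as /dev/nvidiactl
$ sudo setsebool -P container_use_devices on

# or, per service, in docker-compose.yaml:
services:
  api:
    security_opt:
      - label=disable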
Just coming back to this project, I'm surprised that CUDA acceleration support is listed as a feature but fails to work with the latest containers. I have reproduced the issue on both the master branch and v1.30...
I'm on Ubuntu 22 LTS and I get the same error:
2:46PM DBG GRPC Service for open-llama-7b-q4_0.bin will be running at: '127.0.0.1:44099'
2:46PM DBG GRPC Service state dir: /tmp/go-processmanager1917593952
2:46PM DBG GRPC Service Started
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:44099: connect: connection refused"
No matter what I do. If I use any model on the CPU it works; however, using any model with GPU support gives me that error. I've tried disabling UFW and restarting and still get the same.
Has anyone made any progress with this error? Just as an FYI, I manage to run Frigate, Plex, and text-webui-AI and they each reach the GPU fine, so I don't think it's anything with my setup.
Using 22.04, this command runs and uses the GPU for me (NVIDIA RTX 3090):
docker run --gpus all --user 1000:1000 -p 5000:8080 -v /mnt/a/ml/LocalAI/models:/models -ti --rm quay.io/go-skynet/local-ai:master-cublas-cuda11-ffmpeg --models-path=/models --context-size=16384 --threads=16 --f16=true --debug=true --single-active-backend=true
Replace:
- the --user argument with your user id (helps if you use model/apply to download models from the gallery)
- the -v argument with your path to the models dir
- the --context-size argument with your desired value
- --threads with your desired value
I figured my error out. It still says that it can't connect, but you can ignore that. My issue was that not all four files were available for the model I was using: the template, the chat, the completion, and so on.
There was another integration that did that; it has the model output the request in JSON, and it used its own integration to take the JSON request and send it to the Home Assistant API.
I wasn't able to edit it to work, however, as it was looking for the OpenAI integration to modify instead of Extended_OpenAI or Custom_OpenAI, but the blog post and integration are listed here if anyone wants to try that as well:
https://blog.teagantotally.rocks/2023/06/05/openai-home-assistant/
I haven't had much of a chance to mess around this week what with Thanksgiving and all, but I'm glad to see more people getting interested in it!
Using 22.04, this command runs and uses the GPU for me (NVIDIA RTX 3090):
docker run --gpus all --user 1000:1000 -p 5000:8080 -v /mnt/a/ml/LocalAI/models:/models -ti --rm quay.io/go-skynet/local-ai:master-cublas-cuda11-ffmpeg --models-path=/models --context-size=16384 --threads=16 --f16=true --debug=true --single-active-backend=true
Replace:
- the --user argument with your user id (helps if you use model/apply to download models from the gallery)
- the -v argument with your path to the models dir
- the --context-size argument with your desired value
- --threads with your desired value
I am getting the below issue if I add --gpus all:
could not select device driver "" with capabilities: [[gpu]].
@thiner Did you install the Nvidia Container Toolkit? It is required to run Docker Containers with Cuda support.
You are right. The issue was caused by a missing GPU driver. I deployed the LocalAI image to a k8s cluster, but didn't realize that the cluster nodes need the driver installed first. The problem has already been solved. Thanks for your reply.
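For anyone hitting could not select device driver "" with capabilities: [[gpu]] on a plain Docker host, a minimal sketch of the toolkit setup (assuming the NVIDIA container toolkit package repository is already configured):
$ sudo apt-get install -y nvidia-container-toolkit   # or the dnf equivalent on Fedora
$ sudo nvidia-ctk runtime configure --runtime=docker
$ sudo systemctl restart docker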
Hello,
I'm having the same issue. The container detects the GPU, but it uses the CPU all the time. These are the logs that show that it detects the GPU:
11:37AM DBG GRPC Service Ready
11:37AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:em_german_mistral_v01.Q4_0.gguf ContextSize:16384 Seed:0 NBatch:512 F16Memory:true MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:0 MainGPU: TensorSplit: Threads:8 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/em_german_mistral_v01.Q4_0.gguf Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type:}
11:37AM DBG GRPC(em_german_mistral_v01.Q4_0.gguf-127.0.0.1:43937): stderr ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
11:37AM DBG GRPC(em_german_mistral_v01.Q4_0.gguf-127.0.0.1:43937): stderr ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
11:37AM DBG GRPC(em_german_mistral_v01.Q4_0.gguf-127.0.0.1:43937): stderr ggml_init_cublas: found 1 CUDA devices:
11:37AM DBG GRPC(em_german_mistral_v01.Q4_0.gguf-127.0.0.1:43937): stderr Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
Both the NVIDIA drivers and nvidia-container-toolkit are installed.
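Note that the load options above show NGPULayers:0, so no layers are offloaded even though the GPU is detected. A hedged sketch of a model YAML that requests offloading (the field names follow the LocalAI model config docs; the file name and layer count are assumptions that depend on the model and available VRAM):
# models/em_german_mistral_v01.yaml
name: em_german_mistral
backend: llama
f16: true
gpu_layers: 35   # number of layers to offload to the GPU
parameters:
  model: em_german_mistral_v01.Q4_0.gguf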
We already set the NVIDIA env at https://github.com/go-skynet/LocalAI/blob/3f3162e57c35605ce520a75df0bfe7ace2f73cad/Dockerfile#L95-L97 in the Dockerfile.

That is correct, but it is only set in the intermediate builder image, not in the final image. (You can also see the final image contents here.)
One could argue that this is a bug.
I agree. After setting NVIDIA_VISIBLE_DEVICES=all in my .env, I am again utilizing my NVIDIA card. Prior to this, using the CUDA All-In-One (AIO) image was failing with:
localai-api-1 | 1:41PM INF GPU device found but no CUDA backend present
I am on a fresh Ubuntu 22.04 install, and after updating nvidia-smi and various NVIDIA drivers I was able to get side-by-side parity with Windows 10 performance.