No GPU GRPC backends work
LocalAI version: v2.7.0-12-g38e4ec0 (38e4ec0b2a00c94bdffe74a8eabb6356aca795be), Docker image
Environment, CPU architecture, OS, and Version: 6.7.3-arch1-1 #1 SMP PREEMPT_DYNAMIC Thu, 01 Feb 2024 10:30:35 +0000 x86_64 GNU/Linux CUDA 12, RTX 4090
Describe the bug: None of the tested GPU-based GRPC backends work.
To Reproduce
- Build the latest image.
- Use vLLM.
- Get the following error:
api-1 | 9:47AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:44245): stderr Traceback (most recent call last):
api-1 | 9:47AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:44245): stderr File "/build/backend/python/vllm/backend_vllm.py", line 13, in <module>
api-1 | 9:47AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:44245): stderr from vllm import LLM, SamplingParams
api-1 | 9:47AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:44245): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/vllm/__init__.py", line 3, in <module>
api-1 | 9:47AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:44245): stderr from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs
api-1 | 9:47AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:44245): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/vllm/engine/arg_utils.py", line 6, in <module>
api-1 | 9:47AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:44245): stderr from vllm.config import (CacheConfig, ModelConfig, ParallelConfig,
api-1 | 9:47AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:44245): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/vllm/config.py", line 9, in <module>
api-1 | 9:47AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:44245): stderr from vllm.utils import get_cpu_memory, is_hip
api-1 | 9:47AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:44245): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/vllm/utils.py", line 11, in <module>
api-1 | 9:47AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:44245): stderr from vllm._C import cuda_utils
api-1 | 9:47AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:44245): stderr ImportError: /opt/conda/envs/transformers/lib/python3.11/site-packages/vllm/_C.cpython-311-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops15to_dtype_layout4callERKNS_6TensorEN3c108optionalINS5_10ScalarTypeEEENS6_INS5_6LayoutEEENS6_INS5_6DeviceEEENS6_IbEEbbNS6_INS5_12MemoryFormatEEE
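Side note on the vLLM failure: an undefined symbol like `_ZN2at4_ops...` on import is, as far as I can tell, the classic sign of a vllm wheel compiled against a different libtorch than the torch package actually installed in the env. A minimal sketch of the check I'd run inside the container (the env path is the one from the traceback; the versions printed are whatever the image ships):

```python
# Sketch of an ABI-mismatch check for the container's conda env.
# "undefined symbol: _ZN2at4_ops..." on "import vllm" typically means
# vllm's compiled _C extension was built against a different libtorch
# than the torch package that is actually installed.
import torch
print("torch:", torch.__version__, "built for CUDA:", torch.version.cuda)

try:
    import vllm
    print("vllm:", vllm.__version__)
except ImportError as err:
    # Reinstalling vllm against the installed torch (or pinning torch
    # to the version the vllm wheel expects) is the usual fix.
    print("vllm import failed:", err)
```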
- Use exllama2.
- Get the following error:
api-1 | 9:45AM DBG GRPC Service for mistral-7b-v0.2.safetensors will be running at: '127.0.0.1:35159'
api-1 | 9:45AM DBG GRPC Service state dir: /tmp/go-processmanager1009413310
api-1 | 9:45AM DBG GRPC Service Started
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr Traceback (most recent call last):
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 2096, in _run_ninja_build
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr subprocess.run(
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr File "/opt/conda/envs/transformers/lib/python3.11/subprocess.py", line 571, in run
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr raise CalledProcessError(retcode, process.args,
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr The above exception was the direct cause of the following exception:
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr Traceback (most recent call last):
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr File "/build/backend/python/exllama2/exllama2_backend.py", line 19, in <module>
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr from exllamav2.generator import (
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr File "/build/backend/python/exllama2/exllamav2/__init__.py", line 3, in <module>
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr from exllamav2.model import ExLlamaV2
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr File "/build/backend/python/exllama2/exllamav2/model.py", line 16, in <module>
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr from exllamav2.config import ExLlamaV2Config
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr File "/build/backend/python/exllama2/exllamav2/config.py", line 2, in <module>
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr from exllamav2.fasttensors import STFile
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr File "/build/backend/python/exllama2/exllamav2/fasttensors.py", line 5, in <module>
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr from exllamav2.ext import exllamav2_ext as ext_c
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr File "/build/backend/python/exllama2/exllamav2/ext.py", line 142, in <module>
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr exllamav2_ext = load \
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr ^^^^^^
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1306, in load
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr return _jit_compile(
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr ^^^^^^^^^^^^^
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1710, in _jit_compile
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr _write_ninja_file_and_build_library(
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1823, in _write_ninja_file_and_build_library
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr _run_ninja_build(
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 2112, in _run_ninja_build
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr raise RuntimeError(message) from e
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr RuntimeError: Error building extension 'exllamav2_ext': [1/28] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output quantize.cuda.o.d -DTORCH_EXTENSION_NAME=exllamav2_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/build/backend/python/exllama2/exllamav2/exllamav2_ext -isystem /opt/conda/envs/transformers/lib/python3.11/site-packages/torch/include -isystem /opt/conda/envs/transformers/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/transformers/lib/python3.11/site-packages/torch/include/TH -isystem /opt/conda/envs/transformers/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/transformers/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 --compiler-options '-fPIC' -lineinfo -O3 -std=c++17 -c /build/backend/python/exllama2/exllamav2/exllamav2_ext/cuda/quantize.cu -o quantize.cuda.o
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr FAILED: quantize.cuda.o
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output quantize.cuda.o.d -DTORCH_EXTENSION_NAME=exllamav2_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/build/backend/python/exllama2/exllamav2/exllamav2_ext -isystem /opt/conda/envs/transformers/lib/python3.11/site-packages/torch/include -isystem /opt/conda/envs/transformers/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/transformers/lib/python3.11/site-packages/torch/include/TH -isystem /opt/conda/envs/transformers/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/transformers/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 --compiler-options '-fPIC' -lineinfo -O3 -std=c++17 -c /build/backend/python/exllama2/exllamav2/exllamav2_ext/cuda/quantize.cu -o quantize.cuda.o
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr /build/backend/python/exllama2/exllamav2/exllamav2_ext/cuda/quantize.cu:3:10: fatal error: curand_kernel.h: No such file or directory
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr 3 | #include <curand_kernel.h>
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr | ^~~~~~~~~~~~~~~~~
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr compilation terminated.
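If it helps with triage: the exllama2 backend JIT-compiles its CUDA extension at load time, so it needs the full CUDA toolkit headers, not just the runtime libraries. A quick sketch to see whether the image actually ships them (assuming the conventional /usr/local/cuda layout):

```python
# Quick sanity check for the CUDA toolkit headers the JIT build needs.
# CUDA_HOME and the /usr/local/cuda layout are assumptions here.
import os

cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda")
header = os.path.join(cuda_home, "include", "curand_kernel.h")
print("CUDA_HOME:", cuda_home)
print("curand_kernel.h present:", os.path.exists(header))
# If this prints False, the image lacks the toolkit headers and any
# torch cpp_extension JIT build (like exllamav2_ext) will fail as above.
```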
Expected behavior: Works without any hiccups.
Logs: Already provided above.
Additional context: None.
Am I missing something here? Running vLLM on its own works with a GPU-enabled Docker setup; see the smoke test below.
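For reference, "on its own" means a minimal smoke test along these lines (the model name is just an example; substitute whatever you have locally), which runs fine in the upstream GPU-enabled vLLM container:

```python
# Minimal vLLM smoke test that works in a stock GPU-enabled container.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
params = SamplingParams(temperature=0.8, max_tokens=32)
outputs = llm.generate(["Say hello in one sentence."], params)
print(outputs[0].outputs[0].text)
```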
EDIT: Running with REBUILD=TRUE crashes the container:
api-1 | I LD_FLAGS: -X "github.com/go-skynet/LocalAI/internal.Version=38e4ec0" -X "github.com/go-skynet/LocalAI/internal.Commit=38e4ec0b2a00c94bdffe74a8eabb6356aca795be"
api-1 | CGO_LDFLAGS="-lcublas -lcudart -L/usr/local/cuda/lib64/" go build -ldflags "-X "github.com/go-skynet/LocalAI/internal.Version=38e4ec0" -X "github.com/go-skynet/LocalAI/internal.Commit=38e4ec0b2a00c94bdffe74a8eabb6356aca795be"" -tags "" -o local-ai ./
api-1 | # encoding/xml
api-1 | /usr/local/go/src/encoding/xml/read.go:322:32: internal compiler error: '(*Decoder).unmarshal': panic during schedule while compiling (*Decoder).unmarshal:
api-1 |
api-1 | runtime error: invalid memory address or nil pointer dereference
api-1 |
api-1 | goroutine 17 [running]:
api-1 | cmd/compile/internal/ssa.Compile.func1()
api-1 | cmd/compile/internal/ssa/compile.go:49 +0x6c
api-1 | panic({0xcee280?, 0x1397c80?})
api-1 | runtime/panic.go:914 +0x21f
api-1 | cmd/compile/internal/ssa.schedule(0xc00204e680)
api-1 | cmd/compile/internal/ssa/schedule.go:249 +0xf5b
api-1 | cmd/compile/internal/ssa.Compile(0xc00204e680)
api-1 | cmd/compile/internal/ssa/compile.go:97 +0x9ab
api-1 | cmd/compile/internal/ssagen.buildssa(0xc000fb6000, 0x2)
api-1 | cmd/compile/internal/ssagen/ssa.go:568 +0x2ae9
api-1 | cmd/compile/internal/ssagen.Compile(0xc000fb6000, 0x0?)
api-1 | cmd/compile/internal/ssagen/pgen.go:187 +0x45
api-1 | cmd/compile/internal/gc.compileFunctions.func5.1(0x0?)
api-1 | cmd/compile/internal/gc/compile.go:184 +0x34
api-1 | cmd/compile/internal/gc.compileFunctions.func3.1()
api-1 | cmd/compile/internal/gc/compile.go:166 +0x30
api-1 | created by cmd/compile/internal/gc.compileFunctions.func3 in goroutine 11
api-1 | cmd/compile/internal/gc/compile.go:165 +0x23a
api-1 |
api-1 |
api-1 |
api-1 | Please file a bug report including a short program that triggers the error.
api-1 | https://go.dev/issue/new
api-1 | make: *** [Makefile:308: build] Error 1
api-1 exited with code 2
Looking at your error logs, we have the following:
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr /build/backend/python/exllama2/exllamav2/exllamav2_ext/cuda/quantize.cu:3:10: fatal error: curand_kernel.h: No such file or directory
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr 3 | #include <curand_kernel.h>
To my knowledge, the standard Docker container doesn't include CUDA. If you go to https://localai.io/advanced/#extra-backends you'll see references to the images that do include CUDA, such as quay.io/go-skynet/local-ai:v2.6.0-cublas-cuda12.
There are many tags available for the image, so it may take a bit of hunting to find the one that fits your use case. There are a lot of things to keep in mind when building your own images; if I were you, I wouldn't bother.
I do have the CUDA-tagged image, and I can use GGUF models with GPU offloading just fine. It's more a problem with the GRPC backends not loading correctly.