No GPU GRPC backends work
LocalAI version: v2.7.0-12-g38e4ec0 (38e4ec0b2a00c94bdffe74a8eabb6356aca795be), Docker image
Environment, CPU architecture, OS, and Version: 6.7.3-arch1-1 #1 SMP PREEMPT_DYNAMIC Thu, 01 Feb 2024 10:30:35 +0000 x86_64 GNU/Linux CUDA 12, RTX 4090
Describe the bug: None of the tested GPU-based GRPC backends work.
To Reproduce
- Build the latest image.
- Use vLLM.
- Get the following error:
api-1 | 9:47AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:44245): stderr Traceback (most recent call last):
api-1 | 9:47AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:44245): stderr File "/build/backend/python/vllm/backend_vllm.py", line 13, in <module>
api-1 | 9:47AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:44245): stderr from vllm import LLM, SamplingParams
api-1 | 9:47AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:44245): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/vllm/__init__.py", line 3, in <module>
api-1 | 9:47AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:44245): stderr from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs
api-1 | 9:47AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:44245): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/vllm/engine/arg_utils.py", line 6, in <module>
api-1 | 9:47AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:44245): stderr from vllm.config import (CacheConfig, ModelConfig, ParallelConfig,
api-1 | 9:47AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:44245): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/vllm/config.py", line 9, in <module>
api-1 | 9:47AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:44245): stderr from vllm.utils import get_cpu_memory, is_hip
api-1 | 9:47AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:44245): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/vllm/utils.py", line 11, in <module>
api-1 | 9:47AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:44245): stderr from vllm._C import cuda_utils
api-1 | 9:47AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:44245): stderr ImportError: /opt/conda/envs/transformers/lib/python3.11/site-packages/vllm/_C.cpython-311-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops15to_dtype_layout4callERKNS_6TensorEN3c108optionalINS5_10ScalarTypeEEENS6_INS5_6LayoutEEENS6_INS5_6DeviceEEENS6_IbEEbbNS6_INS5_12MemoryFormatEEE
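Side note on the vLLM failure: an undefined symbol like `_ZN2at4_ops...` on import is, as far as I can tell, the classic sign of a vllm wheel compiled against a different libtorch than the torch package actually installed in the env. A minimal sketch of the check I'd run inside the container (the env path is the one from the traceback; the versions printed are whatever the image ships):

```python
# Sketch of an ABI-mismatch check for the container's conda env.
# "undefined symbol: _ZN2at4_ops..." on "import vllm" typically means
# vllm's compiled _C extension was built against a different libtorch
# than the torch package that is actually installed.
import torch
print("torch:", torch.__version__, "built for CUDA:", torch.version.cuda)

try:
    import vllm
    print("vllm:", vllm.__version__)
except ImportError as err:
    # Reinstalling vllm against the installed torch (or pinning torch
    # to the version the vllm wheel expects) is the usual fix.
    print("vllm import failed:", err)
```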
- Use exllama2.
- Get the following error:
api-1 | 9:45AM DBG GRPC Service for mistral-7b-v0.2.safetensors will be running at: '127.0.0.1:35159'
api-1 | 9:45AM DBG GRPC Service state dir: /tmp/go-processmanager1009413310
api-1 | 9:45AM DBG GRPC Service Started
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr Traceback (most recent call last):
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 2096, in _run_ninja_build
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr subprocess.run(
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr File "/opt/conda/envs/transformers/lib/python3.11/subprocess.py", line 571, in run
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr raise CalledProcessError(retcode, process.args,
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr The above exception was the direct cause of the following exception:
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr Traceback (most recent call last):
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr File "/build/backend/python/exllama2/exllama2_backend.py", line 19, in <module>
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr from exllamav2.generator import (
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr File "/build/backend/python/exllama2/exllamav2/__init__.py", line 3, in <module>
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr from exllamav2.model import ExLlamaV2
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr File "/build/backend/python/exllama2/exllamav2/model.py", line 16, in <module>
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr from exllamav2.config import ExLlamaV2Config
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr File "/build/backend/python/exllama2/exllamav2/config.py", line 2, in <module>
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr from exllamav2.fasttensors import STFile
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr File "/build/backend/python/exllama2/exllamav2/fasttensors.py", line 5, in <module>
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr from exllamav2.ext import exllamav2_ext as ext_c
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr File "/build/backend/python/exllama2/exllamav2/ext.py", line 142, in <module>
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr exllamav2_ext = load \
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr ^^^^^^
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1306, in load
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr return _jit_compile(
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr ^^^^^^^^^^^^^
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1710, in _jit_compile
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr _write_ninja_file_and_build_library(
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1823, in _write_ninja_file_and_build_library
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr _run_ninja_build(
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 2112, in _run_ninja_build
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr raise RuntimeError(message) from e
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr RuntimeError: Error building extension 'exllamav2_ext': [1/28] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output quantize.cuda.o.d -DTORCH_EXTENSION_NAME=exllamav2_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/build/backend/python/exllama2/exllamav2/exllamav2_ext -isystem /opt/conda/envs/transformers/lib/python3.11/site-packages/torch/include -isystem /opt/conda/envs/transformers/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/transformers/lib/python3.11/site-packages/torch/include/TH -isystem /opt/conda/envs/transformers/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/transformers/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 --compiler-options '-fPIC' -lineinfo -O3 -std=c++17 -c /build/backend/python/exllama2/exllamav2/exllamav2_ext/cuda/quantize.cu -o quantize.cuda.o
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr FAILED: quantize.cuda.o
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output quantize.cuda.o.d -DTORCH_EXTENSION_NAME=exllamav2_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/build/backend/python/exllama2/exllamav2/exllamav2_ext -isystem /opt/conda/envs/transformers/lib/python3.11/site-packages/torch/include -isystem /opt/conda/envs/transformers/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/transformers/lib/python3.11/site-packages/torch/include/TH -isystem /opt/conda/envs/transformers/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/transformers/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 --compiler-options '-fPIC' -lineinfo -O3 -std=c++17 -c /build/backend/python/exllama2/exllamav2/exllamav2_ext/cuda/quantize.cu -o quantize.cuda.o
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr /build/backend/python/exllama2/exllamav2/exllamav2_ext/cuda/quantize.cu:3:10: fatal error: curand_kernel.h: No such file or directory
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr 3 | #include <curand_kernel.h>
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr | ^~~~~~~~~~~~~~~~~
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr compilation terminated.
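If it helps with triage: the exllama2 backend JIT-compiles its CUDA extension at load time, so it needs the full CUDA toolkit headers, not just the runtime libraries. A quick sketch to see whether the image actually ships them (assuming the conventional /usr/local/cuda layout):

```python
# Quick sanity check for the CUDA toolkit headers the JIT build needs.
# CUDA_HOME and the /usr/local/cuda layout are assumptions here.
import os

cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda")
header = os.path.join(cuda_home, "include", "curand_kernel.h")
print("CUDA_HOME:", cuda_home)
print("curand_kernel.h present:", os.path.exists(header))
# If this prints False, the image lacks the toolkit headers and any
# torch cpp_extension JIT build (like exllamav2_ext) will fail as above.
```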
Expected behavior: Works without any hiccups.
Logs: Already provided above.
Additional context: None.
Am I missing something here? Running vLLM on its own works with a GPU-enabled Docker setup; see the smoke test below.
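For reference, "on its own" means a minimal smoke test along these lines (the model name is just an example; substitute whatever you have locally), which runs fine in the upstream GPU-enabled vLLM container:

```python
# Minimal vLLM smoke test that works in a stock GPU-enabled container.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
params = SamplingParams(temperature=0.8, max_tokens=32)
outputs = llm.generate(["Say hello in one sentence."], params)
print(outputs[0].outputs[0].text)
```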
EDIT: Running with REBUILD=TRUE crashes the container:
api-1 | I LD_FLAGS: -X "github.com/go-skynet/LocalAI/internal.Version=38e4ec0" -X "github.com/go-skynet/LocalAI/internal.Commit=38e4ec0b2a00c94bdffe74a8eabb6356aca795be"
api-1 | CGO_LDFLAGS="-lcublas -lcudart -L/usr/local/cuda/lib64/" go build -ldflags "-X "github.com/go-skynet/LocalAI/internal.Version=38e4ec0" -X "github.com/go-skynet/LocalAI/internal.Commit=38e4ec0b2a00c94bdffe74a8eabb6356aca795be"" -tags "" -o local-ai ./
api-1 | # encoding/xml
api-1 | /usr/local/go/src/encoding/xml/read.go:322:32: internal compiler error: '(*Decoder).unmarshal': panic during schedule while compiling (*Decoder).unmarshal:
api-1 |
api-1 | runtime error: invalid memory address or nil pointer dereference
api-1 |
api-1 | goroutine 17 [running]:
api-1 | cmd/compile/internal/ssa.Compile.func1()
api-1 | cmd/compile/internal/ssa/compile.go:49 +0x6c
api-1 | panic({0xcee280?, 0x1397c80?})
api-1 | runtime/panic.go:914 +0x21f
api-1 | cmd/compile/internal/ssa.schedule(0xc00204e680)
api-1 | cmd/compile/internal/ssa/schedule.go:249 +0xf5b
api-1 | cmd/compile/internal/ssa.Compile(0xc00204e680)
api-1 | cmd/compile/internal/ssa/compile.go:97 +0x9ab
api-1 | cmd/compile/internal/ssagen.buildssa(0xc000fb6000, 0x2)
api-1 | cmd/compile/internal/ssagen/ssa.go:568 +0x2ae9
api-1 | cmd/compile/internal/ssagen.Compile(0xc000fb6000, 0x0?)
api-1 | cmd/compile/internal/ssagen/pgen.go:187 +0x45
api-1 | cmd/compile/internal/gc.compileFunctions.func5.1(0x0?)
api-1 | cmd/compile/internal/gc/compile.go:184 +0x34
api-1 | cmd/compile/internal/gc.compileFunctions.func3.1()
api-1 | cmd/compile/internal/gc/compile.go:166 +0x30
api-1 | created by cmd/compile/internal/gc.compileFunctions.func3 in goroutine 11
api-1 | cmd/compile/internal/gc/compile.go:165 +0x23a
api-1 |
api-1 |
api-1 |
api-1 | Please file a bug report including a short program that triggers the error.
api-1 | https://go.dev/issue/new
api-1 | make: *** [Makefile:308: build] Error 1
api-1 exited with code 2
Looking at your error logs, we have the following:
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr /build/backend/python/exllama2/exllamav2/exllamav2_ext/cuda/quantize.cu:3:10: fatal error: curand_kernel.h: No such file or directory
api-1 | 9:45AM DBG GRPC(mistral-7b-v0.2.safetensors-127.0.0.1:35159): stderr 3 | #include <curand_kernel.h>
To my knowledge, the standard Docker container doesn't include CUDA. If you go to https://localai.io/advanced/#extra-backends you'll see references to the images that do include CUDA, such as quay.io/go-skynet/local-ai:v2.6.0-cublas-cuda12.
There are many tags available for the image, so it may take a bit of hunting to find the one that fits your use case. There are a lot of things to keep in mind when building your own images; if I were you, I wouldn't bother.
I do have the CUDA-tagged image, and I can use GGUF models with GPU offloading just fine. It's more a problem with the GRPC backends not loading correctly.