llama.cpp
cuBLAS fails on same model - terminate called after throwing an instance of 'std::runtime_error' what(): unexpectedly reached end of file
Hello, I tried to see if anyone else had this issue; the closest was #1596, but that seems different from my situation.
It crashes only with the GPU cuBLAS build:
- make LLAMA_CUBLAS=1
- Makefile edited: NVCCFLAGS = --forward-unknown-to-host-compiler --gpu-architecture=sm_86
- ./main -m ./models/ggml-vicuna-13b-1.1-q4_1.bin -p "Building a website can be done in 10 simple steps:" -n 512

main: build = 0 (unknown)
main: seed = 1685251133
llama.cpp: loading model from ./models/ggml-vicuna-13b-1.1-q4_1.bin
terminate called after throwing an instance of 'std::runtime_error'
  what():  unexpectedly reached end of file
Aborted (core dumped)
But with everything else the same, after a make clean and a plain make (no GPU), it works fine.
Any guidance on how to figure out what is going on here?
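One way to narrow it down (a sketch, not specific to llama.cpp: gdb's catch throw stops where the std::runtime_error is raised, before the abort):

```sh
# Break on the C++ throw to see which read hit the end of the file.
gdb --args ./main -m ./models/ggml-vicuna-13b-1.1-q4_1.bin -p "test" -n 8
(gdb) catch throw
(gdb) run
(gdb) backtrace
```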
thanks
Background
nvidia-smi
Sun May 28 05:24:26 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.05 Driver Version: 525.85.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A40-4Q On | 00000000:04:00.0 Off | 0 |
| N/A N/A P8 N/A / N/A | 0MiB / 4096MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
Makefile
cat Makefile
# Define the default target now so that it is always the first target
BUILD_TARGETS = main quantize quantize-stats perplexity embedding vdot

ifdef LLAMA_BUILD_SERVER
BUILD_TARGETS += server
endif

default: $(BUILD_TARGETS)

ifndef UNAME_S
UNAME_S := $(shell uname -s)
endif

ifndef UNAME_P
UNAME_P := $(shell uname -p)
endif

ifndef UNAME_M
UNAME_M := $(shell uname -m)
endif

CCV := $(shell $(CC) --version | head -n 1)
CXXV := $(shell $(CXX) --version | head -n 1)

# Mac OS + Arm can report x86_64
# ref: https://github.com/ggerganov/whisper.cpp/issues/66#issuecomment-1282546789
ifeq ($(UNAME_S),Darwin)
	ifneq ($(UNAME_P),arm)
		SYSCTL_M := $(shell sysctl -n hw.optional.arm64 2>/dev/null)
		ifeq ($(SYSCTL_M),1)
			# UNAME_P := arm
			# UNAME_M := arm64
			warn := $(warning Your arch is announced as x86_64, but it seems to actually be ARM64. Not fixing that can lead to bad performance. For more info see: https://github.com/ggerganov/whisper.cpp/issues/66#issuecomment-1282546789)
		endif
	endif
endif
# Compile flags
# keep standard at C11 and C++11
CFLAGS   = -I.              -O3 -std=c11   -fPIC
CXXFLAGS = -I. -I./examples -O3 -std=c++11 -fPIC
LDFLAGS  =

ifndef LLAMA_DEBUG
	CFLAGS   += -DNDEBUG
	CXXFLAGS += -DNDEBUG
endif

# warnings
CFLAGS   += -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith
CXXFLAGS += -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar

# OS specific
# TODO: support Windows
ifeq ($(UNAME_S),Linux)
	CFLAGS   += -pthread
	CXXFLAGS += -pthread
endif
ifeq ($(UNAME_S),Darwin)
	CFLAGS   += -pthread
	CXXFLAGS += -pthread
endif
ifeq ($(UNAME_S),FreeBSD)
	CFLAGS   += -pthread
	CXXFLAGS += -pthread
endif
ifeq ($(UNAME_S),NetBSD)
	CFLAGS   += -pthread
	CXXFLAGS += -pthread
endif
ifeq ($(UNAME_S),OpenBSD)
	CFLAGS   += -pthread
	CXXFLAGS += -pthread
endif
ifeq ($(UNAME_S),Haiku)
	CFLAGS   += -pthread
	CXXFLAGS += -pthread
endif

ifdef LLAMA_GPROF
	CFLAGS   += -pg
	CXXFLAGS += -pg
endif
ifdef LLAMA_PERF
	CFLAGS   += -DGGML_PERF
	CXXFLAGS += -DGGML_PERF
endif
# Architecture specific
# TODO: probably these flags need to be tweaked on some architectures
#       feel free to update the Makefile for your architecture and send a pull request or issue
ifeq ($(UNAME_M),$(filter $(UNAME_M),x86_64 i686))
	# Use all CPU extensions that are available:
	CFLAGS   += -march=native -mtune=native
	CXXFLAGS += -march=native -mtune=native

	# Usage AVX-only
	#CFLAGS   += -mfma -mf16c -mavx
	#CXXFLAGS += -mfma -mf16c -mavx
endif
ifneq ($(filter ppc64%,$(UNAME_M)),)
POWER9_M := $(shell grep "POWER9" /proc/cpuinfo)
ifneq (,$(findstring POWER9,$(POWER9_M)))
CFLAGS += -mcpu=power9
CXXFLAGS += -mcpu=power9
endif
# Require c++23's std::byteswap for big-endian support.
ifeq ($(UNAME_M),ppc64)
CXXFLAGS += -std=c++23 -DGGML_BIG_ENDIAN
endif
endif
ifndef LLAMA_NO_ACCELERATE
# Mac M1 - include Accelerate framework.
# -framework Accelerate
# works on Mac Intel as well, with negligible performance boost (as of the predict time).
ifeq ($(UNAME_S),Darwin)
CFLAGS += -DGGML_USE_ACCELERATE
LDFLAGS += -framework Accelerate
endif
endif
ifdef LLAMA_OPENBLAS
CFLAGS += -DGGML_USE_OPENBLAS -I/usr/local/include/openblas -I/usr/include/openblas
ifneq ($(shell grep -e "Arch Linux" -e "ID_LIKE=arch" /etc/os-release 2>/dev/null),)
LDFLAGS += -lopenblas -lcblas
else
LDFLAGS += -lopenblas
endif
endif
ifdef LLAMA_BLIS
CFLAGS += -DGGML_USE_OPENBLAS -I/usr/local/include/blis -I/usr/include/blis
LDFLAGS += -lblis -L/usr/local/lib
endif
ifdef LLAMA_CUBLAS
CFLAGS += -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I$(CUDA_PATH)/targets/x86_64-linux/include
CXXFLAGS += -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I$(CUDA_PATH)/targets/x86_64-linux/include
LDFLAGS += -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L$(CUDA_PATH)/targets/x86_64-linux/lib
OBJS += ggml-cuda.o
NVCC = nvcc
NVCCFLAGS = --forward-unknown-to-host-compiler -arch=native
NVCCFLAGS = --forward-unknown-to-host-compiler --gpu-architecture=sm_86
ifdef LLAMA_CUDA_DMMV_X
	NVCCFLAGS += -DGGML_CUDA_DMMV_X=$(LLAMA_CUDA_DMMV_X)
else
	NVCCFLAGS += -DGGML_CUDA_DMMV_X=32
endif # LLAMA_CUDA_DMMV_X
ifdef LLAMA_CUDA_DMMV_Y
	NVCCFLAGS += -DGGML_CUDA_DMMV_Y=$(LLAMA_CUDA_DMMV_Y)
else
	NVCCFLAGS += -DGGML_CUDA_DMMV_Y=1
endif # LLAMA_CUDA_DMMV_Y
ggml-cuda.o: ggml-cuda.cu ggml-cuda.h
	$(NVCC) $(NVCCFLAGS) $(CXXFLAGS) -Wno-pedantic -c $< -o $@
endif # LLAMA_CUBLAS
ifdef LLAMA_CLBLAST
	CFLAGS   += -DGGML_USE_CLBLAST
	CXXFLAGS += -DGGML_USE_CLBLAST
	# Mac provides OpenCL as a framework
	ifeq ($(UNAME_S),Darwin)
		LDFLAGS += -lclblast -framework OpenCL
	else
		LDFLAGS += -lclblast -lOpenCL
	endif
	OBJS += ggml-opencl.o

ggml-opencl.o: ggml-opencl.cpp ggml-opencl.h
	$(CXX) $(CXXFLAGS) -c $< -o $@
endif
ifneq ($(filter aarch64%,$(UNAME_M)),)
	# Apple M1, M2, etc.
	# Raspberry Pi 3, 4, Zero 2 (64-bit)
	CFLAGS   += -mcpu=native
	CXXFLAGS += -mcpu=native
endif
ifneq ($(filter armv6%,$(UNAME_M)),)
	# Raspberry Pi 1, Zero
	CFLAGS += -mfpu=neon-fp-armv8 -mfp16-format=ieee -mno-unaligned-access
endif
ifneq ($(filter armv7%,$(UNAME_M)),)
	# Raspberry Pi 2
	CFLAGS += -mfpu=neon-fp-armv8 -mfp16-format=ieee -mno-unaligned-access -funsafe-math-optimizations
endif
ifneq ($(filter armv8%,$(UNAME_M)),)
	# Raspberry Pi 3, 4, Zero 2 (32-bit)
	CFLAGS += -mfp16-format=ieee -mno-unaligned-access
endif
# Print build information
$(info I llama.cpp build info: )
$(info I UNAME_S:  $(UNAME_S))
$(info I UNAME_P:  $(UNAME_P))
$(info I UNAME_M:  $(UNAME_M))
$(info I CFLAGS:   $(CFLAGS))
$(info I CXXFLAGS: $(CXXFLAGS))
$(info I LDFLAGS:  $(LDFLAGS))
$(info I CC:       $(CCV))
$(info I CXX:      $(CXXV))
$(info )
# Build library
ggml.o: ggml.c ggml.h ggml-cuda.h
	$(CC)  $(CFLAGS)   -c $< -o $@

llama.o: llama.cpp ggml.h ggml-cuda.h llama.h llama-util.h
	$(CXX) $(CXXFLAGS) -c $< -o $@

common.o: examples/common.cpp examples/common.h
	$(CXX) $(CXXFLAGS) -c $< -o $@

libllama.so: llama.o ggml.o $(OBJS)
	$(CXX) $(CXXFLAGS) -shared -fPIC -o $@ $^ $(LDFLAGS)

clean:
	rm -vf *.o main quantize quantize-stats perplexity embedding benchmark-matmult save-load-state server vdot build-info.h
# Examples
main: examples/main/main.cpp build-info.h ggml.o llama.o common.o $(OBJS)
	$(CXX) $(CXXFLAGS) $(filter-out %.h,$^) -o $@ $(LDFLAGS)
	@echo
	@echo '==== Run ./main -h for help. ===='
	@echo

quantize: examples/quantize/quantize.cpp build-info.h ggml.o llama.o $(OBJS)
	$(CXX) $(CXXFLAGS) $(filter-out %.h,$^) -o $@ $(LDFLAGS)

quantize-stats: examples/quantize-stats/quantize-stats.cpp build-info.h ggml.o llama.o $(OBJS)
	$(CXX) $(CXXFLAGS) $(filter-out %.h,$^) -o $@ $(LDFLAGS)

perplexity: examples/perplexity/perplexity.cpp build-info.h ggml.o llama.o common.o $(OBJS)
	$(CXX) $(CXXFLAGS) $(filter-out %.h,$^) -o $@ $(LDFLAGS)

embedding: examples/embedding/embedding.cpp build-info.h ggml.o llama.o common.o $(OBJS)
	$(CXX) $(CXXFLAGS) $(filter-out %.h,$^) -o $@ $(LDFLAGS)

save-load-state: examples/save-load-state/save-load-state.cpp build-info.h ggml.o llama.o common.o $(OBJS)
	$(CXX) $(CXXFLAGS) $(filter-out %.h,$^) -o $@ $(LDFLAGS)

server: examples/server/server.cpp examples/server/httplib.h examples/server/json.hpp build-info.h ggml.o llama.o common.o $(OBJS)
	$(CXX) $(CXXFLAGS) -Iexamples/server $(filter-out %.h,$(filter-out %.hpp,$^)) -o $@ $(LDFLAGS)
build-info.h: $(wildcard .git/index) scripts/build-info.sh
	@sh scripts/build-info.sh > $@.tmp
	@if ! cmp -s $@.tmp $@; then \
		mv $@.tmp $@; \
	else \
		rm $@.tmp; \
	fi
# Tests
benchmark-matmult: examples/benchmark/benchmark-matmult.cpp build-info.h ggml.o $(OBJS)
	$(CXX) $(CXXFLAGS) $(filter-out %.h,$^) -o $@ $(LDFLAGS)
	./$@

vdot: pocs/vdot/vdot.cpp ggml.o $(OBJS)
	$(CXX) $(CXXFLAGS) $^ -o $@ $(LDFLAGS)

.PHONY: tests clean
tests:
	bash ./tests/run-tests.sh
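For reference, the optional backends above are selected with make variables (a sketch using only the variable names defined in this Makefile):

```sh
# CPU-only build (the case that works):
make clean && make

# cuBLAS build (the failing case); LLAMA_CUDA_DMMV_X/Y are optional
# tuning knobs that default to 32 and 1 as set above:
make clean && make LLAMA_CUBLAS=1

# Other BLAS/OpenCL backends handled by this Makefile:
make LLAMA_OPENBLAS=1
make LLAMA_CLBLAST=1
```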
If you got it from the source whose name starts with an H, the first thing to check is whether the published SHA256 matches the file you downloaded. It's possible the file you have is truncated/corrupt.
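A quick way to check (sketch; compare the output by eye, or with sha256sum -c, against the checksum published on the model page):

```sh
# Compute the local checksum of the model file from the report above.
sha256sum ./models/ggml-vicuna-13b-1.1-q4_1.bin
```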
llama.cpp mmaps models by default, which I think will probably be more tolerant of something like an incomplete model. I bet if you run without GPU and --no-mmap you'll get an error.
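For example (sketch, reusing the model path from the report; --no-mmap is the flag mentioned above):

```sh
# Force a full read of the model instead of mmap; a truncated file should
# then fail the same way even in the CPU-only build.
./main -m ./models/ggml-vicuna-13b-1.1-q4_1.bin --no-mmap -p "test" -n 8
```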
I have my doubts about the error report. I'd like to see a freshly compiled CPU version that "works fine" with your model. There have been changes in q4_1 as well as q8_0, so my guess is that you are using an old model binary that you tested with the old build; now that you have recompiled the GPU version it no longer works because the file is incompatible, and the magic/version changes are not implemented to warn/abort.
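One way to see which file revision you actually have (a sketch; the exact bytes depend on the ggml format revision, e.g. recent files start with a little-endian "ggjt" magic that shows up as "tjgg" in a hex dump, followed by a 4-byte version, while the oldest files use a different magic and no version field):

```sh
# Dump the first 8 bytes of the model: 4-byte magic plus 4-byte format version.
xxd -l 8 ./models/ggml-vicuna-13b-1.1-q4_1.bin
```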
I also had this error and resolved it with the help of CRD716 on Discord. Essentially, the model is too old. Yes, I mean you from six weeks ago.
I get the same error even on CPU. I picked up earlier code from April and it works fine, but the new code does not. @asctime, which ggml model should we use then? q4_0 doesn't work? In that case, the README front page needs to change.
@aiaicode:
CRD716 — 05/31/2023 9:31 AM: "Yeah, there was a breaking change to models recently, try a new one."
The latest 7B from IlyaGusev seems to parse OK, but I haven't had time to test the training.
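If you still have the f16 ggml file the quantized model was made from, re-quantizing with the current build may be enough to get a compatible file (a sketch; the f16 filename here is a placeholder assumption):

```sh
# Rebuild the q4_1 file with the quantize tool from the current tree so it
# matches the format the current loader expects.
./quantize ./models/ggml-vicuna-13b-1.1-f16.bin ./models/ggml-vicuna-13b-1.1-q4_1.bin q4_1
```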
This issue was closed because it has been inactive for 14 days since being marked as stale.