
cuBLAS fails on same model - terminate called after throwing an instance of 'std::runtime_error' what(): unexpectedly reached end of file

Open dynamite9999 opened this issue 1 year ago • 5 comments

Hello, I tried to see if anyone else had this issue; the closest I found was #1596, but my situation seems different.

It crashes only with the GPU (cuBLAS) build:

  1. make LLAMA_CUBLAS=1

  2. Makefile edited: NVCCFLAGS = --forward-unknown-to-host-compiler --gpu-architecture=sm_86

  3. ./main -m ./models/ggml-vicuna-13b-1.1-q4_1.bin -p "Building a website can be done in 10 simple steps:" -n 512

which fails with:

    main: build = 0 (unknown)
    main: seed = 1685251133
    llama.cpp: loading model from ./models/ggml-vicuna-13b-1.1-q4_1.bin
    terminate called after throwing an instance of 'std::runtime_error'
      what(): unexpectedly reached end of file
    Aborted (core dumped)

But with everything else the same, make clean followed by a plain make (no GPU) works fine.

Any guidance on how to go about figuring out what is going on here?

thanks

Background

nvidia-smi

    Sun May 28 05:24:26 2023
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 525.85.05    Driver Version: 525.85.05    CUDA Version: 12.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA A40-4Q       On   | 00000000:04:00.0 Off |                    0 |
    | N/A   N/A    P8    N/A /  N/A |      0MiB /  4096MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+

nvcc --version

    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2021 NVIDIA Corporation
    Built on Thu_Nov_18_09:45:30_PST_2021
    Cuda compilation tools, release 11.5, V11.5.119
    Build cuda_11.5.r11.5/compiler.30672275_0

Makefile

cat Makefile

    # Define the default target now so that it is always the first target
    BUILD_TARGETS = main quantize quantize-stats perplexity embedding vdot

    ifdef LLAMA_BUILD_SERVER
        BUILD_TARGETS += server
    endif

    default: $(BUILD_TARGETS)

    ifndef UNAME_S
        UNAME_S := $(shell uname -s)
    endif

    ifndef UNAME_P
        UNAME_P := $(shell uname -p)
    endif

    ifndef UNAME_M
        UNAME_M := $(shell uname -m)
    endif

    CCV := $(shell $(CC) --version | head -n 1)
    CXXV := $(shell $(CXX) --version | head -n 1)

    # Mac OS + Arm can report x86_64
    # ref: https://github.com/ggerganov/whisper.cpp/issues/66#issuecomment-1282546789
    ifeq ($(UNAME_S),Darwin)
        ifneq ($(UNAME_P),arm)
            SYSCTL_M := $(shell sysctl -n hw.optional.arm64 2>/dev/null)
            ifeq ($(SYSCTL_M),1)
                # UNAME_P := arm
                # UNAME_M := arm64
                warn := $(warning Your arch is announced as x86_64, but it seems to actually be ARM64. Not fixing that can lead to bad performance. For more info see: https://github.com/ggerganov/whisper.cpp/issues/66#issuecomment-1282546789)
            endif
        endif
    endif

    # Compile flags
    # keep standard at C11 and C++11
    CFLAGS   = -I. -O3 -std=c11 -fPIC
    CXXFLAGS = -I. -I./examples -O3 -std=c++11 -fPIC
    LDFLAGS  =

    ifndef LLAMA_DEBUG
        CFLAGS   += -DNDEBUG
        CXXFLAGS += -DNDEBUG
    endif

    # warnings
    CFLAGS   += -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith
    CXXFLAGS += -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar

    # OS specific
    # TODO: support Windows
    ifeq ($(UNAME_S),Linux)
        CFLAGS   += -pthread
        CXXFLAGS += -pthread
    endif
    ifeq ($(UNAME_S),Darwin)
        CFLAGS   += -pthread
        CXXFLAGS += -pthread
    endif
    ifeq ($(UNAME_S),FreeBSD)
        CFLAGS   += -pthread
        CXXFLAGS += -pthread
    endif
    ifeq ($(UNAME_S),NetBSD)
        CFLAGS   += -pthread
        CXXFLAGS += -pthread
    endif
    ifeq ($(UNAME_S),OpenBSD)
        CFLAGS   += -pthread
        CXXFLAGS += -pthread
    endif
    ifeq ($(UNAME_S),Haiku)
        CFLAGS   += -pthread
        CXXFLAGS += -pthread
    endif

    ifdef LLAMA_GPROF
        CFLAGS   += -pg
        CXXFLAGS += -pg
    endif
    ifdef LLAMA_PERF
        CFLAGS   += -DGGML_PERF
        CXXFLAGS += -DGGML_PERF
    endif

    # Architecture specific
    # TODO: probably these flags need to be tweaked on some architectures
    #       feel free to update the Makefile for your architecture and send a pull request or issue
    ifeq ($(UNAME_M),$(filter $(UNAME_M),x86_64 i686))
        # Use all CPU extensions that are available:
        CFLAGS   += -march=native -mtune=native
        CXXFLAGS += -march=native -mtune=native

        # Usage AVX-only
        #CFLAGS   += -mfma -mf16c -mavx
        #CXXFLAGS += -mfma -mf16c -mavx
    endif
    ifneq ($(filter ppc64%,$(UNAME_M)),)
        POWER9_M := $(shell grep "POWER9" /proc/cpuinfo)
        ifneq (,$(findstring POWER9,$(POWER9_M)))
            CFLAGS   += -mcpu=power9
            CXXFLAGS += -mcpu=power9
        endif
        # Require c++23's std::byteswap for big-endian support.
        ifeq ($(UNAME_M),ppc64)
            CXXFLAGS += -std=c++23 -DGGML_BIG_ENDIAN
        endif
    endif
    ifndef LLAMA_NO_ACCELERATE
        # Mac M1 - include Accelerate framework.
        # -framework Accelerate works on Mac Intel as well, with negliable performance boost (as of the predict time).
        ifeq ($(UNAME_S),Darwin)
            CFLAGS  += -DGGML_USE_ACCELERATE
            LDFLAGS += -framework Accelerate
        endif
    endif
    ifdef LLAMA_OPENBLAS
        CFLAGS  += -DGGML_USE_OPENBLAS -I/usr/local/include/openblas -I/usr/include/openblas
        ifneq ($(shell grep -e "Arch Linux" -e "ID_LIKE=arch" /etc/os-release 2>/dev/null),)
            LDFLAGS += -lopenblas -lcblas
        else
            LDFLAGS += -lopenblas
        endif
    endif
    ifdef LLAMA_BLIS
        CFLAGS  += -DGGML_USE_OPENBLAS -I/usr/local/include/blis -I/usr/include/blis
        LDFLAGS += -lblis -L/usr/local/lib
    endif
    ifdef LLAMA_CUBLAS
        CFLAGS   += -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I$(CUDA_PATH)/targets/x86_64-linux/include
        CXXFLAGS += -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I$(CUDA_PATH)/targets/x86_64-linux/include
        LDFLAGS  += -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L$(CUDA_PATH)/targets/x86_64-linux/lib
        OBJS     += ggml-cuda.o
        NVCC      = nvcc

        NVCCFLAGS = --forward-unknown-to-host-compiler -arch=native

        NVCCFLAGS = --forward-unknown-to-host-compiler --gpu-architecture=sm_86

        ifdef LLAMA_CUDA_DMMV_X
            NVCCFLAGS += -DGGML_CUDA_DMMV_X=$(LLAMA_CUDA_DMMV_X)
        else
            NVCCFLAGS += -DGGML_CUDA_DMMV_X=32
        endif # LLAMA_CUDA_DMMV_X
        ifdef LLAMA_CUDA_DMMV_Y
            NVCCFLAGS += -DGGML_CUDA_DMMV_Y=$(LLAMA_CUDA_DMMV_Y)
        else
            NVCCFLAGS += -DGGML_CUDA_DMMV_Y=1
        endif # LLAMA_CUDA_DMMV_Y
    ggml-cuda.o: ggml-cuda.cu ggml-cuda.h
        $(NVCC) $(NVCCFLAGS) $(CXXFLAGS) -Wno-pedantic -c $< -o $@
    endif # LLAMA_CUBLAS
    ifdef LLAMA_CLBLAST
        CFLAGS   += -DGGML_USE_CLBLAST
        CXXFLAGS += -DGGML_USE_CLBLAST
        # Mac provides OpenCL as a framework
        ifeq ($(UNAME_S),Darwin)
            LDFLAGS += -lclblast -framework OpenCL
        else
            LDFLAGS += -lclblast -lOpenCL
        endif
        OBJS += ggml-opencl.o
    ggml-opencl.o: ggml-opencl.cpp ggml-opencl.h
        $(CXX) $(CXXFLAGS) -c $< -o $@
    endif
    ifneq ($(filter aarch64%,$(UNAME_M)),)
        # Apple M1, M2, etc.
        # Raspberry Pi 3, 4, Zero 2 (64-bit)
        CFLAGS   += -mcpu=native
        CXXFLAGS += -mcpu=native
    endif
    ifneq ($(filter armv6%,$(UNAME_M)),)
        # Raspberry Pi 1, Zero
        CFLAGS += -mfpu=neon-fp-armv8 -mfp16-format=ieee -mno-unaligned-access
    endif
    ifneq ($(filter armv7%,$(UNAME_M)),)
        # Raspberry Pi 2
        CFLAGS += -mfpu=neon-fp-armv8 -mfp16-format=ieee -mno-unaligned-access -funsafe-math-optimizations
    endif
    ifneq ($(filter armv8%,$(UNAME_M)),)
        # Raspberry Pi 3, 4, Zero 2 (32-bit)
        CFLAGS += -mfp16-format=ieee -mno-unaligned-access
    endif

    # Print build information
    $(info I llama.cpp build info: )
    $(info I UNAME_S: $(UNAME_S))
    $(info I UNAME_P: $(UNAME_P))
    $(info I UNAME_M: $(UNAME_M))
    $(info I CFLAGS: $(CFLAGS))
    $(info I CXXFLAGS: $(CXXFLAGS))
    $(info I LDFLAGS: $(LDFLAGS))
    $(info I CC: $(CCV))
    $(info I CXX: $(CXXV))
    $(info )

    # Build library
    ggml.o: ggml.c ggml.h ggml-cuda.h
        $(CC) $(CFLAGS) -c $< -o $@

    llama.o: llama.cpp ggml.h ggml-cuda.h llama.h llama-util.h
        $(CXX) $(CXXFLAGS) -c $< -o $@

    common.o: examples/common.cpp examples/common.h
        $(CXX) $(CXXFLAGS) -c $< -o $@

    libllama.so: llama.o ggml.o $(OBJS)
        $(CXX) $(CXXFLAGS) -shared -fPIC -o $@ $^ $(LDFLAGS)

    clean:
        rm -vf *.o main quantize quantize-stats perplexity embedding benchmark-matmult save-load-state server vdot build-info.h

    # Examples
    main: examples/main/main.cpp build-info.h ggml.o llama.o common.o $(OBJS)
        $(CXX) $(CXXFLAGS) $(filter-out %.h,$^) -o $@ $(LDFLAGS)
        @echo
        @echo '==== Run ./main -h for help. ===='
        @echo

    quantize: examples/quantize/quantize.cpp build-info.h ggml.o llama.o $(OBJS)
        $(CXX) $(CXXFLAGS) $(filter-out %.h,$^) -o $@ $(LDFLAGS)

    quantize-stats: examples/quantize-stats/quantize-stats.cpp build-info.h ggml.o llama.o $(OBJS)
        $(CXX) $(CXXFLAGS) $(filter-out %.h,$^) -o $@ $(LDFLAGS)

    perplexity: examples/perplexity/perplexity.cpp build-info.h ggml.o llama.o common.o $(OBJS)
        $(CXX) $(CXXFLAGS) $(filter-out %.h,$^) -o $@ $(LDFLAGS)

    embedding: examples/embedding/embedding.cpp build-info.h ggml.o llama.o common.o $(OBJS)
        $(CXX) $(CXXFLAGS) $(filter-out %.h,$^) -o $@ $(LDFLAGS)

    save-load-state: examples/save-load-state/save-load-state.cpp build-info.h ggml.o llama.o common.o $(OBJS)
        $(CXX) $(CXXFLAGS) $(filter-out %.h,$^) -o $@ $(LDFLAGS)

    server: examples/server/server.cpp examples/server/httplib.h examples/server/json.hpp build-info.h ggml.o llama.o common.o $(OBJS)
        $(CXX) $(CXXFLAGS) -Iexamples/server $(filter-out %.h,$(filter-out %.hpp,$^)) -o $@ $(LDFLAGS)

    build-info.h: $(wildcard .git/index) scripts/build-info.sh
        @sh scripts/build-info.sh > $@.tmp
        @if ! cmp -s $@.tmp $@; then \
            mv $@.tmp $@; \
        else \
            rm $@.tmp; \
        fi

    # Tests
    benchmark-matmult: examples/benchmark/benchmark-matmult.cpp build-info.h ggml.o $(OBJS)
        $(CXX) $(CXXFLAGS) $(filter-out %.h,$^) -o $@ $(LDFLAGS)
        ./$@

    vdot: pocs/vdot/vdot.cpp ggml.o $(OBJS)
        $(CXX) $(CXXFLAGS) $^ -o $@ $(LDFLAGS)

    .PHONY: tests clean
    tests:
        bash ./tests/run-tests.sh

dynamite9999 avatar May 28 '23 05:05 dynamite9999

If you got it from the source that starts with an H, the first thing would be to check whether the SHA256 matches the file you downloaded. It's possible the file you have is truncated or corrupt.

llama.cpp mmaps models by default, which I think is probably more tolerant of something like an incomplete model. I bet that if you run the non-GPU build with --no-mmap you'll get an error as well.
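For reference, a quick sketch of both checks (the expected hash has to come from wherever the file was downloaded, and the path below just mirrors the one in the report):

    # Compare against the SHA256 published on the model's download page
    # (no reference hash is shown in this thread, so substitute your own).
    sha256sum ./models/ggml-vicuna-13b-1.1-q4_1.bin

    # Force a full read instead of mmap; a truncated or corrupt file tends
    # to fail loudly this way even on the CPU-only build.
    ./main -m ./models/ggml-vicuna-13b-1.1-q4_1.bin --no-mmap -p "test" -n 16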

KerfuffleV2 avatar May 28 '23 13:05 KerfuffleV2

I have my doubts about the error report. I'd like to see a freshly compiled CPU version that "works fine" with your model. There have been format changes in q4_1 as well as in q8_0, so my guess is that you are using an old model binary that you tested with an old build; now that you have recompiled the GPU version it no longer works because the file is not compatible, and the magic change was not implemented in a way that warns or aborts cleanly.
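One quick way to test the "old binary" theory is to look at the file header. A rough sketch, assuming the pre-GGUF layout of a 4-byte magic followed by a 4-byte file version; the expected values here are from memory, so double-check them against the constants in your checkout's llama.cpp:

    # Dump the first 8 bytes of the model: 4-byte magic + 4-byte version.
    xxd -l 8 ./models/ggml-vicuna-13b-1.1-q4_1.bin

    # A file written by a then-current convert/quantize should show the
    # 'ggjt' magic (it appears as "tjgg" in the ASCII column because it is
    # stored little-endian) and the version the new code expects; an older
    # magic or version means the model predates the breaking quantization
    # changes and needs to be regenerated.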

cmp-nct avatar May 28 '23 14:05 cmp-nct

I also had this error and resolved it with the help of CRD716 on Discord. Essentially, the model is too old. Yes, I mean you from six weeks ago.

asctime avatar May 31 '23 21:05 asctime

I get the same error even on CPU. I picked up earlier code from April and it works fine, but the new code does not. @asctime, which ggml model should we use then? Does q4_0 not work? In that case, the README front page needs to change.

aiaicode avatar Jun 02 '23 08:06 aiaicode

@aiaicode:

CRD716 — 05/31/2023 9:31 AM
Yeah, there was a breaking change to models recently, try a new one.

The latest 7B from IlyaGusev seems to parse OK, but I haven't had time to test the training.
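If you still have the original weights, re-converting and re-quantizing with the current tree is the usual fix. A rough sketch with illustrative paths; check convert.py --help and ./quantize --help in your checkout, since the exact options may differ:

    # Rebuild an f16 GGML file from the original model directory
    # (./models/vicuna-13b/ is a placeholder path).
    python3 convert.py ./models/vicuna-13b/ --outtype f16

    # Re-quantize with the freshly built tool so the file matches the
    # format the new main expects.
    ./quantize ./models/vicuna-13b/ggml-model-f16.bin \
               ./models/vicuna-13b/ggml-model-q4_1.bin q4_1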

asctime avatar Jun 02 '23 10:06 asctime

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Apr 09 '24 01:04 github-actions[bot]