llama.cpp
Applying lora with CUDA crashes with failed assertion
- [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- Running with CPU only with lora runs fine.
$ ./main --n-predict 25 --model /data/LLaMA/13B/ggml-model-q8_0.bin --prompt "This is a test prompt" --lora /data/LLaMA/loras/llama-13b_test-lora/ggml-adapter-model.bin --lora-base /data/LLaMA/13B/ggml-model-f16.bin
main: build = 669 (9254920)
main: seed = 1686722870
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060
llama.cpp: loading model from /data/LLaMA/13B/ggml-model-q8_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 7 (mostly Q8_0)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 13189.95 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 15237.95 MB (+ 1608.00 MB per state)
llama_model_load_internal: offloading 0 layers to GPU
llama_model_load_internal: total VRAM used: 512 MB
....................................................................................................
llama_init_from_file: kv self size = 400.00 MB
llama_apply_lora_from_file_internal: applying lora adapter from '/data/LLaMA/loras/llama-13b_test-lora/ggml-adapter-model.bin' - please wait ...
llama_apply_lora_from_file_internal: r = 96, alpha = 192, scaling = 2.00
llama_apply_lora_from_file_internal: loading base model from '/data/LLaMA/13B/ggml-model-f16.bin'
llama.cpp: loading model from /data/LLaMA/13B/ggml-model-q8_0.bin
.................... done (64362.93 ms)
system_info: n_threads = 18 / 36 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 25, n_keep = 0

This is a test prompted the voice. It said other things, but I couldn't understand them or remember them later if they were important.

llama_print_timings: load time = 70609.41 ms
llama_print_timings: sample time = 23.21 ms / 25 runs ( 0.93 ms per token)
llama_print_timings: prompt eval time = 688.94 ms / 6 tokens ( 114.82 ms per token)
llama_print_timings: eval time = 6819.37 ms / 24 runs ( 284.14 ms per token)
llama_print_timings: total time = 7542.23 ms
- Running the same command with GPU offload and NO lora works:
./main --n-predict 25 --model /data/LLaMA/13B/ggml-model-q8_0.bin --prompt "This is a test prompt" --n-gpu-layers 30
main: build = 669 (9254920)
main: seed = 1686723899
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060
llama.cpp: loading model from /data/LLaMA/13B/ggml-model-q8_0.bin
[snip]
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 5594.59 MB (+ 1608.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 30 layers to GPU
llama_model_load_internal: total VRAM used: 10156 MB
....................................................................................................
llama_init_from_file: kv self size = 400.00 MB

system_info: n_threads = 18 / 36 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 25, n_keep = 0
This is a test prompt for the title. You are reading "Testing the Title" [end of text]
llama_print_timings: load time = 4321.42 ms
llama_print_timings: sample time = 7.74 ms / 15 runs ( 0.52 ms per token)
llama_print_timings: prompt eval time = 403.46 ms / 6 tokens ( 67.24 ms per token)
llama_print_timings: eval time = 1738.15 ms / 14 runs ( 124.15 ms per token)
llama_print_timings: total time = 2153.10 ms
- Running with the lora AND ANY number of layers offloaded to the GPU crashes with a failed assertion:
$ ./main --n-predict 25 --model /data/LLaMA/13B/ggml-model-q8_0.bin --prompt "This is a test prompt" --lora /data/LLaMA/loras/llama-13b_test-lora/ggml-adapter-model.bin --lora-base /data/LLaMA/13B/ggml-model-f16.bin --n-gpu-layers 1
[snip]
llama_model_load_internal: ggml ctx size = 13189.95 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 14916.51 MB (+ 1608.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 1 layers to GPU
llama_model_load_internal: total VRAM used: 834 MB
....................................................................................................
llama_init_from_file: kv self size = 400.00 MB
llama_apply_lora_from_file_internal: applying lora adapter from '/data/LLaMA/loras/llama-13b_test-lora/ggml-adapter-model.bin' - please wait ...
llama_apply_lora_from_file_internal: r = 96, alpha = 192, scaling = 2.00
llama_apply_lora_from_file_internal: loading base model from '/data/LLaMA/13B/ggml-model-f16.bin'
llama.cpp: loading model from /data/LLaMA/13B/ggml-model-q8_0.bin
...................GGML_ASSERT: ggml.c:14307: tensor->src1 == NULL || tensor->src1->backend == GGML_BACKEND_CPU
Aborted (core dumped)
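For what it's worth, the assertion text suggests the crash happens in ggml's CPU compute path, which refuses to run an op whose source tensor lives on a non-CPU backend. The snippet below is only a minimal model of that check with made-up struct and enum names (not the real ggml code), to illustrate why applying the LoRA update to a weight that has already been offloaded to CUDA would trip it:

```c
/* Minimal model of the failing check -- NOT the real ggml structs, just an
 * illustration. Everything except the assert condition itself is made up. */
#include <assert.h>
#include <stddef.h>

enum backend { BACKEND_CPU, BACKEND_GPU };

struct tensor {
    enum backend   backend; /* where the tensor's data lives     */
    struct tensor *src1;    /* second operand of this op, if any */
};

/* Mirrors the logged assertion:
 *   GGML_ASSERT(tensor->src1 == NULL || tensor->src1->backend == GGML_BACKEND_CPU)
 * i.e. the CPU path only accepts CPU-resident operands. */
static void compute_on_cpu(const struct tensor *t) {
    assert(t->src1 == NULL || t->src1->backend == BACKEND_CPU);
}

int main(void) {
    /* A layer weight that --n-gpu-layers moved to CUDA ...              */
    struct tensor offloaded_weight = { BACKEND_GPU, NULL };
    /* ... used as an operand of the CPU-side LoRA add (W + scaling*B*A). */
    struct tensor lora_add = { BACKEND_CPU, &offloaded_weight };
    compute_on_cpu(&lora_add); /* aborts, matching the crash above */
    return 0;
}
```

That would at least match the observed behaviour: any --n-gpu-layers value above 0 puts at least one weight on the GPU, so the LoRA pass dies no matter how many layers are offloaded.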
Same problem here. I'm on Ubuntu 20.04 using NVIDIA driver version 530.30.02, CUDA version 12.1, and an M40 GPU. As a possible fix I've tried completely wiping the NVIDIA drivers and CUDA from my system, reinstalling them, and compiling llama.cpp again and... no change. It still crashes if I use both a lora and the GPU at the same time. It still works if I use the GPU and no lora, or no GPU and a lora.
Just curious, does it still crash without --lora-base?
> Just curious, does it still crash without --lora-base?
For me at least, yep, I still get the crash if I don't use lora-base.
Weird. I was playing with LoRA earlier today and didn't have that issue (but I was only using cuBLAS for the prompt, not offloading layers). A big pull request that changes the CUDA code was merged a couple of minutes ago. You could try pulling and recompiling to see if it happens to fix your issue.
I'm on it.
Just did a fresh pull, a make clean, and a LLAMA_CUBLAS=1 make. No change with the crash, I'm afraid. But thanks to everyone trying to figure it out!
I looked into the issue and quite frankly I don't think it's worth the effort to fix. Currently the CUDA code runs everything as f32 by default and it would require quite a few changes to get good performance out of GPU-accelerated LoRAs. But if someone wants good performance they'll merge the LoRA anyways. Maybe once I implement better f16 support for something else I'll revisit this.
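For anyone unsure what "merge the LoRA" means here: the adapter is just a low-rank update, and merging folds it into the base weights offline, so the GPU path never has to deal with it at run time. Below is a rough sketch of the arithmetic in plain C, using ordinary float arrays rather than llama.cpp's (possibly quantized) ggml tensors; the scaling = alpha / r formula matches the r = 96, alpha = 192, scaling = 2.00 values in the logs above.

```c
/* Rough sketch of merging a LoRA into base weights: W' = W + (alpha / r) * B * A.
 * Plain row-major float arrays for illustration only -- not the llama.cpp
 * implementation. W is n_out x n_in, A is r x n_in, B is n_out x r. */
static void merge_lora(float *W, const float *A, const float *B,
                       int n_out, int n_in, int r, float alpha) {
    const float scaling = alpha / (float)r; /* e.g. 192 / 96 = 2.00 */
    for (int i = 0; i < n_out; ++i) {
        for (int j = 0; j < n_in; ++j) {
            float delta = 0.0f;
            for (int k = 0; k < r; ++k) {
                delta += B[i * r + k] * A[k * n_in + j];
            }
            W[i * n_in + j] += scaling * delta; /* fold the update into W */
        }
    }
}
```

Once merged, the result can be re-quantized and offloaded like any other model, which sidesteps the f32/f16 question on the CUDA path entirely.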
> I looked into the issue and quite frankly I don't think it's worth the effort to fix. Currently the CUDA code runs everything as f32 by default and it would require quite a few changes to get good performance out of GPU-accelerated LoRAs.
Thank you so much for digging into it! It's a relief just knowing what's going on there. This is the first I'd seen anyone else mention it, and I was really starting to think that I was messing something up somewhere.
Thanks for looking into this and finding the issue.
This issue was closed because it has been inactive for 14 days since being marked as stale.