
Applying lora with CUDA crashes with failed assertion

Open d-takemori opened this issue 1 year ago • 9 comments

  • [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.

  • Running CPU-only with the LoRA works fine:

$ ./main --n-predict 25 --model /data/LLaMA/13B/ggml-model-q8_0.bin --prompt "This is a test prompt" --lora /data/LLaMA/loras/llama-13b_test-lora/ggml-adapter-model.bin --lora-base /data/LLaMA/13B/ggml-model-f16.bin
main: build = 669 (9254920)
main: seed = 1686722870
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060
llama.cpp: loading model from /data/LLaMA/13B/ggml-model-q8_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 7 (mostly Q8_0)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 13189.95 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 15237.95 MB (+ 1608.00 MB per state)
llama_model_load_internal: offloading 0 layers to GPU
llama_model_load_internal: total VRAM used: 512 MB
....................................................................................................
llama_init_from_file: kv self size = 400.00 MB
llama_apply_lora_from_file_internal: applying lora adapter from '/data/LLaMA/loras/llama-13b_test-lora/ggml-adapter-model.bin' - please wait ...
llama_apply_lora_from_file_internal: r = 96, alpha = 192, scaling = 2.00
llama_apply_lora_from_file_internal: loading base model from '/data/LLaMA/13B/ggml-model-f16.bin'
llama.cpp: loading model from /data/LLaMA/13B/ggml-model-q8_0.bin
.................... done (64362.93 ms)

system_info: n_threads = 18 / 36 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 25, n_keep = 0

This is a test prompted the voice. It said other things, but I couldn't understand them or remember them later if they were important.

llama_print_timings: load time = 70609.41 ms
llama_print_timings: sample time = 23.21 ms / 25 runs ( 0.93 ms per token)
llama_print_timings: prompt eval time = 688.94 ms / 6 tokens ( 114.82 ms per token)
llama_print_timings: eval time = 6819.37 ms / 24 runs ( 284.14 ms per token)
llama_print_timings: total time = 7542.23 ms

  • Running the same command with GPU offload and NO LoRA works:

./main --n-predict 25 --model /data/LLaMA/13B/ggml-model-q8_0.bin --prompt "This is a test prompt" --n-gpu-layers 30
main: build = 669 (9254920)
main: seed = 1686723899
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060
llama.cpp: loading model from /data/LLaMA/13B/ggml-model-q8_0.bin
[snip]
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 5594.59 MB (+ 1608.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 30 layers to GPU
llama_model_load_internal: total VRAM used: 10156 MB
....................................................................................................
llama_init_from_file: kv self size = 400.00 MB

system_info: n_threads = 18 / 36 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 25, n_keep = 0

This is a test prompt for the title. You are reading "Testing the Title" [end of text]

llama_print_timings: load time = 4321.42 ms
llama_print_timings: sample time = 7.74 ms / 15 runs ( 0.52 ms per token)
llama_print_timings: prompt eval time = 403.46 ms / 6 tokens ( 67.24 ms per token)
llama_print_timings: eval time = 1738.15 ms / 14 runs ( 124.15 ms per token)
llama_print_timings: total time = 2153.10 ms

  • Running with the LoRA AND ANY number of layers offloaded to the GPU crashes with a failed assertion (see the sketch after the log below):

$ ./main --n-predict 25 --model /data/LLaMA/13B/ggml-model-q8_0.bin --prompt "This is a test prompt" --lora /data/LLaMA/loras/llama-13b_test-lora/ggml-adapter-model.bin --lora-base /data/LLaMA/13B/ggml-model-f16.bin --n-gpu-layers 1
[snip]
llama_model_load_internal: ggml ctx size = 13189.95 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 14916.51 MB (+ 1608.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 1 layers to GPU
llama_model_load_internal: total VRAM used: 834 MB
....................................................................................................
llama_init_from_file: kv self size = 400.00 MB
llama_apply_lora_from_file_internal: applying lora adapter from '/data/LLaMA/loras/llama-13b_test-lora/ggml-adapter-model.bin' - please wait ...
llama_apply_lora_from_file_internal: r = 96, alpha = 192, scaling = 2.00
llama_apply_lora_from_file_internal: loading base model from '/data/LLaMA/13B/ggml-model-f16.bin'
llama.cpp: loading model from /data/LLaMA/13B/ggml-model-q8_0.bin
...................GGML_ASSERT: ggml.c:14307: tensor->src1 == NULL || tensor->src1->backend == GGML_BACKEND_CPU
Aborted (core dumped)
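A rough, hedged sketch of what that assertion appears to be guarding, assuming the standard LoRA formulation W' = W + scaling * (B x A) with scaling = alpha / r (192 / 96 = 2.00 in the log above). The function and variable names below (apply_lora_sketch, model_w, lora_a, lora_b) are placeholders for illustration, not llama.cpp's actual identifiers:

#include "ggml.h"

// Illustrative only: applying a LoRA builds a small ggml graph per affected
// weight tensor. The adapter matrices A and B are loaded as CPU tensors, but
// with --n-gpu-layers > 0 the destination weight may already live on the GPU,
// so the add mixes backends and the src1 backend check (the GGML_ASSERT at
// ggml.c:14307 in the log) fires.
static struct ggml_tensor * apply_lora_sketch(
        struct ggml_context * ctx,
        struct ggml_tensor  * model_w,   // base weight, possibly resident on the GPU
        struct ggml_tensor  * lora_a,    // adapter matrix A, on the CPU
        struct ggml_tensor  * lora_b,    // adapter matrix B, on the CPU
        float                 scaling) { // alpha / r, i.e. 192 / 96 = 2.00 here
    struct ggml_tensor * BA = ggml_mul_mat(ctx, lora_a, lora_b);      // low-rank product
    BA = ggml_scale_inplace(ctx, BA, ggml_new_f32(ctx, scaling));     // scale the update
    return ggml_add_inplace(ctx, model_w, BA);                        // W + scaled update
}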

d-takemori avatar Jun 14 '23 06:06 d-takemori

Same problem here. I'm on Ubuntu 20.04 with NVIDIA driver version 530.30.02, CUDA version 12.1, and an M40 GPU. As a possible fix I've tried completely wiping the NVIDIA drivers and CUDA from my system, reinstalling them, and compiling llama.cpp again, and... no change. It still crashes if I use both a LoRA and the GPU at the same time. It still works if I use the GPU and no LoRA, or no GPU and a LoRA.

EmerJK avatar Jun 14 '23 15:06 EmerJK

Just curious, does it still crash without --lora-base?

KerfuffleV2 avatar Jun 14 '23 17:06 KerfuffleV2

Just curious, does it still crash without --lora-base?

For me at least, yep, I still get the crash if I don't use lora-base.

EmerJK avatar Jun 14 '23 17:06 EmerJK

Weird. I was playing with LoRA earlier today and didn't have that issue (but I was only using cuBLAS for the prompt, not offloading layers). A big pull request that changes the CUDA code just got merged a couple of minutes ago. You could try pulling and recompiling to see if it happens to fix your issue.

KerfuffleV2 avatar Jun 14 '23 18:06 KerfuffleV2

I'm on it.

JohannesGaessler avatar Jun 14 '23 18:06 JohannesGaessler

Just did a fresh pull, a make clean, and LLAMA_CUBLAS=1 make. No change with the crash, I'm afraid. But thanks to everyone trying to figure it out!

EmerJK avatar Jun 14 '23 18:06 EmerJK

I looked into the issue and quite frankly I don't think it's worth the effort to fix. Currently the CUDA code runs everything as f32 by default and it would require quite a few changes to get good performance out of GPU-accelerated LoRAs. But if someone wants good performance they'll merge the LoRA anyways. Maybe once I implement better f16 support for something else I'll revisit this.
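To spell out the workaround, on the assumption that "merge" means the usual LoRA merge: bake the adapter into the base weights offline by computing W_merged = W_base + (alpha / r) * B * A once per affected tensor, then convert/quantize the merged model as usual. The resulting file contains only ordinary weight tensors, so GPU offloading behaves exactly like the no-LoRA runs above and no LoRA graph has to run on the GPU at all.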

JohannesGaessler avatar Jun 14 '23 19:06 JohannesGaessler

I looked into the issue and quite frankly I don't think it's worth the effort to fix. Currently the CUDA code runs everything as f32 by default and it would require quite a few changes to get good performance out of GPU-accelerated LoRAs.

Thank you so much for digging into it! It's a relief just knowing what's going on there. This is the first I'd seen anyone else mention it, and I was really starting to think that I was messing something up somewhere.

EmerJK avatar Jun 14 '23 19:06 EmerJK

Thanks for looking into this and finding the issue.

d-takemori avatar Jun 15 '23 06:06 d-takemori

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Apr 10 '24 01:04 github-actions[bot]