llama.cpp
[User] Generating embeddings is not using GPU when built with LLAMA_METAL=ON
Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [x] I carefully followed the README.md.
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
I've been trying out the Metal implementation on an M1 Mac, and `main` is working fine, but I would also like to be able to get embeddings. Accelerating this with Metal would be fantastic for me.
I tried to understand what would need to change, but I'm not conversant enough with the code to figure it out. Happy to try to make the changes myself and submit a PR if that would be helpful.
Current Behavior
As far as I can tell, `embeddings` does not use Metal. At least, the GPU usage stays at 0% when I give the `-ngl 1` parameter.
I should also mention that getting embeddings through the `llama-cpp-python` wrapper also does not use the GPU, while a 'normal' inference of the model does.
I haven't tested whether this is the case with a CUDA backend, but I can do so if that would be useful information.
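For context, this is roughly how I'm getting embeddings through the wrapper (a minimal sketch; the model path and parameter values are just illustrative, and I'm going from memory on the exact argument names):

```python
# Rough sketch of the llama-cpp-python embedding call that stays on the CPU for me.
# n_gpu_layers is the wrapper's counterpart to -ngl; embedding=True enables the
# embedding path. Model path and values are illustrative.
from llama_cpp import Llama

llm = Llama(
    model_path="./Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_0.bin",
    n_ctx=1024,
    n_gpu_layers=1,   # same intent as -ngl 1 on the CLI
    embedding=True,   # needed for create_embedding / embed
)

# create_embedding returns an OpenAI-style response dict; embed() gives the raw vector.
result = llm.create_embedding("Long noncoding RNAs (lncRNAs) regulate gene expression ...")
vector = result["data"][0]["embedding"]
print(len(vector))  # should be n_embd, i.e. 5120 for this 13B model
```

With that same object, a normal completion call does use the GPU, which is why I think the embedding path specifically is the issue.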
Environment and Context
I'm running on a 32GB M1 MacBook Pro.
- python: Python 3.10.10
- make: GNU Make 3.81
- cmake: cmake version 3.25.2
- g++: Apple clang version 14.0.0 (clang-1400.0.29.202), Target: arm64-apple-darwin22.5.0
Failure Information (for bugs)
I'm running:
`./bin/embedding -f abs -c 1024 -ngl 1 -m ./Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_0.bin`
The content of `abs` is:
Long noncoding RNAs (lncRNAs) regulate gene expression via their RNA product or through transcriptional interference, yet a strategy to differentiate these two processes is lacking. To address this, we used multiple small interfering RNAs (siRNAs) to silence GNG12-AS1, a nuclear lncRNA transcribed in an antisense orientation to the tumour-suppressor DIRAS3. Here we show that while most siRNAs silence GNG12-AS1 post-transcriptionally, siRNA complementary to exon 1 of GNG12-AS1 suppresses its transcription by recruiting Argonaute 2 and inhibiting RNA polymerase II binding. Transcriptional, but not post-transcriptional, silencing of GNG12-AS1 causes concomitant upregulation of DIRAS3, indicating a function in transcriptional interference. This change in DIRAS3 expression is sufficient to impair cell cycle progression. In addition, the reduction in GNG12-AS1 transcripts alters MET signalling and cell migration, but these are independent of DIRAS3. Thus, differential siRNA targeting of a lncRNA allows dissection of the functions related to the process and products of its transcription.
Steps to Reproduce
1. Build with `cmake ../ -DLLAMA_METAL=ON -DBUILD_SHARED_LIBS=ON` (shared libs is to work around an issue with the Python binding; hopefully not relevant to this).
2. Run `./bin/embedding -f abs -c 1024 -ngl 1 -m ./Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_0.bin`
Failure Logs
Metal does appear to be loading, and I get embeddings, but there's no GPU usage:
main: build = 635 (5c64a09)
main: seed = 1686154509
llama.cpp: loading model from ./Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 1024
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: mem required = 9031.70 MB (+ 1608.00 MB per state)
.
llama_init_from_file: kv self size = 800.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading './ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x132e063b0
ggml_metal_init: loaded kernel_mul 0x132e06ad0
ggml_metal_init: loaded kernel_mul_row 0x132f08330
ggml_metal_init: loaded kernel_scale 0x132e06ed0
ggml_metal_init: loaded kernel_silu 0x132e073f0
ggml_metal_init: loaded kernel_relu 0x132e07910
ggml_metal_init: loaded kernel_soft_max 0x132f08b90
ggml_metal_init: loaded kernel_diag_mask_inf 0x132e07fb0
ggml_metal_init: loaded kernel_get_rows_f16 0x132f09110
ggml_metal_init: loaded kernel_get_rows_q4_0 0x132e08650
ggml_metal_init: loaded kernel_rms_norm 0x132e08eb0
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x132e099f0
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x100f056b0
ggml_metal_init: loaded kernel_rope 0x132e09570
ggml_metal_init: loaded kernel_cpy_f32_f16 0x132e0ad20
ggml_metal_init: loaded kernel_cpy_f32_f32 0x132e0b5d0
ggml_metal_add_buffer: allocated 'data ' buffer, size = 6984.06 MB
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1024.00 MB
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 802.00 MB
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 512.00 MB
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 512.00 MB
[ Big matrix ]
llama_print_timings: load time = 27444.70 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token)
llama_print_timings: prompt eval time = 26736.64 ms / 602 tokens ( 44.41 ms per token)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token)
llama_print_timings: total time = 27446.69 ms
I'm simply getting
llama.cpp: loading model from ./llms/guanaco-33B.bin
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '(null)'
ggml_metal_init: error: Error Domain=NSCocoaErrorDomain Code=258 "The file name is invalid."
??
@jacobfriedman Do you have `ggml-metal.metal` in the `bin` directory (or, I guess, next to wherever you're running `embeddings` from)? If I move it out I get that error, and I saw the same thing with the `llama-cpp-python` wrapper until I saw this: https://github.com/abetlen/llama-cpp-python/issues/317#issuecomment-1576970558
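For what it's worth, the gist of the fix on my side was just making sure `ggml-metal.metal` ends up next to whatever is being run. Something like this after a fresh build (a rough sketch; the paths are illustrative and depend on where your llama.cpp checkout and build directory live):

```python
# Workaround sketch: copy ggml-metal.metal next to the built binaries so
# ggml_metal_init doesn't try to load '(null)'. Paths are illustrative;
# adjust them for your own checkout/build layout.
import shutil
from pathlib import Path

repo_root = Path("..")        # llama.cpp checkout (the shader ships at the repo root)
build_bin = Path("./bin")     # where ./bin/embedding and friends live

shader_src = repo_root / "ggml-metal.metal"
shader_dst = build_bin / "ggml-metal.metal"

if not shader_dst.exists():
    shutil.copy(shader_src, shader_dst)
    print(f"copied {shader_src} -> {shader_dst}")
else:
    print("ggml-metal.metal already in place")
```

For the Python wrapper the linked comment has the details, but the idea is the same: the shader file has to be somewhere the process actually looks for it.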
I wasn't running with Python. Will investigate in that thread, thank you for the direction
This issue was closed because it has been inactive for 14 days since being marked as stale.