
Generating embeddings is not using GPU when built with LLAMA_METAL=ON

afg1 opened this issue 1 year ago • 3 comments

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [x] I carefully followed the README.md.
  • [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

I've been trying out the Metal implementation on an M1 Mac, and main is working fine, but I would also like to be able to get embeddings. Accelerating this with Metal would be fantastic for me.

I tried to understand what would need to change, but I'm not conversant enough with the code to figure it out. Happy to try to make the changes myself and submit a PR if that would be helpful.

Current Behavior

As far as I can tell, the embedding example does not use Metal. At least, GPU usage stays at 0% when I pass the -ngl 1 parameter.

I should also mention that using the llama-cpp-python wrapper to get embeddings also does not use GPU, while a 'normal' inference of the model does.

I haven't tested whether this is also the case with a CUDA backend, but I can do so if that would be useful information.
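For reference, here's roughly how I've been watching GPU usage while the embedding run is going (a sketch using macOS powermetrics; the sampler and interval flags are just what I happen to use):

```sh
# Terminal 1: sample Apple GPU activity every second (requires sudo)
sudo powermetrics --samplers gpu_power -i 1000

# Terminal 2: run the embedding binary with one layer offloaded to Metal
./bin/embedding -f abs -c 1024 -ngl 1 -m ./Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_0.bin
```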

Environment and Context

I'm running on a 32GB M1 MacBook Pro.

  • python = Python 3.10.10
  • make = GNU Make 3.81
  • cmake = cmake version 3.25.2
  • g++ = Apple clang version 14.0.0 (clang-1400.0.29.202), Target: arm64-apple-darwin22.5.0

Failure Information (for bugs)

I'm running ./bin/embedding -f abs -c 1024 -ngl 1 -m ./Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_0.bin

content of abs is:

Long noncoding RNAs (lncRNAs) regulate gene expression via their RNA product or through transcriptional interference, yet a strategy to differentiate these two processes is lacking. To address this, we used multiple small interfering RNAs (siRNAs) to silence GNG12-AS1, a nuclear lncRNA transcribed in an antisense orientation to the tumour-suppressor DIRAS3. Here we show that while most siRNAs silence GNG12-AS1 post-transcriptionally, siRNA complementary to exon 1 of GNG12-AS1 suppresses its transcription by recruiting Argonaute 2 and inhibiting RNA polymerase II binding. Transcriptional, but not post-transcriptional, silencing of GNG12-AS1 causes concomitant upregulation of DIRAS3, indicating a function in transcriptional interference. This change in DIRAS3 expression is sufficient to impair cell cycle progression. In addition, the reduction in GNG12-AS1 transcripts alters MET signalling and cell migration, but these are independent of DIRAS3. Thus, differential siRNA targeting of a lncRNA allows dissection of the functions related to the process and products of its transcription.

Steps to Reproduce

build with cmake ../ -DLLAMA_METAL=ON -DBUILD_SHARED_LIBS=ON

(shared libs is to work around an issue with the Python binding - hopefully not relevant to this)

run ./bin/embedding -f abs -c 1024 -ngl 1 -m ./Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_0.bin
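Putting those steps together, this is roughly the full sequence (the build directory name is just my local choice; abs is the plain-text file quoted above):

```sh
# From a llama.cpp checkout
mkdir build && cd build
cmake ../ -DLLAMA_METAL=ON -DBUILD_SHARED_LIBS=ON
cmake --build . --config Release

# Run the embedding example with one layer offloaded to Metal
./bin/embedding -f abs -c 1024 -ngl 1 -m ./Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_0.bin
```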

Failure Logs

Metal does appear to be loading, and I get embeddings, but there is no GPU usage:

main: build = 635 (5c64a09)
main: seed  = 1686154509
llama.cpp: loading model from ./Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 1024
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: mem required  = 9031.70 MB (+ 1608.00 MB per state)
.
llama_init_from_file: kv self size  =  800.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading './ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x132e063b0
ggml_metal_init: loaded kernel_mul                            0x132e06ad0
ggml_metal_init: loaded kernel_mul_row                        0x132f08330
ggml_metal_init: loaded kernel_scale                          0x132e06ed0
ggml_metal_init: loaded kernel_silu                           0x132e073f0
ggml_metal_init: loaded kernel_relu                           0x132e07910
ggml_metal_init: loaded kernel_soft_max                       0x132f08b90
ggml_metal_init: loaded kernel_diag_mask_inf                  0x132e07fb0
ggml_metal_init: loaded kernel_get_rows_f16                   0x132f09110
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x132e08650
ggml_metal_init: loaded kernel_rms_norm                       0x132e08eb0
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x132e099f0
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x100f056b0
ggml_metal_init: loaded kernel_rope                           0x132e09570
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x132e0ad20
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x132e0b5d0
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  6984.06 MB
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =  1024.00 MB
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =   802.00 MB
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   512.00 MB
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =   512.00 MB

[ Big matrix ]

llama_print_timings:        load time = 27444.70 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings: prompt eval time = 26736.64 ms /   602 tokens (   44.41 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings:       total time = 27446.69 ms

afg1 commented Jun 07 '23 19:06

I'm simply getting

llama.cpp: loading model from ./llms/guanaco-33B.bin

ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '(null)'
ggml_metal_init: error: Error Domain=NSCocoaErrorDomain Code=258 "The file name is invalid."

??

jacobfriedman commented Jun 08 '23 00:06

@jacobfriedman Do you have ggml-metal.metal in the bin directory (or, I guess, next to wherever you're running embedding from)? If I move it out I get that error, and I saw the same thing with the llama-cpp-python wrapper until I found this: https://github.com/abetlen/llama-cpp-python/issues/317#issuecomment-1576970558
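In case it helps, this is the kind of thing I mean (paths are just an example of my local layout):

```sh
# Copy the Metal shader source next to the binary (or into the directory you run from);
# without it ggml_metal_init tries to load '(null)' and fails with NSCocoaErrorDomain Code=258
cp /path/to/llama.cpp/ggml-metal.metal ./bin/

# Then re-run whatever you were running, e.g.
./bin/main -ngl 1 -m ./llms/guanaco-33B.bin -p "test"
```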

afg1 commented Jun 08 '23 09:06

I wasn't running with Python. Will investigate in that thread; thank you for the direction.

jacobfriedman commented Jun 08 '23 14:06

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] commented Apr 10 '24 01:04