ggml : add SSM Metal kernels

Open ggerganov opened this issue 1 year ago • 0 comments

target: #8526

Straightforward Metal implementation of SSM_CONV and SSM_SCAN using single-threaded kernels, mimicking the CPU implementation. Lots of room for further optimizations; for now the focus is on ensuring correctness.
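For readers unfamiliar with the two ops, here is a minimal single-threaded reference sketch in Python of what a sequential SSM_CONV and SSM_SCAN compute for one channel. It is not the Metal code from this PR. The function names `ssm_conv_ref` and `ssm_scan_ref`, the per-channel list layout, and the softplus applied to `dt` are assumptions made for illustration, so treat it as a way to reason about correctness rather than as the actual kernels.

```python
import math

def ssm_conv_ref(x_padded, weights):
    """Causal depthwise 1-D convolution for a single channel (hypothetical helper).

    x_padded: previous conv state (d_conv - 1 values) followed by the new inputs.
    weights:  d_conv filter taps for this channel.
    Returns one output value per new token.
    """
    d_conv = len(weights)
    n_tokens = len(x_padded) - d_conv + 1
    return [
        sum(x_padded[t + k] * weights[k] for k in range(d_conv))
        for t in range(n_tokens)
    ]

def ssm_scan_ref(state, x, dt, A, B, C):
    """Sequential selective scan for a single channel (hypothetical helper).

    state: initial hidden state, d_state values.
    x, dt: one scalar per token.
    A:     d_state values.
    B, C:  one list of d_state values per token.
    Returns (outputs per token, final state).
    """
    state = list(state)  # copy so the caller's state is not mutated
    y = []
    for t in range(len(x)):
        # Assumption: dt goes through softplus, with a cutoff for large inputs.
        dt_sp = math.log1p(math.exp(dt[t])) if dt[t] <= 20.0 else dt[t]
        x_dt = x[t] * dt_sp
        acc = 0.0
        for i in range(len(state)):
            # Decay the previous state, then add the input contribution.
            state[i] = state[i] * math.exp(dt_sp * A[i]) + B[t][i] * x_dt
            acc += state[i] * C[t][i]
        y.append(acc)
    return y, state
```

Because every token depends on the state left by the previous one, the scan is naturally sequential along the token axis, which is why a one-thread-per-channel kernel is the simplest version to get right first.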

./llama-batched \
  -m ./models/mamba-130m/ggml-model-f16.gguf \
  -p "Hello, my name is" -np 16 -n 32

main: n_predict = 32, n_ctx = 448, n_batch = 32, n_parallel = 16, n_kv_req = 437

Hello, my name is

main: generating 16 sequences ...

main: stream 0 finished at n_cur = 32
main: stream 1 finished at n_cur = 32
main: stream 2 finished at n_cur = 32
main: stream 3 finished at n_cur = 32
main: stream 4 finished at n_cur = 32
main: stream 5 finished at n_cur = 32
main: stream 6 finished at n_cur = 32
main: stream 7 finished at n_cur = 32
main: stream 8 finished at n_cur = 32
main: stream 9 finished at n_cur = 32
main: stream 10 finished at n_cur = 32
main: stream 11 finished at n_cur = 32
main: stream 12 finished at n_cur = 32
main: stream 13 finished at n_cur = 32
main: stream 14 finished at n_cur = 32
main: stream 15 finished at n_cur = 32

sequence 0:

Hello, my name is Tiffany. I'm a mother of three and a retired teacher. I'm a member of the American Indian and Alaska Native (AI

sequence 1:

Hello, my name is John. I am a freelance writer and editor. I have a passion for writing and have been writing since I was a child. I

sequence 2:

Hello, my name is Renee. I'm a full-time writer, and I'm currently working on a new book. I'm also a graduate

sequence 3:

Hello, my name is Jules. I'm a writer and illustrator. I have a passion for the arts and I love to travel. I love to

sequence 4:

Hello, my name is Renee. I am a single mom of two boys. I am trying to figure out how to make this work. I am

sequence 5:

Hello, my name is Dr. Sonia. I'm a doctor in the University of Medicine and Dentistry of New Jersey. I'm here to help you

sequence 6:

Hello, my name is Nick. I'm a member of the
  National Association of Women in the United States of America. I'm
  a member

sequence 7:

Hello, my name is Jadine. I'm a real person, and I'm here to help you. I'm here to help you get the best

sequence 8:

Hello, my name is Roxane and I'm a young woman with a love of all things chocolate. I've been a member of the Chocolate Club for

sequence 9:

Hello, my name is John. I'm a professional musician, and I'm looking for a new job. I'm a musician, and I'm looking for

sequence 10:

Hello, my name is Dr. Paul, and I'm a doctor in the area of cardiac surgery. I'm here to help you. I'm here to

sequence 11:

Hello, my name is Daniel and I'm a teacher in an elementary school in the United States. I've been reading about the dangers of the internet for the

sequence 12:

Hello, my name is Sven, and I'm a member of the Sven-Gustavsson Foundation. I'm here to talk about the future

sequence 13:

Hello, my name is Nico, I'm a professional photographer, I work in the studio of the famous photographer, Josef Krammer, who is

sequence 14:

Hello, my name is John. I'm a big fan of your work. I'm looking for a job. I'm looking for a good, honest man

sequence 15:

Hello, my name is John. I'm a newbie to the Internet, and I'm trying to learn how to use it.
I'm trying to

main: decoded 432 tokens in 0.71 s, speed: 609.55 t/s

llama_print_timings:        load time =     137.83 ms
llama_print_timings:      sample time =      10.18 ms /   448 runs   (    0.02 ms per token, 44025.16 tokens per second)
llama_print_timings: prompt eval time =     727.16 ms /   437 tokens (    1.66 ms per token,   600.97 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =     845.80 ms /   438 tokens

ggml_metal_free: deallocating

./llama-perplexity \
  -m ./models/mamba-130m/ggml-model-f16.gguf \
  -f build/wikitext-2-raw/wiki.test.raw -ngl 99

perplexity: tokenizing the input ..
perplexity: tokenization took 950.02 ms
perplexity: calculating perplexity over 650 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 0.55 seconds per pass - ETA 1.48 minutes
...
Final estimate: PPL = 25.0894 +/- 0.18559

Jul 17 '24 18:07 ggerganov