Starcoder mmap (and gpu) example
Not sure if this is really worthy of adding to the repo, but I have got mmap loading of starcoder-based models working, which allows their use on systems with 16 GB of RAM where it wasn't possible before.
I have also added a few lines to run the layers that fit on the GPU with CUDA or CLBlast (the CLBlast version is taken from koboldcpp). On my limited system this improves token latency from 380 ms/token to 330 ms/token (only 8 GB, so 20 layers offloaded).
I copied the mmap code mostly from https://github.com/ggerganov/llama.cpp/pull/613; I have only tested on Linux, but it seems to work as expected.
Apologies if the changes are hard to review; I figured making a new file was cleaner than adding changes to common.cpp/h that would only apply to this example, just to pass through options for turning mmap on and off. It is easier to read by diffing starcoder-mmap.cpp against main.cpp.
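For anyone curious what the mmap loading boils down to: the model file is mapped read-only and the tensor data then points into the mapping instead of being read into allocated buffers, so the OS pages weights in on demand. Below is a minimal Linux-only sketch of just the mapping step, with hypothetical helper names; it is not the code in this PR (the real logic is in starcoder-mmap.cpp and follows llama.cpp#613).

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct mapped_file {
    void * addr = nullptr;
    size_t size = 0;
};

// Map a model file read-only; tensor data can then point into out.addr.
static bool map_model_file(const char * path, mapped_file & out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return false;
    }
    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return false;
    }
    out.size = (size_t) st.st_size;
    // PROT_READ + MAP_PRIVATE: the weights are never written, so the pages
    // stay file-backed and can be evicted under memory pressure.
    out.addr = mmap(nullptr, out.size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); // the mapping remains valid after closing the fd
    if (out.addr == MAP_FAILED) {
        out.addr = nullptr;
        return false;
    }
    // Optional hint that the whole file will be read soon.
    madvise(out.addr, out.size, MADV_WILLNEED);
    return true;
}
```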
Cool!
Long term, the goal is to move support for all models into llama.cpp, where we already have the mmap and GPU machinery. And therefore, we try to keep the ggml examples simple and minimalistic.
But in any case this is useful - I'll think about if we want to merge it.
Awesome!
Is there a discussion somewhere about what shape adding new models to llama.cpp is going to take?
I thought about making this PR against that repo but wasn't sure where to even start: a model_name.cpp in the root directory with adaptations to examples/main.cpp? A new directory in examples/ with its own main.cpp? Or just model-specific functions in llama.cpp?
When I run:
./bin/starcoder -ngl 20 -t 24 -b 64 -m /data/WizardCoder-15B-1.0-GGML/WizardCoder-15B-1.0.ggmlv3.q4_0.bin -p "You are a Python development engineer. Please complete the corresponding function according to the function comment below. And add your answer written in \"\`\`\`\". \n\nfrom typing import List\n\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n \"\"\" Check if in given list of numbers, are any two numbers closer to each other than\n given threshold.\n >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n False\n >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n True\n \"\"\"\n\n\nYour answer:\n\`\`\`" --top_k 0 --top_p 0.95 --temp 0
the GPU utilization doesn't seem to change (it stays at 0%) during inference.
Is this correct?
./bin/starcoder -ngl 20 -t 24 -b 64 -m /data/WizardCoder-15B-1.0-GGML/WizardCoder-15B-1.0.ggmlv3.q4_0.bin
Run ./starcoder-mmap if you have built this branch.
Thank you for the reply.
After building with
cmake -DGGML_CUBLAS=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc .. && make -j4 starcoder starcoder-quantize starcoder-mmap
The CUDA driver version is 12.1 and the GPU is a Tesla T4.
When I run:
./bin/starcoder-mmap -ngl 20 -t 24 -b 64 -m /model/WizardCoder-15B-1.0-GGML/WizardCoder-15B-1.0.ggmlv3.q4_0.bin -p "You are a Python development engineer. Please complete the corresponding function according to the function comment below. And add your answer written in \"\`\`\`\". \n\nfrom typing import List\n\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n \"\"\" Check if in given list of numbers, are any two numbers closer to each other than\n given threshold.\n >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n False\n >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n True\n \"\"\"\n\n\nYour answer:\n\`\`\`" --top_k 0 --top_p 0.95 --temp 0
the inference didn't work correctly:
Calling starcoder_eval
You are a Python development engineer. Please complete the corresponding function according to the function comment below. And add your answer written in "```". \n\nfrom typing import List\n\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n """ Check if in given list of numbers, are any two numbers closer to each other than\n given threshold.\n >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n False\n >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n True\n """\n\n\nYour answer:\n```<|endoftext|>
main: mem per token = 462284 bytes
main: load time = 5219.62 ms
main: sample time = 0.36 ms
main: predict time = 0.00 ms / -nan ms per token
main: total time = 18967.41 ms
When I try your prompt and parameters using either starcoder or starcoder-mmap, the output is <|endoftext|>, so this appears unrelated to these changes.
You can try a positive temperature if you would like more output from that particular prompt.
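For context, temperature scales the logits before sampling: at --temp 0 the sampler is effectively greedy and can latch onto <|endoftext|> immediately, while a positive temperature flattens the distribution and makes other tokens reachable. A minimal sketch of the usual formulation (not necessarily the exact sampler in this example):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Illustration only: convert logits to sampling probabilities with a
// temperature. temp -> 0 approaches greedy (argmax) decoding.
std::vector<float> softmax_with_temperature(std::vector<float> logits, float temp) {
    const float t = std::max(temp, 1e-6f); // guard against division by zero at temp = 0
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    float sum = 0.0f;
    for (float & l : logits) {
        l = std::exp((l - max_logit) / t); // subtract max for numerical stability
        sum += l;
    }
    for (float & l : logits) {
        l /= sum;
    }
    return logits;
}
```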
It works well when the temperature is 0.2.
Thank you for the awesome job. The acceleration effect is good (ngl 40: 300 ms/token -> 180 ms/token).
Are there any plans to support multiple GPUs in the future?
@JohannesGaessler
This branch demonstrates sample GPU inference of Starcoder. I just synced it with the latest CUDA code from master and the inference breaks. I tried to trace it and found that if I disable the mul_mat_vec_q kernels, it works:
diff --git a/src/ggml-cuda.cu b/src/ggml-cuda.cu
index dc4b773..a0b4988 100644
--- a/src/ggml-cuda.cu
+++ b/src/ggml-cuda.cu
@@ -2459,7 +2459,7 @@ inline void ggml_cuda_op_mul_mat_vec(
src0->type == GGML_TYPE_Q5_1 ||
src0->type == GGML_TYPE_Q8_0;
- const bool use_mul_mat_vec_q = g_compute_capabilities[id] >= 610 && mul_mat_vec_q_implemented;
+ const bool use_mul_mat_vec_q = false;
#endif
if (use_mul_mat_vec_q) {
So it might indicate some issue in those kernels and it's probably worth looking into it.
Easiest steps to repro:
- Get a Q4_0 model from here: https://huggingface.co/TheBloke/Starcoderplus-Guanaco-GPT4-15B-V1.0-GGML/tree/main
- Create a sample prompt file p-prompt.txt that contains the following:
### Human: Write a function to check a C string for valid UTF-8 encoding without using external libs in C++.
### Assistant: Sure, here's the function:
```cpp
- Build with cmake -DGGML_CUBLAS=ON
- Run:
./bin/starcoder-mmap -t 8 -m models/starcoder/starcoderplus-guanaco-gpt4.ggmlv1.q4_0.bin -n 4096 --top_p 0.3 --temp 1 --top_k 9999 -f p-prompt.txt -s 123 -ngl 1
You need to offload just 1 layer to trigger the issue, but you can also offload more. The above steps currently generate gibberish and applying the single-line patch from above fixes it.