Starcoder mmap (and gpu) example
Not sure if this is really worthy of adding to the repo, but I have got mmap loading of starcoder-based models working, which allows their use on systems with 16 GB of RAM where it wasn't possible before.
I have also added a few lines to run the layers that fit on the GPU with CUDA or CLBlast (the CLBlast version is taken from koboldcpp). On my limited system this improves token latency from 380 ms/token to 330 ms/token (only 8 GB, so 20 layers offloaded).
I copied the mmap code mostly from https://github.com/ggerganov/llama.cpp/pull/613; I have only tested on Linux, but it seems to work as expected.
Apologies if the changes are hard to review; I figured making a new file was cleaner than adding changes to common.cpp/h that would only apply to this example, just to pass through options for turning mmap on and off. It is easier to read by diffing starcoder-mmap.cpp against main.cpp.
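For anyone curious what the mmap loading boils down to: the model file is mapped read-only and the tensor data then points into the mapping instead of being read into allocated buffers, so the OS pages weights in on demand. Below is a minimal Linux-only sketch of just the mapping step, with hypothetical helper names; it is not the code in this PR (the real logic is in starcoder-mmap.cpp and follows llama.cpp#613).

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct mapped_file {
    void * addr = nullptr;
    size_t size = 0;
};

// Map a model file read-only; tensor data can then point into out.addr.
static bool map_model_file(const char * path, mapped_file & out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return false;
    }
    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return false;
    }
    out.size = (size_t) st.st_size;
    // PROT_READ + MAP_PRIVATE: the weights are never written, so the pages
    // stay file-backed and can be evicted under memory pressure.
    out.addr = mmap(nullptr, out.size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); // the mapping remains valid after closing the fd
    if (out.addr == MAP_FAILED) {
        out.addr = nullptr;
        return false;
    }
    // Optional hint that the whole file will be read soon.
    madvise(out.addr, out.size, MADV_WILLNEED);
    return true;
}
```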
Cool!
Long term, the goal is to move support for all models into llama.cpp, where we already have the mmap and GPU machinery. And therefore, we try to keep the ggml examples simple and minimalistic.
But in any case this is useful - I'll think about if we want to merge it.
Awesome!
Is there a discussion somewhere about what shape adding new models to llama.cpp is going to take?
I thought about making this PR against that repo but wasn't sure where to even start: a model_name.cpp in the root directory with adaptations to examples/main.cpp? A new directory in examples/ with its own main.cpp? Or just model-specific functions in llama.cpp?
When I run:
./bin/starcoder -ngl 20 -t 24 -b 64 -m /data/WizardCoder-15B-1.0-GGML/WizardCoder-15B-1.0.ggmlv3.q4_0.bin -p "You are a Python development engineer. Please complete the corresponding function according to the function comment below. And add your answer written in \"\`\`\`\". \n\nfrom typing import List\n\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n \"\"\" Check if in given list of numbers, are any two numbers closer to each other than\n given threshold.\n >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n False\n >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n True\n \"\"\"\n\n\nYour answer:\n\`\`\`" --top_k 0 --top_p 0.95 --temp 0
the GPU utilization doesn't seem to change (it stays at 0%) during inference.
Is this correct?
./bin/starcoder -ngl 20 -t 24 -b 64 -m /data/WizardCoder-15B-1.0-GGML/WizardCoder-15B-1.0.ggmlv3.q4_0.bin
Run ./starcoder-mmap if you have built this branch.
Thank you for the reply.
After building with
cmake -DGGML_CUBLAS=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc .. && make -j4 starcoder starcoder-quantize starcoder-mmap
The CUDA driver version is 12.1 and the GPU is a Tesla T4.
When I run:
./bin/starcoder-mmap -ngl 20 -t 24 -b 64 -m /model/WizardCoder-15B-1.0-GGML/WizardCoder-15B-1.0.ggmlv3.q4_0.bin -p "You are a Python development engineer. Please complete the corresponding function according to the function comment below. And add your answer written in \"\`\`\`\". \n\nfrom typing import List\n\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n \"\"\" Check if in given list of numbers, are any two numbers closer to each other than\n given threshold.\n >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n False\n >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n True\n \"\"\"\n\n\nYour answer:\n\`\`\`" --top_k 0 --top_p 0.95 --temp 0
the inference didn't work correctly:
Calling starcoder_eval
You are a Python development engineer. Please complete the corresponding function according to the function comment below. And add your answer written in "```". \n\nfrom typing import List\n\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n """ Check if in given list of numbers, are any two numbers closer to each other than\n given threshold.\n >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n False\n >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n True\n """\n\n\nYour answer:\n```<|endoftext|>
main: mem per token = 462284 bytes
main: load time = 5219.62 ms
main: sample time = 0.36 ms
main: predict time = 0.00 ms / -nan ms per token
main: total time = 18967.41 ms
When I try your prompt and parameters using either starcoder or starcoder-mmap, the output is <|endoftext|>, so this appears unrelated to these changes.
You can try a positive temperature if you would like more output from that particular prompt.
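For context, temperature scales the logits before sampling: at --temp 0 the sampler is effectively greedy and can latch onto <|endoftext|> immediately, while a positive temperature flattens the distribution and makes other tokens reachable. A minimal sketch of the usual formulation (not necessarily the exact sampler in this example):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Illustration only: convert logits to sampling probabilities with a
// temperature. temp -> 0 approaches greedy (argmax) decoding.
std::vector<float> softmax_with_temperature(std::vector<float> logits, float temp) {
    const float t = std::max(temp, 1e-6f); // guard against division by zero at temp = 0
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    float sum = 0.0f;
    for (float & l : logits) {
        l = std::exp((l - max_logit) / t); // subtract max for numerical stability
        sum += l;
    }
    for (float & l : logits) {
        l /= sum;
    }
    return logits;
}
```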
It works well when the temperature is 0.2.
Thank you for the awesome job. The acceleration effect is good (ngl 40: 300 ms/token -> 180 ms/token).
Are there any plans to support multiple GPUs in the future?
@JohannesGaessler
This branch demonstrates sample GPU inference of Starcoder. I just synced it with the latest CUDA code from master and the inference breaks. I tried to trace it and found that if I disable the mul_mat_vec_q kernels, it works:
diff --git a/src/ggml-cuda.cu b/src/ggml-cuda.cu
index dc4b773..a0b4988 100644
--- a/src/ggml-cuda.cu
+++ b/src/ggml-cuda.cu
@@ -2459,7 +2459,7 @@ inline void ggml_cuda_op_mul_mat_vec(
src0->type == GGML_TYPE_Q5_1 ||
src0->type == GGML_TYPE_Q8_0;
- const bool use_mul_mat_vec_q = g_compute_capabilities[id] >= 610 && mul_mat_vec_q_implemented;
+ const bool use_mul_mat_vec_q = false;
#endif
if (use_mul_mat_vec_q) {
So it might indicate some issue in those kernels and it's probably worth looking into it.
Easiest steps to repro:
- Get a Q4_0 model from here: https://huggingface.co/TheBloke/Starcoderplus-Guanaco-GPT4-15B-V1.0-GGML/tree/main
- Create a sample prompt file p-prompt.txt that contains the following:
### Human: Write a function to check a C string for valid UTF-8 encoding without using external libs in C++.
### Assistant: Sure, here's the function:
```cpp
- Build with cmake -DGGML_CUBLAS=ON
- Run:
./bin/starcoder-mmap -t 8 -m models/starcoder/starcoderplus-guanaco-gpt4.ggmlv1.q4_0.bin -n 4096 --top_p 0.3 --temp 1 --top_k 9999 -f p-prompt.txt -s 123 -ngl 1
You need to offload just 1 layer to trigger the issue, but you can also offload more. The above steps currently generate gibberish and applying the single-line patch from above fixes it.