[bounty] CPU inference support, Mac M1/M2 inference support
There are several projects aiming to make inference on CPU efficient.
The first part is research:
- Which project works better,
- Is compatible with the Refact license,
- Doesn't bloat the Docker image too much,
- Allows scratchpads similar to how inference_hf.py does them (needs a callback that streams output and allows stopping),
- Does it include Mac M1/M2 support, or does it make sense to address Mac separately?
Please finish the first part and get a "go-ahead" for the second part.
The second part is implementation:
- Script similar to inference_hf.py,
- Little code,
- Few dependencies,
- Demonstrate that it works with the Refact-1.6b model, as well as StarCoder (at least the smaller sizes),
- Integration with UI and watchdog is a plus, but efficient inference is obviously the priority.
/bounty $2000
💎 $2,000 bounty created by olegklimov
🙋 If you start working on this, comment /attempt #77 to notify everyone
👉 To claim this bounty, submit a pull request that includes the text /claim #77 somewhere in its body
📝 Before proceeding, please make sure you can receive payouts in your country
💵 Payment arrives in your account 2-5 days after the bounty is rewarded
💯 You keep 100% of the bounty award
🙏 Thank you for contributing to smallcloudai/refact!
| Attempt | Started (GMT+0) | Solution |
|---|---|---|
| 🔴 @Akshay-Patel-dev | Aug 25, 2023, 11:44:51 PM | WIP |
| 🟢 @shobhit9957 | Aug 26, 2023, 10:38:57 AM | WIP |
| 🟢 @benxh1995 | Sep 4, 2023, 11:51:23 PM | WIP |
| 🟢 @ds5t5 | Sep 25, 2023, 1:52:54 AM | #122 |
/attempt #77 Hey @olegklimov, I would like to contribute. Can you please provide some more description of this project? I'm a beginner here...
Note: The user @Akshay-Patel-dev is already attempting to complete issue #77 and claim the bounty. If you attempt to complete the same issue, there is a chance that @Akshay-Patel-dev will complete the issue first, and be awarded the bounty. We recommend discussing with @Akshay-Patel-dev and potentially collaborating on the same solution versus creating an alternate solution.
> I'm a beginner here...
You can start by installing it and trying it out.
But unless you are already familiar with CPU inference libraries and LLMs in general, the research might take you quite a long time.
I forked the project and performed the steps in the contributing.md file, but I'm getting errors and I'm unable to run it locally.
Because of the error I encountered, I added install_requires=[ "triton>=12 0.0.3", ] to the setup.py file. Do you think adding this to the main branch is necessary?
CPU project names: ggml, ctransformers
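As an illustration of the "callback that streams output and allows to stop" criterion, here is a minimal sketch of how a ctransformers-based CPU loop could look; the model path, model_type, and callback are placeholders rather than anything from this repository:

```python
# Minimal sketch, not the actual implementation: CPU inference with
# ctransformers, streaming text pieces through a callback that can stop early.
from ctransformers import AutoModelForCausalLM

# Placeholder model file; any StarCoder-family GGML model file would go here.
llm = AutoModelForCausalLM.from_pretrained(
    "path/to/ggml-model.bin",
    model_type="starcoder",
)

def generate(prompt, on_token, max_new_tokens=50):
    """Stream pieces of text to `on_token`; stop as soon as it returns False."""
    text = ""
    for piece in llm(prompt, max_new_tokens=max_new_tokens, stream=True):
        text += piece
        if not on_token(piece):
            break
    return text

def print_and_continue(piece):
    print(piece, end="", flush=True)
    return True  # return False here to stop generation early

print(generate("def hello_world():\n", print_and_continue))
```

ggml itself is a C library; ctransformers and llama-cpp-python are the usual Python bindings built on top of it.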
/attempt #77
I've got a preliminary version working with ctransformers. Inference on my M1 Mac for Starcoder is almost impossibly slow. The Refact-1.6b model still doesn't have GGUF or GGML versions available. Any attempts to make my own quants have failed using the official quantization scripts.
I can have a codellama FIM 7B demo up and running soon.
Options
Note: The user @shobhit9957 is already attempting to complete issue #77 and claim the bounty. If you attempt to complete the same issue, there is a chance that @shobhit9957 will complete the issue first, and be awarded the bounty. We recommend discussing with @shobhit9957 and potentially collaborating on the same solution versus creating an alternate solution.
An interesting link: https://github.com/ggerganov/llama.cpp/discussions/2948 -- how to convert HuggingFace model to GGUF format
Example of GGUFs of all sizes: https://huggingface.co/TheBloke/Llama-2-7B-GGUF
@olegklimov
If this is still open, I might try it out.
Would the bounty claim still count for model conversion to GGUF format?
I understand it's first come, first serve. I'm just wondering if you're looking for a conversion script or if you just want general CPU support?
Quantization is a bit different from CPU inference, and I'm just looking for clarity on the scope.
If you just want quantization, then I can look into creating a conversion script and I'll submit an attempt if I get it working and this is still open.
Hi @teleprint-me
Someone is trying the heavy lifting here: https://github.com/ggerganov/llama.cpp/issues/3061
@olegklimov
Yes, I saw that. That's why I'm asking.
I know that in order to do it, one would need to use the GGUF library to convert the tensors.
It would require a custom script, like the others that already exist in the llama.cpp repository.
Your original request was in reference to the inference_hf.py script, which is why I was asking for clarification.
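For context, the llama.cpp conversion scripts generally build the output file with the gguf Python package. A very rough sketch of that flow is below; the architecture name, hyperparameter values, and tensor naming are assumptions, and a real converter has to map every tensor to the names llama.cpp expects:

```python
# Rough sketch of an HF -> GGUF conversion flow, not a working converter.
import gguf
import torch
from transformers import AutoModelForCausalLM

# trust_remote_code may be needed while the model ships custom modelling code.
model = AutoModelForCausalLM.from_pretrained(
    "smallcloudai/Refact-1_6b-fim", torch_dtype=torch.float16, trust_remote_code=True
)

writer = gguf.GGUFWriter("ggml-model-f16.gguf", "refact")  # arch name is an assumption
writer.add_name("Refact-1.6B-fim")
writer.add_context_length(4096)    # placeholder hyperparameters
writer.add_embedding_length(2048)
writer.add_block_count(32)

for name, tensor in model.state_dict().items():
    # A real script renames HF tensor names to the llama.cpp naming scheme here.
    writer.add_tensor(name, tensor.to(torch.float32).numpy())

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()
```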
@teleprint-me We are moving away from server-side scratchpads, in favor of client-side scratchpads. The plugins that can do it should land next week or a week after. There still has to be a script that takes the tasks to do, using completions_wait_batch() (in inference_worker.py), and streams the results, but only simple left-to-right completion will be required soon.
In short, the requirement "Script similar to inference_hf.py" can now read "Script similar to inference_hf.py, but only /v1/completions needs to work".
Script to test:
curl http://127.0.0.1:8008/v1/completions -k \
-H 'Content-Type: application/json' \
-d '{
"model": "smallcloudai/Refact-1_6b-fim",
"prompt": "def hello_world():\n \"\"\"\n This function prints \"Hello World!!!\" and brews coffee.\n \"\"\"",
"stream": true,
"echo": false,
"stop": ["\n\n"],
"temperature": 0.8,
"max_tokens": 50
}'
Both streaming and non-streaming should work, and the CPU output should match the current GPU output -- that sounds like a well-defined criterion.
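For illustration, here is a rough sketch of a CPU backend that could answer that curl call, using llama-cpp-python and FastAPI. This is an assumption-laden example, not the repository's actual inference worker; the model path is a placeholder:

```python
# Illustration only: a minimal /v1/completions endpoint served from CPU with
# llama-cpp-python + FastAPI. Run with: uvicorn server:app --port 8008
import json

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse, StreamingResponse
from llama_cpp import Llama

app = FastAPI()
llm = Llama(model_path="Refact-1_6B-fim/ggml-model-f16.gguf", n_ctx=2048)  # placeholder

@app.post("/v1/completions")
async def completions(request: Request):
    body = await request.json()
    kwargs = dict(
        prompt=body["prompt"],
        max_tokens=body.get("max_tokens", 50),
        temperature=body.get("temperature", 0.8),
        stop=body.get("stop", []),
    )
    if body.get("stream"):
        def event_stream():
            # Each chunk is an OpenAI-style completion dict; forward it as SSE.
            for chunk in llm.create_completion(stream=True, **kwargs):
                yield "data: " + json.dumps(chunk) + "\n\n"
            yield "data: [DONE]\n\n"
        return StreamingResponse(event_stream(), media_type="text/event-stream")
    return JSONResponse(llm.create_completion(**kwargs))
```

Since both branches go through the same create_completion call, streaming and non-streaming output stay consistent, which is what the criterion above asks for.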
@olegklimov
That's exactly what I was looking for, thank you for the update.
I'll be reviewing the other open bounties in the coming days as well.
Currently, I'm setting up a custom OS for my new workstation and finalizing the prototype interface for my personal assistant.
If I make significant progress that aligns with the criteria for any of the outstanding bounties, I'll submit an attempt and, if appropriate, a subsequent PR.
Given that I'm working against a deadline, I'm highly motivated to contribute efficiently and effectively.
💡 @ds5t5 submitted a pull request that claims the bounty. You can visit your org dashboard to reward. 👉 @ds5t5: To receive payouts, sign up on Algora, link your Github account and connect with Stripe on your dashboard.
Testing this:
./main -m ./Refact-1_6B-fim/ggml-model-f16.gguf -n 300 -p "write a function to multiple two integers in python" --temp 1.0 --top-p 1.0 --top-k 1 --repeat_penalty 1.0
I see speed:
- 17 tokens/s on my MacBook Air M1,
- 4 tokens/s on Intel Xeon Gold 5315Y @ 3.20GHz
Xeon 5315Y:

| Threads -t N | Speed (tokens/s) |
|---|---|
| -t 2 | 6 |
| -t 4 | 11 |
| -t 8 | 11 |
| -t 16 | 4 |
On the M1, speed doesn't depend on the thread count.
Time to first token, with a 551-token prompt:
- 1172ms on M1
- 25404ms on Xeon 5315Y
I'd say that's the main problem for adoption. A 551-token prompt isn't even that big; normally we have about 1950 tokens.
I tried Starcoder 1b, converted by TabbyML:
https://huggingface.co/TabbyML/StarCoder-1B/tree/main/ggml
"-m", "starcoder-1b-q8_0.gguf",
897.71 ms / 557 tokens ( 1.61 ms per token, 620.47 tokens per second)
1334.68 ms / 49 runs ( 27.24 ms per token, 36.71 tokens per second)
"-m", "./starcoder-1b-f16.gguf",
841.99 ms / 557 tokens ( 1.51 ms per token, 661.53 tokens per second)
243.18 ms / 49 runs ( 45.78 ms per token, 21.84 tokens per second)
"-m", "./Refact-1_6B-fim/ggml-model-f16.gguf",
175.27 ms / 557 tokens ( 2.11 ms per token, 473.93 tokens per second)
962.51 ms / 49 runs ( 60.46 ms per token, 16.54 tokens per second)
@olegklimov I think it has to do with the conversion process. They're looking into it. Typically the smaller models are much faster in llama.cpp.
@olegklimov
> - MacBook Air M1

Try the 4-bit model; you should see a performance boost compared to the 16-bit model.
4-bit
llama_print_timings: load time = 45.88 ms
llama_print_timings: sample time = 3.91 ms / 300 runs ( 0.01 ms per token, 76706.72 tokens per second)
llama_print_timings: prompt eval time = 56.82 ms / 9 tokens ( 6.31 ms per token, 158.38 tokens per second)
llama_print_timings: eval time = 6762.85 ms / 299 runs ( 22.62 ms per token, 44.21 tokens per second)
llama_print_timings: total time = 6933.22 ms
8-bit
llama_print_timings: load time = 71.79 ms
llama_print_timings: sample time = 3.72 ms / 300 runs ( 0.01 ms per token, 80623.49 tokens per second)
llama_print_timings: prompt eval time = 54.23 ms / 9 tokens ( 6.03 ms per token, 165.94 tokens per second)
llama_print_timings: eval time = 11387.12 ms / 299 runs ( 38.08 ms per token, 26.26 tokens per second)
llama_print_timings: total time = 11553.91 ms
16-bit
llama_print_timings: load time = 5828.46 ms
llama_print_timings: sample time = 4.17 ms / 300 runs ( 0.01 ms per token, 71856.29 tokens per second)
llama_print_timings: prompt eval time = 72.36 ms / 9 tokens ( 8.04 ms per token, 124.38 tokens per second)
llama_print_timings: eval time = 20573.06 ms / 299 runs ( 68.81 ms per token, 14.53 tokens per second)
llama_print_timings: total time = 20760.76 ms
The 16-bit and 32-bit converted tensor formats will perform about the same on lower-end hardware.
Also, llama.cpp is still working on its FIM implementation.
In case you aren't too familiar with the library or the quant types: quants range from 2-bit to 16-bit, and k-quant variants are supported.
OK it works nicely! So all the credit goes to @ds5t5, right?
@teleprint-me oh I see you've converted the 1.6b model in several quantizations, thank you for that! (I thought your tests were for llama, the name is confusing)
@ds5t5 Hi there!
We are going to slightly change the modelling code and the corresponding weights on HF. The changes will include:
- combining attn.k and attn.v into attn.kv
- combining mlp.linear_1 and mlp.linear_3 into mlp.gate_up_proj
Guess we need to update https://github.com/ggerganov/llama.cpp/pull/3329 as well
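For reference, a merge like that usually amounts to concatenating the two projection matrices along the output dimension. A hedged sketch of the checkpoint rewrite follows; the state-dict key names and layer prefix are assumptions based on the names above:

```python
# Hedged sketch, not the actual migration script: combine per-layer
# projections into the new merged tensors. Key names are assumptions.
import torch

def merge_layer(sd: dict, prefix: str) -> None:
    """Rewrite one transformer block's weights in-place in state dict `sd`."""
    # attn.k + attn.v -> attn.kv (concatenate along the output dimension)
    sd[f"{prefix}.attn.kv.weight"] = torch.cat(
        [sd.pop(f"{prefix}.attn.k.weight"), sd.pop(f"{prefix}.attn.v.weight")], dim=0
    )
    # mlp.linear_1 + mlp.linear_3 -> mlp.gate_up_proj
    sd[f"{prefix}.mlp.gate_up_proj.weight"] = torch.cat(
        [sd.pop(f"{prefix}.mlp.linear_1.weight"), sd.pop(f"{prefix}.mlp.linear_3.weight")], dim=0
    )

# Hypothetical usage; the layer prefix depends on the actual module layout:
# sd = torch.load("pytorch_model.bin", map_location="cpu")
# for i in range(num_layers):
#     merge_layer(sd, f"transformer.h.{i}")
```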
Thanks. Let me know when the model weights are ready; I will rebase my llama.cpp PR onto the latest branch of llama.cpp.
@JegernOUTT Can I ask why we decided to make the weight change? It doesn't seem quite aligned with other popular models: they (Falcon, LLaMA) usually keep mlp.linear_1 and mlp.linear_3 separate, while for attention it is usually qkv or q/k/v; only the original GPT-2 model uses kv as one.