
[bounty] CPU inference support, Mac M1/M2 inference support

Open olegklimov opened this issue 1 year ago • 45 comments

There are several projects aiming to make inference on CPU efficient.

The first part is research:

  • Which project works better,
  • And is compatible with the Refact license,
  • And doesn't bloat the Docker image too much,
  • And allows using scratchpads similar to how inference_hf.py does it (needs a callback that streams output and allows stopping),
  • Does it include Mac M1/M2 support, or does it make sense to address Mac separately?

Please finish the first part and get a "go-ahead" before starting the second part.

The second part is implementation:

  • A script similar to inference_hf.py (a rough sketch follows this list),
  • Little code,
  • Few dependencies,
  • Demonstrate that it works with the Refact-1.6b model, as well as StarCoder (at least the smaller sizes),
  • Integration with the UI and watchdog is a plus, but efficient inference is obviously the priority.
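
A rough sketch of what such a script's core loop could look like, using llama-cpp-python as one candidate backend; the library choice, the GGUF path, and the callback names are illustrative assumptions, not part of the spec:

from llama_cpp import Llama

# Hypothetical GGUF path; no official Refact GGUF exists at the time of writing.
llm = Llama(model_path="models/refact-1_6b-fim.gguf", n_ctx=2048)

def generate(prompt, on_token, should_stop, max_tokens=256):
    text = ""
    for chunk in llm(prompt, max_tokens=max_tokens, stream=True):
        piece = chunk["choices"][0]["text"]
        text += piece
        on_token(piece)      # stream partial output, like the inference_hf.py callback
        if should_stop():    # let the caller cancel generation mid-way
            break
    return text

if __name__ == "__main__":
    generate("def hello_world():\n",
             on_token=lambda t: print(t, end="", flush=True),
             should_stop=lambda: False)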

olegklimov avatar Aug 25 '23 08:08 olegklimov

/bounty $2000

olegklimov avatar Aug 25 '23 15:08 olegklimov

💎 $2,000 bounty created by olegklimov
🙋 If you start working on this, comment /attempt #77 to notify everyone
👉 To claim this bounty, submit a pull request that includes the text /claim #77 somewhere in its body
📝 Before proceeding, please make sure you can receive payouts in your country
💵 Payment arrives in your account 2-5 days after the bounty is rewarded
💯 You keep 100% of the bounty award
🙏 Thank you for contributing to smallcloudai/refact!

Attempt                Started (GMT+0)               Solution
🔴 @Akshay-Patel-dev   Aug 25, 2023, 11:44:51 PM     WIP
🟢 @shobhit9957        Aug 26, 2023, 10:38:57 AM     WIP
🟢 @benxh1995          Sep 4, 2023, 11:51:23 PM      WIP
🟢 @ds5t5              Sep 25, 2023, 1:52:54 AM      #122

algora-pbc[bot] avatar Aug 25 '23 17:08 algora-pbc[bot]

/attempt #77


Akshay-Patel-dev avatar Aug 25 '23 23:08 Akshay-Patel-dev

/attempt #77 Hey @olegklimov, I would like to contribute. Can you please provide some more description of this project? I'm a beginner here...


shobhit9957 avatar Aug 26 '23 10:08 shobhit9957

Note: The user @Akshay-Patel-dev is already attempting to complete issue #77 and claim the bounty. If you attempt to complete the same issue, there is a chance that @Akshay-Patel-dev will complete the issue first, and be awarded the bounty. We recommend discussing with @Akshay-Patel-dev and potentially collaborating on the same solution versus creating an alternate solution.

algora-pbc[bot] avatar Aug 26 '23 10:08 algora-pbc[bot]

I'm a beginner here...

You can start with installing it and trying it out.

But unless you are already familiar with CPU inference libraries and LLMs in general, the research might take you quite a long time.

olegklimov avatar Aug 26 '23 11:08 olegklimov

I forked the project and performed the steps in the contributing.md file, but I'm getting errors and am unable to run it locally.

shobhit9957 avatar Aug 26 '23 11:08 shobhit9957

I added this because the error I encountered required it: install_requires=[ "triton>=12 0.0.3", ] in the setup.py file. Do you think adding this to the main branch is necessary?

shobhit9957 avatar Aug 26 '23 11:08 shobhit9957

CPU project names: ggml, ctransformers

olegklimov avatar Sep 04 '23 17:09 olegklimov

/attempt #77

I've got a preliminary version working with ctransformers. Inference on my M1 Mac for StarCoder is almost impossibly slow. The Refact-1.6b model still doesn't have GGUF or GGML versions available, and my attempts to make my own quants with the official quantization scripts have failed.

I can have a codellama FIM 7B demo up and running soon.
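
For context, the ctransformers streaming call looks roughly like this; the repo id and thread count are placeholders (since there is no Refact GGML yet, a StarCoder-family checkpoint is assumed):

from ctransformers import AutoModelForCausalLM

# Placeholder GGML repo; substitute whatever StarCoder GGML/GGUF build is available.
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/starcoder-GGML",
    model_type="starcoder",
    threads=8,
)

# Tokens are yielded one by one, so output can be streamed to a client.
for token in llm("def fibonacci(n):", stream=True, max_new_tokens=64):
    print(token, end="", flush=True)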


benxh1995 avatar Sep 04 '23 23:09 benxh1995

Note: The user @shobhit9957 is already attempting to complete issue #77 and claim the bounty. If you attempt to complete the same issue, there is a chance that @shobhit9957 will complete the issue first, and be awarded the bounty. We recommend discussing with @shobhit9957 and potentially collaborating on the same solution versus creating an alternate solution.

algora-pbc[bot] avatar Sep 04 '23 23:09 algora-pbc[bot]

An interesting link: https://github.com/ggerganov/llama.cpp/discussions/2948 -- how to convert HuggingFace model to GGUF format

Example of GGUFs of all sizes: https://huggingface.co/TheBloke/Llama-2-7B-GGUF

olegklimov avatar Sep 09 '23 06:09 olegklimov

@olegklimov

If this is still open, I might try it out.

Would the bounty claim still count for model conversion to GGUF format?

I understand it's first come, first served. I'm just wondering whether you're looking for a conversion script or just general CPU support.

Quantization is a bit different from CPU inference, and I'm just looking for clarity on the scope.

If you just want quantization, then I can look into creating a conversion script, and I'll submit an attempt if I get it working and this is still open.

teleprint-me avatar Sep 18 '23 14:09 teleprint-me

Hi @teleprint-me

Someone is trying the heavy lifting here: https://github.com/ggerganov/llama.cpp/issues/3061

olegklimov avatar Sep 21 '23 06:09 olegklimov

@olegklimov

Yes, I saw that. That's why I'm asking.

I know that in order to do it, one would need to use the GGUF library to convert the tensors.

It would require a custom script, like the others that already exist in the llama.cpp repository.

Your original request was in reference to the inference_hf.py script which is why I was asking for clarification.

teleprint-me avatar Sep 21 '23 21:09 teleprint-me

@teleprint-me We are moving away from server-side scratchpads in favor of client-side scratchpads. The plugins that can do this should land next week or the week after. There still has to be a script that fetches the tasks to do using completions_wait_batch() (in inference_worker.py) and streams the results back, but soon only a simple left-to-right completion will be required.

In short, the requirement "Script similar to inference_hf.py" can now read "Script similar to inference_hf.py, but only /v1/completions needs to work".

Script to test:

curl http://127.0.0.1:8008/v1/completions -k \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "smallcloudai/Refact-1_6b-fim",
  "prompt": "def hello_world():\n    \"\"\"\n    This function prints \"Hello World!!!\" and brews coffee.\n    \"\"\"",
  "stream": true,
  "echo": false,
  "stop": ["\n\n"],
  "temperature": 0.8,
  "max_tokens": 50
}'

Streaming and non-streaming should both work, and the CPU output should match the current GPU output -- sounds like a well-defined criterion.
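
A minimal sketch of a CPU-backed handler for that test, assuming FastAPI plus llama-cpp-python (neither is the decided stack) and a placeholder GGUF path:

import json
from fastapi import FastAPI
from fastapi.responses import JSONResponse, StreamingResponse
from pydantic import BaseModel
from llama_cpp import Llama

app = FastAPI()
llm = Llama(model_path="Refact-1_6B-fim/ggml-model-f16.gguf", n_ctx=2048)  # placeholder path

class CompletionRequest(BaseModel):
    model: str
    prompt: str
    stream: bool = False
    echo: bool = False
    stop: list[str] = []
    temperature: float = 0.8
    max_tokens: int = 50

@app.post("/v1/completions")
def completions(req: CompletionRequest):
    kwargs = dict(max_tokens=req.max_tokens, temperature=req.temperature,
                  stop=req.stop, echo=req.echo)
    if not req.stream:
        # non-streaming: return the whole completion dict as JSON
        return JSONResponse(llm(req.prompt, **kwargs))
    def sse():
        # streaming: forward each chunk as a server-sent event
        for chunk in llm(req.prompt, stream=True, **kwargs):
            yield f"data: {json.dumps(chunk)}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(sse(), media_type="text/event-stream")

# run with: uvicorn server:app --port 8008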

olegklimov avatar Sep 22 '23 07:09 olegklimov

@olegklimov

That's exactly what I was looking for, thank you for the update.

I'll be reviewing the other open bounties in the coming days as well.

Currently, I'm setting up a custom OS for my new workstation and finalizing the prototype interface for my personal assistant.

If I make significant progress that aligns with the criteria for any of the outstanding bounties, I'll submit an attempt and, if appropriate, a subsequent PR.

Given that I'm working against a deadline, I'm highly motivated to contribute efficiently and effectively.

teleprint-me avatar Sep 22 '23 17:09 teleprint-me

/attempt https://github.com/smallcloudai/refact/issues/77


ds5t5 avatar Sep 25 '23 01:09 ds5t5

💡 @ds5t5 submitted a pull request that claims the bounty. You can visit your org dashboard to reward. 👉 @ds5t5: To receive payouts, sign up on Algora, link your Github account and connect with Stripe on your dashboard.

algora-pbc[bot] avatar Sep 25 '23 04:09 algora-pbc[bot]

Testing this:

./main -m ./Refact-1_6B-fim/ggml-model-f16.gguf -n 300 -p "write a function to multiple two integers in python"  --temp 1.0 --top-p 1.0 --top-k 1 --repeat_penalty 1.0

I see speed:

  • 17 tokens/s on my MacBook Air M1,
  • 4 tokens/s on Intel Xeon Gold 5315Y @ 3.20GHz

olegklimov avatar Sep 26 '23 07:09 olegklimov

Xeon 5315Y

Threads (-t N)    Speed (tokens/s)
-t 2              6
-t 4              11
-t 8              11
-t 16             4

The M1's speed doesn't depend on the thread count.
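
A rough sketch of how these numbers could be reproduced, reusing the ./main command from above; the tokens/s figure here is wall-clock over 300 generated tokens, so it includes load and prompt-eval time and only approximates llama.cpp's own eval-time report:

import subprocess, time

PROMPT = "write a function to multiple two integers in python"  # same prompt as above
for threads in (2, 4, 8, 16):
    start = time.perf_counter()
    subprocess.run(
        ["./main", "-m", "./Refact-1_6B-fim/ggml-model-f16.gguf",
         "-n", "300", "-p", PROMPT, "-t", str(threads),
         "--temp", "1.0", "--top-p", "1.0", "--top-k", "1", "--repeat_penalty", "1.0"],
        check=True, capture_output=True,
    )
    elapsed = time.perf_counter() - start
    print(f"-t {threads}: {300 / elapsed:.1f} tokens/s (wall clock, 300 tokens)")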

olegklimov avatar Sep 26 '23 07:09 olegklimov

Time to first token, 551-token prompt:

  • 1172ms on M1
  • 25404ms on Xeon 5315Y

I'd say that's the main obstacle to adoption. A 551-token prompt isn't even that big; normally we have about 1950 tokens.

olegklimov avatar Sep 26 '23 07:09 olegklimov

I tried StarCoder-1B, converted by TabbyML:

https://huggingface.co/TabbyML/StarCoder-1B/tree/main/ggml

"-m", "starcoder-1b-q8_0.gguf",
  897.71 ms /   557 tokens (    1.61 ms per token,   620.47 tokens per second)
 1334.68 ms /    49 runs   (   27.24 ms per token,    36.71 tokens per second)

"-m", "./starcoder-1b-f16.gguf",
  841.99 ms /   557 tokens (    1.51 ms per token,   661.53 tokens per second)
 2243.18 ms /    49 runs   (   45.78 ms per token,    21.84 tokens per second)

"-m", "./Refact-1_6B-fim/ggml-model-f16.gguf",
 1175.27 ms /   557 tokens (    2.11 ms per token,   473.93 tokens per second)
 2962.51 ms /    49 runs   (   60.46 ms per token,    16.54 tokens per second)

olegklimov avatar Sep 26 '23 11:09 olegklimov

@olegklimov I think it has to do with the conversion process. They're looking into it. Typically the smaller models are much faster in llama.cpp.

teleprint-me avatar Sep 26 '23 15:09 teleprint-me

@olegklimov

  • MacBook Air M1

Try the 4-bit model; you should see a performance boost compared to the 16-bit model.

4-bit

llama_print_timings:        load time =    45.88 ms
llama_print_timings:      sample time =     3.91 ms /   300 runs   (    0.01 ms per token, 76706.72 tokens per second)
llama_print_timings: prompt eval time =    56.82 ms /     9 tokens (    6.31 ms per token,   158.38 tokens per second)
llama_print_timings:        eval time =  6762.85 ms /   299 runs   (   22.62 ms per token,    44.21 tokens per second)
llama_print_timings:       total time =  6933.22 ms

8-bit

llama_print_timings:        load time =    71.79 ms
llama_print_timings:      sample time =     3.72 ms /   300 runs   (    0.01 ms per token, 80623.49 tokens per second)
llama_print_timings: prompt eval time =    54.23 ms /     9 tokens (    6.03 ms per token,   165.94 tokens per second)
llama_print_timings:        eval time = 11387.12 ms /   299 runs   (   38.08 ms per token,    26.26 tokens per second)
llama_print_timings:       total time = 11553.91 ms

16-bit

llama_print_timings:        load time =  5828.46 ms
llama_print_timings:      sample time =     4.17 ms /   300 runs   (    0.01 ms per token, 71856.29 tokens per second)
llama_print_timings: prompt eval time =    72.36 ms /     9 tokens (    8.04 ms per token,   124.38 tokens per second)
llama_print_timings:        eval time = 20573.06 ms /   299 runs   (   68.81 ms per token,    14.53 tokens per second)
llama_print_timings:       total time = 20760.76 ms

The 16-bit and 32-bit converted tensor formats will perform about the same on lower-end hardware.

Also, llama.cpp is still working on its FIM implementation.

In case you aren't too familiar with the library or the quant types: quants range from 2-bit to 16-bit, and k-quant variants are supported.

teleprint-me avatar Sep 27 '23 03:09 teleprint-me

OK it works nicely! So all the credit goes to @ds5t5, right?

olegklimov avatar Sep 29 '23 05:09 olegklimov

@teleprint-me oh I see you've converted the 1.6b model in several quantizations, thank you for that! (I thought your tests were for llama, the name is confusing)

olegklimov avatar Sep 29 '23 05:09 olegklimov

@ds5t5 Hi there!

We are going to slightly change the modelling code and, accordingly, the weights on HF. The changes will include:

  • combining attn.k and attn.v into attn.kv
  • combining mlp.linear_1 and mlp.linear_3 into mlp.gate_up_proj

Guess we need to update https://github.com/ggerganov/llama.cpp/pull/3329 as well
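
A hedged sketch of what the weight change could mean for a converted checkpoint; the tensor names, layer count, and concatenation axis are assumptions, and the updated HF modelling code is the authoritative reference:

import torch

def combine_weights(state_dict, n_layers=32):  # layer count is a placeholder
    out = dict(state_dict)
    for i in range(n_layers):
        prefix = f"transformer.h.{i}"  # hypothetical module path
        k = out.pop(f"{prefix}.attn.k.weight", None)
        v = out.pop(f"{prefix}.attn.v.weight", None)
        if k is not None and v is not None:
            # concatenate along the output dimension of the nn.Linear weights
            out[f"{prefix}.attn.kv.weight"] = torch.cat([k, v], dim=0)
        w1 = out.pop(f"{prefix}.mlp.linear_1.weight", None)
        w3 = out.pop(f"{prefix}.mlp.linear_3.weight", None)
        if w1 is not None and w3 is not None:
            out[f"{prefix}.mlp.gate_up_proj.weight"] = torch.cat([w1, w3], dim=0)
    return out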

JegernOUTT avatar Sep 29 '23 06:09 JegernOUTT

Thanks. Let me know when the model weights are ready; I will rebase my llama.cpp PR onto the latest branch of llama.cpp.

ds5t5 avatar Sep 29 '23 07:09 ds5t5

@JegernOUTT Can I ask why we decided to make this weight change? It seems not quite aligned with other popular models: they (Falcon, LLaMA) usually keep mlp.linear_1 and mlp.linear_3 separate, while for attention it is usually qkv or q/k/v. Only the original GPT-2 model uses kv as one tensor.

ds5t5 avatar Sep 29 '23 07:09 ds5t5