mlx-examples
Add support for Cohere's Command-R
This PR adds support for Cohere's Command-R model. Twitter: @Prince_Canuma
Wow! 🚤
Does it work yet?
Not yet, I'll ping you when ready.
I'm still working on it :)
Looking forward to testing the long sequence length performance!
@awni could you give it a try? I would like to see the results.
At the moment, I can only run the 2-bit version due to space.
Looking forward to testing the long sequence length performance!
Me too :)
Yes, it's downloading. I will let you know how it goes.
At the moment, I can only run the 2-bit version due to space.
2-bit almost never works well; I don't recommend even trying it.
2-bit almost never works well; I don't recommend even trying it.
Now I know 😅
The issue is that I currently have a base M1 with 16GB of RAM.
Working on getting a new one very soon.
Yes, it's downloading. I will let you know how it goes.
Thanks, can't wait!
A 4-bit version produces the following. It looks OK, not obviously wrong but not that good either.
python -m mlx_lm.generate --model mlx_model --prompt "Hello, how are you?"
==========
Prompt: Hello, how are you?
I hope you are having a lovely week?
As promised, I’m here to share my latest sewing project, a dress that I made. This one is the first project of the year. A simple dress with the pattern, a simple dress, but very useful. It’s not for the summer, a summer dress, very easy to make, with a pattern that I have found on Etsy, the pattern is a pattern and I think it’s the Simplicity 136
==========
@awni looks like it uses the default chat template in the tokenizer; maybe remove the tokenizer.chat_template is not None check here so the default chat template is used: https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/generate.py#L113C23-L113C36
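In the meantime, a rough workaround for testing (a sketch using the mlx_lm Python API; the model path and max_tokens here are just assumptions, not part of this PR): apply the chat template by hand before generating.

from mlx_lm import load, generate

# Load the converted model (local path assumed from the command above).
model, tokenizer = load("mlx_model")

# Build the prompt with the tokenizer's chat template so the turn tokens are included.
messages = [{"role": "user", "content": "Hello, how are you?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

print(generate(model, tokenizer, prompt=prompt, max_tokens=200))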
At the moment, I can only run the 2-bit version due to space.
2-bit almost never works well; I don't recommend even trying it.
You can try QuIP: https://github.com/Cornell-RelaxML/QuIP
A 4-bit version produces the following. It looks OK, not obviously wrong but not that good either.
As @mzbac indicated, the chat template is missing.
I ran some tests on the transformers implementation of Command-R in 4-bit and got a similar result to yours, @awni. ❌
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "prince-canuma/c4ai-command-r-v01-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_id)
quantized_model = AutoModelForCausalLM.from_pretrained(model_id)
tokens = tokenizer("Hello, how are you?", return_tensors='pt').to('cuda')
outputs = quantized_model.generate(**tokens, max_new_tokens=40)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
====
<BOS_TOKEN>Hello, how are you? I hope you are well. I'm doing okay. I'm still working on my book. I'm in the middle of the second draft. I'm also working on a new
====
Then I created the chat template, added it to the tokenizer, and it worked! ✅
messages = [{"role": "user", "content": "Hello, how are you?"}]
tokens = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to('cuda')
outputs = quantized_model.generate(tokens, max_new_tokens=40)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
<|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|>
<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>Hello! I'm doing well, thank you for asking! I'm excited to assist you and I'm looking forward to hearing your questions. How can I help you today?<|END_OF_TURN_TOKEN|>
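For reference, attaching the template looks roughly like this (continuing from the tokenizer loaded above; the Jinja string below is an illustrative approximation of the turn format shown in the output, not necessarily the exact template I uploaded):

# Illustrative approximation of the Command-R turn format, not the exact template.
chat_template = (
    "{% for message in messages %}"
    "{% if message['role'] == 'user' %}"
    "<|START_OF_TURN_TOKEN|><|USER_TOKEN|>{{ message['content'] }}<|END_OF_TURN_TOKEN|>"
    "{% else %}"
    "<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>{{ message['content'] }}<|END_OF_TURN_TOKEN|>"
    "{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>{% endif %}"
)
tokenizer.chat_template = chat_template
tokenizer.save_pretrained("c4ai-command-r-v01-4bit")  # or tokenizer.push_to_hub(...)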
You can download the updated tokenizer with the chat template here.
Please give it a try and let me know how it goes :)
Awesome, I will try with the chat template! If you are able to upload the 4-bit version w/ the chat template to the MLX Community I think that would be really great.
For this PR could you run the pre-commit hooks for formatting? Otherwise, LGTM, we can merge it!
- Fixed rope to traditional
- Fixed an issue with layer norm upcasting to fp32 (rough sketch below)
- Rebased on main + ran formatting
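To illustrate that layer norm point, here is a rough sketch of the pattern in MLX (an illustration only, not the literal diff from this PR): keep the normalization math in float32 even when the model runs in half precision, then cast back.

import mlx.core as mx
import mlx.nn as nn

class LayerNorm(nn.Module):
    # Sketch: run the normalization in float32 for numerical stability,
    # then cast back to the input dtype (e.g. float16) before returning.
    def __init__(self, dims: int, eps: float = 1e-5):
        super().__init__()
        self.weight = mx.ones((dims,))
        self.eps = eps

    def __call__(self, x):
        input_dtype = x.dtype
        x = x.astype(mx.float32)
        mean = mx.mean(x, axis=-1, keepdims=True)
        var = mx.var(x, axis=-1, keepdims=True)
        x = (x - mean) * mx.rsqrt(var + self.eps)
        return (self.weight * x).astype(input_dtype)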
- Fixed rope to traditional
- Fixed an issue with layer norm upcasting to fp32
- Rebased on main + ran formatting
Thank you very much @awni! Btw, could you explain what the difference is between rope traditional on and off? When should I use one vs the other? Also, what output did you get with it off?
Awesome, I will try with the chat template! If you are able to upload the 4-bit version w/ the chat template to the MLX Community I think that would be really great.
For this PR could you run the pre-commit hooks for formatting? Otherwise, LGTM, we can merge it!
Way ahead of you, already working on it :)
Btw, could you explain what the difference is between rope traditional on and off? When should I use one vs the other? Also, what output did you get with it off?
Check the comments here https://github.com/ml-explore/mlx-examples/pull/565#discussion_r1523275276
Way ahead of you, already working on it :)
Thanks!
Done, the 4-bit model with the updated tokenizer is available on the Hub. Link: mlx-community/c4ai-command-r-v01-4bit
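For anyone who wants to reproduce it, the conversion and upload can be done with something along these lines (flags per mlx_lm's convert script; the source path here is the upstream Cohere HF repo, and the exact invocation may differ):

python -m mlx_lm.convert --hf-path CohereForAI/c4ai-command-r-v01 -q --upload-repo mlx-community/c4ai-command-r-v01-4bit

and then tried with:

python -m mlx_lm.generate --model mlx-community/c4ai-command-r-v01-4bit --prompt "Hello, how are you?"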