
Add support for Cohere's Command-R

Open · Blaizzy opened this issue 1 year ago • 3 comments

This PR adds support for Cohere's Command-R model. Twitter: @Prince_Canuma

Blaizzy avatar Mar 11 '24 21:03 Blaizzy

Wow! 🚤

Does it work yet?

awni avatar Mar 11 '24 21:03 awni

Not yet, I'll ping you when ready.

I'm still working on it :)

Blaizzy avatar Mar 11 '24 21:03 Blaizzy

Looking forward to testing the long sequence length performance!

atiorh avatar Mar 11 '24 23:03 atiorh

@awni could you give it a try? I would like to see the results.

At the moment, I can only run the 2-bit version due to space.

Blaizzy avatar Mar 12 '24 15:03 Blaizzy

Looking forward to testing the long sequence length performance!

Me too :)

Blaizzy avatar Mar 12 '24 15:03 Blaizzy

Yes, it's downloading. I will let you know how it goes.

awni avatar Mar 12 '24 21:03 awni

At the moment, I can only run the 2-bit version due to space.

2-bit almost never works well; I don't recommend even trying it.

awni avatar Mar 12 '24 21:03 awni

2-bit almost never works well; I don't recommend even trying it.

Now I know 😅

The issue is that I currently have a base M1 with 16GB of RAM.

Working on getting a new one very soon.

Blaizzy avatar Mar 12 '24 22:03 Blaizzy

Yes, it's downloading. I will let you know how it goes.

Thanks, can't wait!

Blaizzy avatar Mar 12 '24 22:03 Blaizzy

A 4-bit version produces the following. It looks OK, not obviously wrong but not that good either.

python -m mlx_lm.generate --model mlx_model --prompt "Hello, how are you?"

==========
Prompt: Hello, how are you?
 I hope you are having a lovely week?
As promised, I’m here to share my latest sewing project, a dress that I made. This one is the first project of the year. A simple dress with the pattern, a simple dress, but very useful. It’s not for the summer, a summer dress, very easy to make, with a pattern that I have found on Etsy, the pattern is a pattern and I think it’s the Simplicity 136
==========
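For anyone reproducing this: the 4-bit weights can be produced with mlx_lm's convert script. A minimal sketch, assuming the usual flags (check your installed version):

# Sketch: download the HF checkpoint and quantize it to 4-bit for MLX.
python -m mlx_lm.convert --hf-path CohereForAI/c4ai-command-r-v01 -q --q-bits 4 --mlx-path mlx_model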

awni avatar Mar 12 '24 23:03 awni

@awni it looks like it uses the default chat template in the tokenizer. Maybe remove the tokenizer.chat_template is not None check here so the default chat template is used: https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/generate.py#L113C23-L113C36
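Roughly, the check at that line looks like this (a sketch of the logic, not the exact source):

# Sketch of the prompt handling around the linked line in mlx_lm/generate.py.
if hasattr(tokenizer, "apply_chat_template") and tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": args.prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
else:
    prompt = args.prompt

Dropping the tokenizer.chat_template is not None part would let apply_chat_template fall back to the tokenizer's built-in default template.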

mzbac avatar Mar 12 '24 23:03 mzbac

At the moment, I can only run the 2-bit version due to space.

2-bit almost never works well; I don't recommend even trying it.

You can try QuIP: https://github.com/Cornell-RelaxML/QuIP

saurabhdash avatar Mar 12 '24 23:03 saurabhdash

A 4-bit version produces the following. It looks OK, not obviously wrong but not that good either.

python -m mlx_lm.generate --model mlx_model --prompt "Hello, how are you?"

==========
Prompt: Hello, how are you?
 I hope you are having a lovely week?
As promised, I’m here to share my latest sewing project, a dress that I made. This one is the first project of the year. A simple dress with the pattern, a simple dress, but very useful. It’s not for the summer, a summer dress, very easy to make, with a pattern that I have found on Etsy, the pattern is a pattern and I think it’s the Simplicity 136
==========

As @mzbac indicated, the chat template is missing.

I ran some tests on the transformers implementation of Command-R in 4-bit and got a similar result to yours, @awni. ❌

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "prince-canuma/c4ai-command-r-v01-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map places the quantized weights on the GPU to match the input tensors
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda")

# Plain prompt, no chat template applied
tokens = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")
outputs = quantized_model.generate(**tokens, max_new_tokens=40)

print("Output:\n" + 100 * "-")
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
====
<BOS_TOKEN>Hello, how are you? I hope you are well. I'm doing okay. I'm still working on my book. I'm in the middle of the second draft. I'm also working on a new
====

Then I created the chat template, added it to the tokenizer, and it worked! ✅

messages = [{"role": "user", "content": "Hello, how are you?"}]

# The chat template wraps the prompt in Command-R's turn tokens
tokens = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
# apply_chat_template returns the input ids directly, so pass them positionally
outputs = quantized_model.generate(tokens, max_new_tokens=40)

print("Output:\n" + 100 * "-")
print(tokenizer.decode(outputs[0], skip_special_tokens=False))

<|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|>
<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>Hello! I'm doing well, thank you for asking! I'm excited to assist you and I'm looking forward to hearing your questions. How can I help you today?<|END_OF_TURN_TOKEN|>
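For reference, attaching a template looks roughly like this; the Jinja string below is a simplified stand-in built from the turn tokens above, not Cohere's official template:

# Assumed, simplified template: wraps each message in Command-R's turn tokens.
tokenizer.chat_template = (
    "{{ bos_token }}"
    "{% for message in messages %}"
    "<|START_OF_TURN_TOKEN|>"
    "{{ '<|USER_TOKEN|>' if message['role'] == 'user' else '<|CHATBOT_TOKEN|>' }}"
    "{{ message['content'] }}<|END_OF_TURN_TOKEN|>"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>{% endif %}"
)
tokenizer.save_pretrained("c4ai-command-r-v01-4bit")  # or tokenizer.push_to_hub(model_id)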

You can download the updated tokenizer with the chat template here.

Please give it a try and let me know how it goes :)

Blaizzy avatar Mar 13 '24 10:03 Blaizzy

Awesome, I will try with the chat template! If you are able to upload the 4-bit version w/ the chat template to the MLX Community I think that would be really great.

For this PR, could you run the pre-commit hooks for formatting? Otherwise LGTM, we can merge it!
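One way to do the upload, sketched under the assumption that mlx_lm.convert's --upload-repo option is available in your version:

# Sketch: quantize and push directly to the mlx-community org on the Hub.
python -m mlx_lm.convert --hf-path CohereForAI/c4ai-command-r-v01 -q --q-bits 4 --upload-repo mlx-community/c4ai-command-r-v01-4bit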

awni avatar Mar 13 '24 13:03 awni

  • Fixed rope to traditional
  • Fixed an issue with layer norm upcasting to fp32
  • Rebased on main + ran formatting

awni avatar Mar 13 '24 13:03 awni

  • Fixed rope to traditional
  • Fixed an issue with layer norm upcasting to fp32
  • Rebased on main + ran formatting

Thank you very much @awni! Btw, could you explain the difference between rope traditional on and off? When should I use one vs. the other? Also, what output did you get with it off?

Awesome, I will try with the chat template! If you are able to upload the 4-bit version w/ the chat template to the MLX Community I think that would be really great.

For this PR, could you run the pre-commit hooks for formatting? Otherwise LGTM, we can merge it!

Way ahead of you, already working on it :)

Blaizzy avatar Mar 13 '24 15:03 Blaizzy

Btw, could you explain the difference between rope traditional on and off? When should I use one vs. the other? Also, what output did you get with it off?

Check the comments here https://github.com/ml-explore/mlx-examples/pull/565#discussion_r1523275276
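In short: traditional RoPE rotates consecutive dimension pairs, while the non-traditional default pairs each dimension with the one half a head away, so a checkpoint trained with one convention can produce degraded output under the other. A minimal sketch, assuming MLX's nn.RoPE API:

import mlx.core as mx
import mlx.nn as nn

head_dim = 8
x = mx.random.normal((1, 4, head_dim))  # (batch, seq_len, head_dim)

# traditional=True rotates consecutive pairs (x[..., 2i], x[..., 2i+1]);
# the default pairs x[..., i] with x[..., i + head_dim // 2].
# Use whichever convention matches the checkpoint's training.
print(nn.RoPE(head_dim, traditional=True)(x))
print(nn.RoPE(head_dim, traditional=False)(x))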

Way ahead of you, already working on it :)

Thanks!

awni avatar Mar 13 '24 15:03 awni

Done, the 4-bit model with the updated tokenizer is available on the Hub. Link: mlx-community/c4ai-command-r-v01-4bit
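It can be tried directly with the same generate command as above:

python -m mlx_lm.generate --model mlx-community/c4ai-command-r-v01-4bit --prompt "Hello, how are you?"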

Blaizzy avatar Mar 13 '24 16:03 Blaizzy