                        Command-R-Plus, Context Window Limitations
Cohere's new Command-R-Plus model reportedly features a 128k context window. However, testing with progressively longer prompts reveals it begins producing nonsensical output (e.g., "<PAD><PAD>...") after 8192 tokens, aligning with the "max_position_embeddings" value in the config.json file. The config also lists a "rope_theta" value, suggesting it plays a role in achieving the large context window. Is RoPE supported in MLX?
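For reference, a minimal sketch (the local path is an assumption) of pulling the two relevant fields out of config.json:

```python
# Sketch: inspect the context-related fields in the model's config.json.
# The local path is illustrative; adjust to wherever the repo is downloaded.
import json

with open("c4ai-command-r-plus/config.json") as f:
    config = json.load(f)

print(config.get("max_position_embeddings"))  # 8192 in the shipped config
print(config.get("rope_theta"))               # large RoPE base for long context
```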
However, testing with progressively longer prompts reveals it begins producing nonsensical output (e.g., "<PAD><PAD>...") after 8192 tokens, aligning with the "max_position_embeddings" value in the config.json file
🤔 not sure what would cause that. Do you have a prompt that should work in the MLX version but doesn't? Also, if you can provide some expected output, that would be helpful.
The config also lists a "rope_theta" value, suggesting it plays a role in achieving the large context window. Is RoPE supported in MLX?
MLX has RoPE and it should be used correctly already.
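A minimal sketch of what that looks like; head_dim and the theta value here are placeholders for illustration, not the model's actual wiring:

```python
import mlx.core as mx
import mlx.nn as nn

head_dim = 128          # placeholder; the real value comes from the model config
rope_theta = 8_000_000  # placeholder; read the real base from config.json

# MLX's RoPE takes the base (theta) directly; larger bases stretch the
# usable position range.
rope = nn.RoPE(head_dim, traditional=False, base=rope_theta)

x = mx.random.normal((1, 8, 16, head_dim))  # (batch, heads, seq_len, head_dim)
y = rope(x, offset=0)  # offset continues positions when using a KV cache
```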
I'm getting random Cyrillic in my responses when using tokenizer.apply_tool_use_template. Anyone else? Seems to only be when using that tool template from the tokenizer.
Example output:
Write 'Action:' followed by a json-formatted list of actions that you want to perform in order to produce a good response to the user's last input. You can use any of the supplied tools any number of times, but you should aim to execute the minimum number of necessary actions for the input. You should use the directly-answer tool if calling the other tools is unnecessary. The list of actions you want to call should be formatted as a list of json objects, for example:
[
    {
        "tool_name": title of the tool in the specification,
        "parameters": a dict of parameters to input into the tool as they are defined in the specs, or {} if it takes no parameters
    }
]```<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
Action: ```json
[
    {
        "tool некоторыми": {},
        "tool_name": "internet_search"
    } forniscono]
```<EOS_TOKEN>
Ignore that; I was calling the tokenizer twice. Fixed it in my code here for anyone who wants to test tool use (apologies in advance if there are bugs still lurking): https://github.com/fblissjr/mlx-funbox
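For anyone curious, the pitfall was roughly the following (the model path and tool spec are illustrative, and I'm assuming apply_tool_use_template forwards its kwargs to apply_chat_template):

```python
from transformers import AutoTokenizer
from mlx_lm import generate
from mlx_lm.utils import load_model, get_model_path

PATH_MODEL = "mlx-community/c4ai-command-r-plus-4bit"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(PATH_MODEL)
model = load_model(get_model_path(PATH_MODEL))

conversation = [{"role": "user", "content": "What's the weather in Toronto?"}]
tools = [{  # illustrative tool spec in Cohere's documented format
    "name": "internet_search",
    "description": "Search the web for a query.",
    "parameter_definitions": {
        "query": {"description": "Search query", "type": "str", "required": True},
    },
}]

# Buggy pattern: tokenize=True already returns token ids; passing them where
# a string prompt is expected means the prompt is effectively tokenized twice.
# prompt = tokenizer.apply_tool_use_template(conversation, tools=tools, tokenize=True)

# Fix: render the template to a string once and let generate() tokenize it.
prompt = tokenizer.apply_tool_use_template(conversation, tools=tools, tokenize=False)
response = generate(model, tokenizer, prompt, max_tokens=256)
```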
Looks like there's still random switching into other languages and random Cyrillic (using a simple generate + apply tool template). Has anyone tested on CUDA to see if it behaves similarly?
Looks like it's the tokenizer.json that is not converting correctly. Compare the tokenizer.json from the Cohere HF model repo with a freshly converted one (mlx_lm.convert -q, no other params) I just did from that same repo 20 minutes ago; the converted one also matches the tokenizer.json from the mlx-community quant uploaded earlier (mlx-community/c4ai-command-r-plus-4bit).
Copying the original cohere tokenizer.json (https://huggingface.co/CohereForAI/c4ai-command-r-plus/blob/main/tokenizer.json) fixes this issue completely from my testing (output generation is slow, but so far so good!)
My guess is something is happening in the mlx_lm.convert process due to the large size of the vocab + the multilingual nature of the tokenizer + the strange tokenizer.json formatting.
edit: generation speed is also slightly faster now due to the correct tokenizer being used.
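If anyone wants to apply the same workaround, a sketch (the destination is whatever directory mlx_lm.convert wrote; "mlx_model" is its default):

```python
# Sketch: overwrite the converted tokenizer.json with the original one
# from the Cohere repo.
import shutil
from huggingface_hub import hf_hub_download

src = hf_hub_download(
    repo_id="CohereForAI/c4ai-command-r-plus",
    filename="tokenizer.json",
)
shutil.copy(src, "mlx_model/tokenizer.json")
```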
That is very odd. The tokenizer copying is very simple in MLX LM. We basically load with Hugging Face and then save it with Hugging Face. There is no MLX code involved. https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/utils.py#L619
I wonder if we are somehow using the API incorrectly or maybe there is a bug in the way it's saved with Transformers.
@fblissjr you can reproduce the behavior with:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-plus")
tokenizer.save_pretrained(".")
I feel that should not break the tokenizer... so it might be worth filing an issue with the Cohere HF repo or the Transformers repo? Wdyt?
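A sketch of a quick round-trip check (the sample string is arbitrary):

```python
# Sketch: if save_pretrained is lossless, the reloaded tokenizer should
# encode mixed-script text identically to the original.
from transformers import AutoTokenizer

orig = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-plus")
orig.save_pretrained("roundtrip")
reloaded = AutoTokenizer.from_pretrained("roundtrip")

sample = "Räksmörgås, здравствуйте, 你好, hello"
assert orig.encode(sample) == reloaded.encode(sample)
```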
@awni my guess is the latter. It looks more like it's saved incorrectly (and oddly, judging just by looking at it) in the HF repo. I haven't seen a tokenizer.json like this before. Here's a quick sample (roughly one page) of the tokenizer.json from https://huggingface.co/CohereForAI/c4ai-command-r-plus/blob/main/tokenizer.json:
{"version": "1.0", "truncation": null, "padding": null, "added_tokens": [{"id": 0, "content": "<PAD>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true}, {"id": 1, "content": "<UNK>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true}, {"id": 2, "content": "<CLS>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true}, {"id": 3, "content": "<SEP>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true}, {"id": 4, "content": "<MASK_TOKEN>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true}, {"id": 5, "content": "<BOS_TOKEN>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true}, {"id": 6, "content": "<EOS_TOKEN>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true}, {"id": 7, "content": "<EOP_TOKEN>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true}, {"id": 255000, "special": false, "content": "<|START_OF_TURN_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255001, "special": false, "content": "<|END_OF_TURN_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255002, "special": false, "content": "<|YES_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255003, "special": false, "content": "<|NO_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255004, "special": false, "content": "<|GOOD_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255005, "special": false, "content": "<|BAD_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255006, "special": false, "content": "<|USER_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255007, "special": false, "content": "<|CHATBOT_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255008, "special": false, "content": "<|SYSTEM_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255009, "special": false, "content": "<|USER_0_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255010, "special": false, "content": "<|USER_1_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255011, "special": false, "content": "<|USER_2_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255012, "special": false, "content": "<|USER_3_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255013, "special": false, "content": "<|USER_4_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255014, "special": false, "content": "<|USER_5_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255015, "special": false, "content": "<|USER_6_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255016, "special": false, "content": "<|USER_7_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255017, "special": false, "content": 
"<|USER_8_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255018, "special": false, "content": "<|USER_9_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255019, "special": false, "content": "<|EXTRA_0_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255020, "special": false, "content": "<|EXTRA_1_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255021, "special": false, "content": "<|EXTRA_2_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255022, "special": false, "content": "<|EXTRA_3_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255023, "special": false, "content": "<|EXTRA_4_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255024, "special": false, "content": "<|EXTRA_5_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255025, "special": false, "content": "<|EXTRA_6_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255026, "special": false, "content": "<|EXTRA_7_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255027, "special": false, "content": "<|EXTRA_8_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255028, "special": false, "content": "<|EXTRA_9_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}], "normalizer": {"type": "NFC"}, "pre_tokenizer": {"type": "Sequence", "pretokenizers": [{"type": "Digits", "individual_digits": true}, {"type": "ByteLevel", "add_prefix_space": false, "trim_offsets": true, "use_regex": true}]}, "post_processor": {"add_prefix_space": true, "trim_offsets": false, "use_regex": true, "type": "TemplateProcessing", "single": [{"SpecialToken": {"id": "<BOS_TOKEN>", "type_id": 0}}, {"Sequence": {"id": "A", "type_id": 0}}, {"SpecialToken": {"id": "<|END_OF_TURN_TOKEN|>", "type_id": 1}}, {"SpecialToken": {"id": "<EOS_TOKEN>", "type_id": 1}}], "pair": [{"SpecialToken": {"id": "<BOS_TOKEN>", "type_id": 0}}, {"Sequence": {"id": "A", "type_id": 0}}, {"Sequence": {"id": "B", "type_id": 1}}, {"SpecialToken": {"id": "<|END_OF_TURN_TOKEN|>", "type_id": 1}}, {"SpecialToken": {"id": "<EOS_TOKEN>", "type_id": 1}}], "special_tokens": {"<BOS_TOKEN>": {"id": "<BOS_TOKEN>", "ids": [5], "tokens": ["<BOS_TOKEN>"]}, "<EOS_TOKEN>": {"id": "<EOS_TOKEN>", "ids": [6], "tokens": ["<EOS_TOKEN>"]}, "<|END_OF_TURN_TOKEN|>": {"id": "<|END_OF_TURN_TOKEN|>", "ids": [255001], "tokens": ["<|END_OF_TURN_TOKEN|>"]}}}, "decoder": {"type": "ByteLevel", "add_prefix_space": true, "trim_offsets": true, "use_regex": true}, "model": {"type": "BPE", "dropout": null, "unk_token": null, "continuing_subword_prefix": null, "end_of_word_suffix": null, "fuse_unk": false, "byte_fallback": false, "vocab": {"<PAD>": 0, "<UNK>": 1, "<CLS>": 2, "<SEP>": 3, "<MASK_TOKEN>": 4, "<BOS_TOKEN>": 5, "<EOS_TOKEN>": 6, "<EOP_TOKEN>": 7, "!": 8, """: 9, "#": 10, "$": 11, "%": 12, "&": 13, "'": 14, "(": 15, ")": 16, "*": 17, "+": 18, ",": 19, "-": 20, ".": 21, "/": 22, "0": 23, "1": 24, "2": 25, "3": 26, "4": 27, "5": 28, "6": 29, "7": 30, "8": 31, "9": 32, ":": 33, ";": 34, "<": 35, "=": 36, ">": 37, "?": 38, "@": 39, "A": 40, "B": 41, "C": 42, "D": 43, "E": 44, "F": 45, "G": 46, "H": 47, 
"I": 48, "J": 49, "K": 50, "L": 51, "M": 52, "N": 53, "O": 54, "P": 55, "Q": 56, "R": 57, "S": 58, "T": 59, "U": 60, "V": 61, "W": 62, "X": 63, "Y": 64, "Z": 65, "[": 66, "\": 67, "]": 68, "^": 69, "_": 70, "`": 71, "a": 72, "b": 73, "c": 74, "d": 75, "e": 76, "f": 77, "g": 78, "h": 79, "i": 80, "j": 81, "k": 82, "l": 83, "m": 84, "n": 85, "o": 86, "p": 87, "q": 88, "r": 89, "s": 90, "t": 91, "u": 92, "v": 93, "w": 94, "x": 95, "y": 96, "z": 97, "{": 98, "|": 99, "}": 100, "~": 101, "\u00a1": 102, "\u00a2": 103, "\u00a3": 104, "\u00a4": 105, "\u00a5": 106, "\u00a6": 107, "\u00a7": 108, "\u00a8": 109, "\u00a9": 110, "\u00aa": 111, "\u00ab": 112, "\u00ac": 113, "\u00ae": 114, "\u00af": 115, "\u00b0": 116, "\u00b1": 117, "\u00b2": 118, "\u00b3": 119, "\u00b4": 120, "\u00b5": 121, "\u00b6": 122, "\u00b7": 123, "\u00b8": 124, "\u00b9": 125, "\u00ba": 126, "\u00bb": 127, "\u00bc": 128, "\u00bd": 129, "\u00be": 130, "\u00bf": 131, "\u00c0": 132, "\u00c1": 133, "\u00c2": 134, "\u00c3": 135, "\u00c4": 136, "\u00c5": 137, "\u00c6": 138, "\u00c7": 139, "\u00c8": 140, "\u00c9": 141, "\u00ca": 142, "\u00cb": 143, "\u00cc": 144, "\u00cd": 145, "\u00ce": 146, "\u00cf": 147, "\u00d0": 148, "\u00d1": 149, "\u00d2": 150, "\u00d3": 151, "\u00d4": 152, "\u00d5": 153, "\u00d6": 154, "\u00d7": 155, "\u00d8": 156, "\u00d9": 157, "\u00da": 158, "\u00db": 159, "\u00dc": 160, "\u00dd": 161, "\u00de": 162, "\u00df": 163, "\u00e0": 164, "\u00e1": 165, "\u00e2": 166, "\u00e3": 167, "\u00e4": 168, "\u00e5": 169, "\u00e6": 170, "\u00e7": 171, "\u00e8": 172, "\u00e9": 173, "\u00ea": 174, "\u00eb": 175, "\u00ec": 176, "\u00ed": 177, "\u00ee": 178, "\u00ef": 179, "\u00f0": 180, "\u00f1": 181, "\u00f2": 182, "\u00f3": 183, "\u00f4": 184, "\u00f5": 185, "\u00f6": 186, "\u00f7": 187, "\u00f8": 188, "\u00f9": 189, "\u00fa": 190, "\u00fb": 191, "\u00fc": 192, "\u00fd": 193, "\u00fe": 194, "\u00ff": 195, "\u0100": 196, "\u0101": 197, "\u0102": 198, "\u0103": 199, "\u0104": 200, "\u0105": 201, "\u0106": 202, "\u0107": 203, "\u0108": 204, "\u0109": 205, "\u010a": 206, "\u010b": 207, "\u010c": 208, "\u010d": 209, "\u010e": 210, "\u010f": 211, "\u0110": 212, "\u0111": 213, "\u0112": 214, "\u0113": 215, "\u0114": 216, "\u0115": 217, "\u0116": 218, "\u0117": 219, "\u0118": 220, "\u0119": 221, "\u011a": 222, "\u011b": 223, "\u011c": 224, "\u011d": 225, "\u011e": 226, "\u011f": 227, "\u0120": 228, "\u0121": 229, "\u0122": 230, "\u0123": 231, "\u0124": 232, "\u0125": 233, "\u0126": 234, "\u0127": 235, "\u0128": 236, "\u0129": 237, "\u012a": 238, "\u012b": 239, "\u012c": 240, "\u012d": 241, "\u012e": 242, "\u012f": 243, "\u0130": 244, "\u0131": 245, "\u0132": 246, "\u0133": 247, "\u0134": 248, "\u0135": 249, "\u0136": 250, "\u0137": 251, "\u0138": 252, "\u0139": 253, "\u013a": 254, "\u013b": 255, "\u013c": 256, "\u013d": 257, "\u013e": 258, "\u013f": 259, "\u0140": 260, "\u0141": 261, "\u0142": 262, "\u0143": 263, "\u200d": 264, "\u203c": 265, "\u2049": 266, "\u20e3": 267, "\u2122": 268, "\u2139": 269, "\u2194": 270, "\u2195": 271, "\u2196": 272, "\u2197": 273, "\u2198": 274, "\u2199": 275, "\u21a9": 276, "\u21aa": 277, "\u231a": 278, "\u231b": 279, "\u2328": 280, "\u23cf": 281, "\u23e9": 282, "\u23ea": 283, "\u23eb": 284, "\u23ec": 285, "\u23ed": 286, "\u23ee": 287, "\u23ef": 288, "\u23f0": 289, "\u23f1": 290, "\u23f2": 291, "\u23f3": 292, "\u23f8": 293, "\u23f9": 294, "\u23fa": 295, "\u24c2": 296, "\u25aa": 297, "\u25ab": 298, "\u25b6": 299, "\u25c0": 300, "\u25fb": 301, "\u25fc": 302, "\u25fd": 303, "\u25fe": 304, "\u2600": 305, 
"\u2601": 306, "\u2602": 307, "\u2603": 308, "\u2604": 309, "\u260e": 310, "\u2611": 311, "\u2614": 312, "\u2615": 313, "\u2618": 314, "\u261d": 315, "\u2620": 316, "\u2622": 317, "\u2623": 318, "\u2626": 319, "\u262a": 320, "\u262e": 321, "\u262f": 322, "\u2638": 323, "\u2639": 324, "\u263a": 325, "\u2640": 326, "\u2642": 327, "\u2648": 328, "\u2649": 329, "\u264a": 330, "\u264b": 331, "\u264c": 332, "\u264d": 333, "\u264e": 334, "\u264f": 335, "\u2650": 336, "\u2651": 337, "\u2652": 338, "\u2653": 339, "\u265f": 340, "\u2660": 341, "\u2663": 342, "\u2665": 343, "\u2666": 344, "\u2668": 345, "\u267b": 346, "\u267e": 347, "\u267f": 348, "\u2692": 349, "\u2693": 350, "\u2694": 351, "\u2695": 352, "\u2696": 353, "\u2697": 354, "\u2699": 355, "\u269b": 356, "\u269c": 357, "\u26a0": 358, "\u26a1": 359, "\u26a7": 360, "\u26aa": 361, "\u26ab": 362, "\u26b0": 363, "\u26b1": 364, "\u26bd": 365, "\u26be": 366, "\u26c4": 367, "\u26c5": 368, "\u26c8": 369, "\u26ce": 370, "\u26cf": 371, "\u26d1": 372, "\u26d3": 373, "\u26d4": 374, "\u26e9": 375, "\u26ea": 376, "\u26f0": 377, "\u26f1": 378, "\u26f2": 379, "\u26f3": 380, "\u26f4": 381, "\u26f5": 382, "\u26f7": 383, "\u26f8": 384, "\u26f9": 385, "\u26fa": 386, "\u26fd": 387, "\u2702": 388, "\u2705": 389, "\u2708": 390, "\u2709": 391, "\u270a": 392, "\u270b": 393, "\u270c": 394, "\u270d": 395, "\u270f": 396, "\u2712": 397, "\u2714": 398, "\u2716": 399, "\u271d": 400, "\u2721": 401, "\u2728": 402, "\u2733": 403, "\u2734": 404, "\u2744": 405, "\u2747": 406, "\u274c": 407, "\u274e": 408, "\u2753": 409, "\u2754": 410, "\u2755": 411, "\u2757": 412, "\u2763": 413, "\u276
@fblissjr you can reproduce the behavior with:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-plus")
tokenizer.save_pretrained(".")
I feel that should not break the tokenizer... so it might be worth filing an issue with the Cohere HF repo or the Transformers repo? Wdyt?
Agreed. I made a community post on HF here: https://huggingface.co/CohereForAI/c4ai-command-r-plus/discussions/15
and here: https://github.com/huggingface/transformers/pull/30027
So this is interesting: the tokenizer.json in the bitsandbytes repo linked from the main Cohere repo is a different size and looks nothing like the original. https://huggingface.co/CohereForAI/c4ai-command-r-plus-4bit/blob/main/tokenizer.json
Another interesting difference between the 4-bit bnb tokenizer and the original: in the original, token id 255001 (<|END_OF_TURN_TOKEN|>) has special set to False, while in the 4-bit bnb one it's True.
Per comments on the Hugging Face repo, the differences between the two tokenizer.json files are just unicode-escaping differences. I'll assume I've got a bug on my end unless anyone else sees the same.
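One way to confirm they are escaping-only differences is to parse both files and compare the parsed objects; the paths here are illustrative:

```python
# Sketch: json.load normalizes \uXXXX escapes and raw UTF-8 to the same
# Python strings, so equality here means only the escaping differed.
import json

with open("original/tokenizer.json") as f:    # from the Cohere HF repo
    a = json.load(f)
with open("converted/tokenizer.json") as f:   # from the mlx_lm.convert output
    b = json.load(f)

print(a == b)  # True -> byte-level differences are pure unicode escaping
```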
# Libraries
from transformers import AutoTokenizer
import mlx.core as mx
import mlx_lm
from mlx_lm.utils import load_model, get_model_path
# Language Model
PATH_MODEL = "/Users/admin/Models/CohereForAI/c4ai-command-r-plus-4bit/"
# Load the model & tokenizer
tokenizer = AutoTokenizer.from_pretrained(PATH_MODEL)
model = load_model(get_model_path(PATH_MODEL))
# Incrementally longer texts
...
text_7500_tokens = "Lorem ipsum dolor sit..."    # Works
text_8500_tokens = "Lorem ipsum dolor sit..."    # Stops working
...
# Format as list of messages
messages = [
    {"role": "user", "content": f"{text_8500_tokens}\n\nSummarize the text above in one short paragraph."}    # <-- set a text
]
# Apply chat template
prompt_decorated = tokenizer.apply_chat_template(
    messages, 
    tokenize=False, 
    add_generation_prompt=True
)
# Generate
response = mlx_lm.generate(
    model,
    tokenizer,
    prompt_decorated,
    temp=0.0,
    max_tokens=64
)
This is what I have been using; I removed the texts, which are just some random Wikipedia pages. The output is good until I try the 8500-token text, which just outputs <PAD><PAD><PAD><PAD><PAD>...
Have you tried with apply_tool_use_template by chance? Curious if you see any of the oddities I see when using it.
Hey guys @awni, @fblissjr and @jeanromainroy,
The Cohere team limited the context to 8k for all Command-R variants on purpose. If you check the config file for both R-v01 and R+, max_position_embeddings is set to 8192.
It's a limit meant to keep users from hitting OOM errors.
You can read more here: https://huggingface.co/CohereForAI/c4ai-command-r-v01/discussions/12
Hey @Blaizzy, I have run the exact same test with the new llama.cpp implementation of Command-R+ and it works way above 8k tokens.
Copying the original cohere tokenizer.json (https://huggingface.co/CohereForAI/c4ai-command-r-plus/blob/main/tokenizer.json) fixes this issue completely from my testing (output generation is slow, but so far so good!)
My guess is something is happening in the mlx_lm.convert process due to the large size of the vocab + the multilingual nature of the tokenizer + the strange tokenizer.json formatting.
edit: generation speed is also slightly faster now due to the correct tokenizer being used.
@fblissjr Indeed, the tokenizer created by the conversion is slightly smaller (by ~2 MB) than the original.
I updated as you suggested. Can you check it?
Hey @Blaizzy, I have run the exact same test with the new llama.cpp implementation of Command-R+ and it works way above 8k tokens.
@jeanromainroy can you try again with the change in this branch, if it works I will make a PR.
pip install -U git+https://github.com/Blaizzy/mlx-examples.git@pc/commandR#subdirectory=llms --use-pep517 
Link: https://github.com/Blaizzy/mlx-examples/tree/pc/commandR
You can also try to increase the default max_position_embeddings and let me know if it works.
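For anyone trying that, a sketch of bumping the value in a local copy of the config (path illustrative; 131072 = 128k):

```python
# Sketch: raise max_position_embeddings in the converted model's config.json
# before loading, to test generation past 8192 tokens.
import json

path = "mlx_model/config.json"
with open(path) as f:
    config = json.load(f)
config["max_position_embeddings"] = 131072  # 128k
with open(path, "w") as f:
    json.dump(config, f, indent=2)
```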
@fblissjr Indeed, the tokenizer created by the conversion is slightly smaller (by ~2 MB) than the original.
I updated as you suggested. Can you check it?
Actually, I did this myself yesterday with my own quant, and the output was better and faster; no idea why. And now I'm unsure whether I just had a bug somewhere on my end or whether it actually made a difference.
I'm planning to test out a larger CUDA machine later today or tomorrow to see how it works natively.
Let me know how it goes, but for now, according to your report, the issue should be fixed.
Hey @Blaizzy , I tried your fork and the model is still outputting <PAD><PAD><PAD>... when I provide a long prompt.
Hey @Blaizzy , I tried your fork and the model is still outputting <PAD><PAD><PAD>... when I provide a long prompt.
I have made a new change, can you try it again please :)
Wait, I think I got it!
Give me 30 min :)
@jeanromainroy can you try this branch, the previous one had a git issue:
https://github.com/Blaizzy/mlx-examples/tree/pc/command-R
Still outputting <PAD><PAD><PAD>... :(
Only PAD ? Can you share the whole output?
It's outputting <PAD> for as long as I let it. In other words, max_tokens=256 results in 256 × <PAD>.
Got it!
@awni the Cohere team added model_max_length set to 128K on both Command-R models.
Is there a way of using this number with nn.RoPE? Are any deep changes needed? If so, please point me to them and I can work on it.
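For what it's worth, nn.RoPE computes the rotation from the position index directly (the sequence index plus an offset), so the module itself has no built-in 8k ceiling; any cap comes from code that reads max_position_embeddings. A quick sketch, with the base value as an assumption:

```python
import mlx.core as mx
import mlx.nn as nn

# Sketch: positions well past 8192 work fine at the RoPE level.
rope = nn.RoPE(128, traditional=False, base=8_000_000)  # base is a placeholder
step = mx.random.normal((1, 1, 1, 128))                 # one decode step
out = rope(step, offset=10_000)  # position 10k rotates without any hard limit
```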