
Command-R-Plus, Context Window Limitations

jeanromainroy opened this issue 1 year ago · 42 comments

Cohere's new Command-R-Plus model reportedly features a 128k context window. However, testing with progressively longer prompts reveals that it begins producing nonsensical output (e.g., "<PAD><PAD>...") beyond 8192 tokens, which aligns with the "max_position_embeddings" value in the config.json file. The config also lists a "rope_theta" value, suggesting it plays a role in achieving the large context window. Is RoPE supported in MLX?
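
For reference, a minimal sketch of checking those values (the model path here is hypothetical):

# Inspect the context-related fields in the model's config.json
import json

with open("/path/to/c4ai-command-r-plus/config.json") as f:
    config = json.load(f)

print(config.get("max_position_embeddings"))  # 8192 in the published config
print(config.get("rope_theta"))               # RoPE base frequency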

jeanromainroy avatar Apr 08 '24 04:04 jeanromainroy

However, testing with progressively longer prompts reveals that it begins producing nonsensical output (e.g., "<PAD><PAD>...") beyond 8192 tokens, which aligns with the "max_position_embeddings" value in the config.json file

🤔 Not sure what would cause that. Do you have a prompt that should work but doesn't in the MLX version? Also, if you are able to provide some expected output, that would be helpful.

The config also lists a "rope_theta" value, suggesting it plays a role in achieving the large context window. Is RoPE supported in MLX?

MLX has RoPE and it should be used correctly already.
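
For reference, a minimal sketch of MLX's RoPE module with illustrative values (this is not the actual Command-R code path):

import mlx.core as mx
import mlx.nn as nn

head_dim = 128
rope = nn.RoPE(head_dim, traditional=False, base=10000)  # base plays the role of rope_theta

# Rotate queries/keys of shape (batch, num_heads, seq_len, head_dim)
x = mx.random.normal((1, 8, 16, head_dim))
y = rope(x)                  # positions 0..15
y_next = rope(x, offset=16)  # continue from a KV-cache offset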

awni avatar Apr 08 '24 13:04 awni

I'm getting random Cyrillic in my responses when using tokenizer.apply_tool_use_template. Anyone else? It only seems to happen when using that tool template from the tokenizer.

Example output:

Write 'Action:' followed by a json-formatted list of actions that you want to perform in order to produce a good response to the user's last input. You can use any of the supplied tools any number of times, but you should aim to execute the minimum number of necessary actions for the input. You should use the directly-answer tool if calling the other tools is unnecessary. The list of actions you want to call should be formatted as a list of json objects, for example:

[
    {
        "tool_name": title of the tool in the specification,
        "parameters": a dict of parameters to input into the tool as they are defined in the specs, or {} if it takes no parameters
    }
]```<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
Action: ```json
[
    {
        "tool некоторыми": {},
        "tool_name": "internet_search"
    } forniscono]
```<EOS_TOKEN>

fblissjr avatar Apr 08 '24 15:04 fblissjr

Ignore this, I was calling the tokenizer twice. Fixed it in my code here for anyone who wants to test tool use (apologies in advance if there are bugs still lurking): https://github.com/fblissjr/mlx-funbox

fblissjr avatar Apr 08 '24 16:04 fblissjr

Looks like there's still random switching into other languages and random Cyrillic (using a simple generate + apply tool template). Has anyone tested on CUDA to see if the behavior is similar?

fblissjr avatar Apr 08 '24 16:04 fblissjr

Looks like it's the tokenizer.json that is not converting correctly. See the tokenizer.json from the Cohere HF model repo: [screenshot of the original tokenizer.json]

Compared to a fresh mlx_lm.convert -q (no other params) conversion I just did from that same repo 20 minutes ago, which also matches the tokenizer.json from the mlx-community quant uploaded earlier (mlx-community/c4ai-command-r-plus-4bit): [screenshot of the converted tokenizer.json]

fblissjr avatar Apr 08 '24 17:04 fblissjr

Copying over the original Cohere tokenizer.json (https://huggingface.co/CohereForAI/c4ai-command-r-plus/blob/main/tokenizer.json) completely fixes this issue in my testing (output generation is slow, but so far so good!)

My guess is something is happening in the mlx_lm.convert process due to the large size of the vocab + the multilingual nature of the tokenizer + the strange tokenizer.json formatting.

edit: generation speed is also slightly faster now due to the correct tokenizer being used.
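
For anyone who wants to apply the same workaround, a minimal sketch (the destination path is hypothetical):

# Fetch the original tokenizer.json and overwrite the converted copy
from huggingface_hub import hf_hub_download
import shutil

src = hf_hub_download("CohereForAI/c4ai-command-r-plus", "tokenizer.json")
shutil.copy(src, "/path/to/converted-mlx-model/tokenizer.json")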

fblissjr avatar Apr 08 '24 17:04 fblissjr

That is very odd. The tokenizer copying is very simple in MLX LM. We basically load with Hugging Face and then save it with Hugging Face. There is no MLX code involved. https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/utils.py#L619

I wonder if we are somehow using the API incorrectly or maybe there is a bug in the way it's saved with Transformers.

awni avatar Apr 08 '24 17:04 awni

@fblissjr you can reproduce the behavior with:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-plus")
tokenizer.save_pretrained(".")

I feel that should not break the tokenizer... so it might be worth filing an issue on the Cohere HF repo or the Transformers repo? Wdyt?

awni avatar Apr 08 '24 17:04 awni

@awni my guess is the latter. It looks more like it's saved incorrectly (and oddly, just by looking at it) in the HF repo. I haven't seen a tokenizer.json like this before. Here's a quick sample of a page or so of the tokenizer.json from https://huggingface.co/CohereForAI/c4ai-command-r-plus/blob/main/tokenizer.json

{"version": "1.0", "truncation": null, "padding": null, "added_tokens": [{"id": 0, "content": "<PAD>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true}, {"id": 1, "content": "<UNK>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true}, {"id": 2, "content": "<CLS>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true}, {"id": 3, "content": "<SEP>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true}, {"id": 4, "content": "<MASK_TOKEN>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true}, {"id": 5, "content": "<BOS_TOKEN>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true}, {"id": 6, "content": "<EOS_TOKEN>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true}, {"id": 7, "content": "<EOP_TOKEN>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true}, {"id": 255000, "special": false, "content": "<|START_OF_TURN_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255001, "special": false, "content": "<|END_OF_TURN_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255002, "special": false, "content": "<|YES_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255003, "special": false, "content": "<|NO_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255004, "special": false, "content": "<|GOOD_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255005, "special": false, "content": "<|BAD_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255006, "special": false, "content": "<|USER_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255007, "special": false, "content": "<|CHATBOT_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255008, "special": false, "content": "<|SYSTEM_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255009, "special": false, "content": "<|USER_0_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255010, "special": false, "content": "<|USER_1_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255011, "special": false, "content": "<|USER_2_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255012, "special": false, "content": "<|USER_3_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255013, "special": false, "content": "<|USER_4_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255014, "special": false, "content": "<|USER_5_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255015, "special": false, "content": "<|USER_6_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255016, "special": false, "content": "<|USER_7_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255017, "special": false, "content": 
"<|USER_8_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255018, "special": false, "content": "<|USER_9_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255019, "special": false, "content": "<|EXTRA_0_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255020, "special": false, "content": "<|EXTRA_1_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255021, "special": false, "content": "<|EXTRA_2_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255022, "special": false, "content": "<|EXTRA_3_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255023, "special": false, "content": "<|EXTRA_4_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255024, "special": false, "content": "<|EXTRA_5_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255025, "special": false, "content": "<|EXTRA_6_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255026, "special": false, "content": "<|EXTRA_7_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255027, "special": false, "content": "<|EXTRA_8_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255028, "special": false, "content": "<|EXTRA_9_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}], "normalizer": {"type": "NFC"}, "pre_tokenizer": {"type": "Sequence", "pretokenizers": [{"type": "Digits", "individual_digits": true}, {"type": "ByteLevel", "add_prefix_space": false, "trim_offsets": true, "use_regex": true}]}, "post_processor": {"add_prefix_space": true, "trim_offsets": false, "use_regex": true, "type": "TemplateProcessing", "single": [{"SpecialToken": {"id": "<BOS_TOKEN>", "type_id": 0}}, {"Sequence": {"id": "A", "type_id": 0}}, {"SpecialToken": {"id": "<|END_OF_TURN_TOKEN|>", "type_id": 1}}, {"SpecialToken": {"id": "<EOS_TOKEN>", "type_id": 1}}], "pair": [{"SpecialToken": {"id": "<BOS_TOKEN>", "type_id": 0}}, {"Sequence": {"id": "A", "type_id": 0}}, {"Sequence": {"id": "B", "type_id": 1}}, {"SpecialToken": {"id": "<|END_OF_TURN_TOKEN|>", "type_id": 1}}, {"SpecialToken": {"id": "<EOS_TOKEN>", "type_id": 1}}], "special_tokens": {"<BOS_TOKEN>": {"id": "<BOS_TOKEN>", "ids": [5], "tokens": ["<BOS_TOKEN>"]}, "<EOS_TOKEN>": {"id": "<EOS_TOKEN>", "ids": [6], "tokens": ["<EOS_TOKEN>"]}, "<|END_OF_TURN_TOKEN|>": {"id": "<|END_OF_TURN_TOKEN|>", "ids": [255001], "tokens": ["<|END_OF_TURN_TOKEN|>"]}}}, "decoder": {"type": "ByteLevel", "add_prefix_space": true, "trim_offsets": true, "use_regex": true}, "model": {"type": "BPE", "dropout": null, "unk_token": null, "continuing_subword_prefix": null, "end_of_word_suffix": null, "fuse_unk": false, "byte_fallback": false, "vocab": {"<PAD>": 0, "<UNK>": 1, "<CLS>": 2, "<SEP>": 3, "<MASK_TOKEN>": 4, "<BOS_TOKEN>": 5, "<EOS_TOKEN>": 6, "<EOP_TOKEN>": 7, "!": 8, """: 9, "#": 10, "$": 11, "%": 12, "&": 13, "'": 14, "(": 15, ")": 16, "*": 17, "+": 18, ",": 19, "-": 20, ".": 21, "/": 22, "0": 23, "1": 24, "2": 25, "3": 26, "4": 27, "5": 28, "6": 29, "7": 30, "8": 31, "9": 32, ":": 33, ";": 34, "<": 35, "=": 36, ">": 37, "?": 38, "@": 39, "A": 40, "B": 41, "C": 42, "D": 43, "E": 44, "F": 45, "G": 46, "H": 47, 
"I": 48, "J": 49, "K": 50, "L": 51, "M": 52, "N": 53, "O": 54, "P": 55, "Q": 56, "R": 57, "S": 58, "T": 59, "U": 60, "V": 61, "W": 62, "X": 63, "Y": 64, "Z": 65, "[": 66, "\": 67, "]": 68, "^": 69, "_": 70, "`": 71, "a": 72, "b": 73, "c": 74, "d": 75, "e": 76, "f": 77, "g": 78, "h": 79, "i": 80, "j": 81, "k": 82, "l": 83, "m": 84, "n": 85, "o": 86, "p": 87, "q": 88, "r": 89, "s": 90, "t": 91, "u": 92, "v": 93, "w": 94, "x": 95, "y": 96, "z": 97, "{": 98, "|": 99, "}": 100, "~": 101, "\u00a1": 102, "\u00a2": 103, "\u00a3": 104, "\u00a4": 105, "\u00a5": 106, "\u00a6": 107, "\u00a7": 108, "\u00a8": 109, "\u00a9": 110, "\u00aa": 111, "\u00ab": 112, "\u00ac": 113, "\u00ae": 114, "\u00af": 115, "\u00b0": 116, "\u00b1": 117, "\u00b2": 118, "\u00b3": 119, "\u00b4": 120, "\u00b5": 121, "\u00b6": 122, "\u00b7": 123, "\u00b8": 124, "\u00b9": 125, "\u00ba": 126, "\u00bb": 127, "\u00bc": 128, "\u00bd": 129, "\u00be": 130, "\u00bf": 131, "\u00c0": 132, "\u00c1": 133, "\u00c2": 134, "\u00c3": 135, "\u00c4": 136, "\u00c5": 137, "\u00c6": 138, "\u00c7": 139, "\u00c8": 140, "\u00c9": 141, "\u00ca": 142, "\u00cb": 143, "\u00cc": 144, "\u00cd": 145, "\u00ce": 146, "\u00cf": 147, "\u00d0": 148, "\u00d1": 149, "\u00d2": 150, "\u00d3": 151, "\u00d4": 152, "\u00d5": 153, "\u00d6": 154, "\u00d7": 155, "\u00d8": 156, "\u00d9": 157, "\u00da": 158, "\u00db": 159, "\u00dc": 160, "\u00dd": 161, "\u00de": 162, "\u00df": 163, "\u00e0": 164, "\u00e1": 165, "\u00e2": 166, "\u00e3": 167, "\u00e4": 168, "\u00e5": 169, "\u00e6": 170, "\u00e7": 171, "\u00e8": 172, "\u00e9": 173, "\u00ea": 174, "\u00eb": 175, "\u00ec": 176, "\u00ed": 177, "\u00ee": 178, "\u00ef": 179, "\u00f0": 180, "\u00f1": 181, "\u00f2": 182, "\u00f3": 183, "\u00f4": 184, "\u00f5": 185, "\u00f6": 186, "\u00f7": 187, "\u00f8": 188, "\u00f9": 189, "\u00fa": 190, "\u00fb": 191, "\u00fc": 192, "\u00fd": 193, "\u00fe": 194, "\u00ff": 195, "\u0100": 196, "\u0101": 197, "\u0102": 198, "\u0103": 199, "\u0104": 200, "\u0105": 201, "\u0106": 202, "\u0107": 203, "\u0108": 204, "\u0109": 205, "\u010a": 206, "\u010b": 207, "\u010c": 208, "\u010d": 209, "\u010e": 210, "\u010f": 211, "\u0110": 212, "\u0111": 213, "\u0112": 214, "\u0113": 215, "\u0114": 216, "\u0115": 217, "\u0116": 218, "\u0117": 219, "\u0118": 220, "\u0119": 221, "\u011a": 222, "\u011b": 223, "\u011c": 224, "\u011d": 225, "\u011e": 226, "\u011f": 227, "\u0120": 228, "\u0121": 229, "\u0122": 230, "\u0123": 231, "\u0124": 232, "\u0125": 233, "\u0126": 234, "\u0127": 235, "\u0128": 236, "\u0129": 237, "\u012a": 238, "\u012b": 239, "\u012c": 240, "\u012d": 241, "\u012e": 242, "\u012f": 243, "\u0130": 244, "\u0131": 245, "\u0132": 246, "\u0133": 247, "\u0134": 248, "\u0135": 249, "\u0136": 250, "\u0137": 251, "\u0138": 252, "\u0139": 253, "\u013a": 254, "\u013b": 255, "\u013c": 256, "\u013d": 257, "\u013e": 258, "\u013f": 259, "\u0140": 260, "\u0141": 261, "\u0142": 262, "\u0143": 263, "\u200d": 264, "\u203c": 265, "\u2049": 266, "\u20e3": 267, "\u2122": 268, "\u2139": 269, "\u2194": 270, "\u2195": 271, "\u2196": 272, "\u2197": 273, "\u2198": 274, "\u2199": 275, "\u21a9": 276, "\u21aa": 277, "\u231a": 278, "\u231b": 279, "\u2328": 280, "\u23cf": 281, "\u23e9": 282, "\u23ea": 283, "\u23eb": 284, "\u23ec": 285, "\u23ed": 286, "\u23ee": 287, "\u23ef": 288, "\u23f0": 289, "\u23f1": 290, "\u23f2": 291, "\u23f3": 292, "\u23f8": 293, "\u23f9": 294, "\u23fa": 295, "\u24c2": 296, "\u25aa": 297, "\u25ab": 298, "\u25b6": 299, "\u25c0": 300, "\u25fb": 301, "\u25fc": 302, "\u25fd": 303, "\u25fe": 304, "\u2600": 305, 
"\u2601": 306, "\u2602": 307, "\u2603": 308, "\u2604": 309, "\u260e": 310, "\u2611": 311, "\u2614": 312, "\u2615": 313, "\u2618": 314, "\u261d": 315, "\u2620": 316, "\u2622": 317, "\u2623": 318, "\u2626": 319, "\u262a": 320, "\u262e": 321, "\u262f": 322, "\u2638": 323, "\u2639": 324, "\u263a": 325, "\u2640": 326, "\u2642": 327, "\u2648": 328, "\u2649": 329, "\u264a": 330, "\u264b": 331, "\u264c": 332, "\u264d": 333, "\u264e": 334, "\u264f": 335, "\u2650": 336, "\u2651": 337, "\u2652": 338, "\u2653": 339, "\u265f": 340, "\u2660": 341, "\u2663": 342, "\u2665": 343, "\u2666": 344, "\u2668": 345, "\u267b": 346, "\u267e": 347, "\u267f": 348, "\u2692": 349, "\u2693": 350, "\u2694": 351, "\u2695": 352, "\u2696": 353, "\u2697": 354, "\u2699": 355, "\u269b": 356, "\u269c": 357, "\u26a0": 358, "\u26a1": 359, "\u26a7": 360, "\u26aa": 361, "\u26ab": 362, "\u26b0": 363, "\u26b1": 364, "\u26bd": 365, "\u26be": 366, "\u26c4": 367, "\u26c5": 368, "\u26c8": 369, "\u26ce": 370, "\u26cf": 371, "\u26d1": 372, "\u26d3": 373, "\u26d4": 374, "\u26e9": 375, "\u26ea": 376, "\u26f0": 377, "\u26f1": 378, "\u26f2": 379, "\u26f3": 380, "\u26f4": 381, "\u26f5": 382, "\u26f7": 383, "\u26f8": 384, "\u26f9": 385, "\u26fa": 386, "\u26fd": 387, "\u2702": 388, "\u2705": 389, "\u2708": 390, "\u2709": 391, "\u270a": 392, "\u270b": 393, "\u270c": 394, "\u270d": 395, "\u270f": 396, "\u2712": 397, "\u2714": 398, "\u2716": 399, "\u271d": 400, "\u2721": 401, "\u2728": 402, "\u2733": 403, "\u2734": 404, "\u2744": 405, "\u2747": 406, "\u274c": 407, "\u274e": 408, "\u2753": 409, "\u2754": 410, "\u2755": 411, "\u2757": 412, "\u2763": 413, "\u276

fblissjr avatar Apr 08 '24 17:04 fblissjr

@fblissjr you can reproduce the behavior with:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-plus")
tokenizer.save_pretrained(".")

I feel that should not break the tokenizer... so it might be worth filing an issue on the Cohere HF repo or the Transformers repo? Wdyt?

Agreed. I made a community post on HF here: https://huggingface.co/CohereForAI/c4ai-command-r-plus/discussions/15

and here: https://github.com/huggingface/transformers/pull/30027

fblissjr avatar Apr 08 '24 18:04 fblissjr

So this is interesting: the tokenizer.json in the bitsandbytes repo linked from the main Cohere repo is a different size, and looks nothing like the original. https://huggingface.co/CohereForAI/c4ai-command-r-plus-4bit/blob/main/tokenizer.json

fblissjr avatar Apr 08 '24 20:04 fblissjr

Another interesting difference between the 4-bit bnb tokenizer and the original: in the original, token id 255001 (<|END_OF_TURN_TOKEN|>) has special set to False. In the 4-bit bnb one, it's True.
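
A minimal sketch of how to diff the added_tokens entries between the two files (paths hypothetical):

import json

def added_tokens(path):
    # Map token id -> full added_tokens entry
    with open(path) as f:
        return {t["id"]: t for t in json.load(f)["added_tokens"]}

orig = added_tokens("original/tokenizer.json")
bnb = added_tokens("bnb-4bit/tokenizer.json")

for tok_id in sorted(set(orig) | set(bnb)):
    if orig.get(tok_id) != bnb.get(tok_id):
        print(tok_id, orig.get(tok_id), bnb.get(tok_id))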

fblissjr avatar Apr 08 '24 20:04 fblissjr

Per comments on the Hugging Face repo, the differences between the two tokenizer.json files are just Unicode escaping differences. I'll assume I've got a bug on my end unless anyone else sees the same.

fblissjr avatar Apr 08 '24 23:04 fblissjr

# Libraries
from transformers import AutoTokenizer
import mlx.core as mx
import mlx_lm
from mlx_lm.utils import load_model, get_model_path


# Language Model
PATH_MODEL = "/Users/admin/Models/CohereForAI/c4ai-command-r-plus-4bit/"


# Load the model & tokenizer
tokenizer = AutoTokenizer.from_pretrained(PATH_MODEL)
model = load_model(get_model_path(PATH_MODEL))


# Incrementally longer texts
...
text_7500_tokens = "Lorem ipsum dolor sit..."    # Works
text_8500_tokens = "Lorem ipsum dolor sit..."    # Stops working
...


# Format as list of messages
messages = [
    {"role": "user", "content": f"{text_8500_tokens}\n\nSummarize the text above in one short paragraph."}    # <-- set a text
]


# Apply chat template
prompt_decorated = tokenizer.apply_chat_template(
    messages, 
    tokenize=False, 
    add_generation_prompt=True
)


# Generate
response = mlx_lm.generate(
    model,
    tokenizer,
    prompt_decorated,
    temp=0.0,
    max_tokens=64
)

This is what I have been using – I removed the texts, which are just from some random Wikipedia page. The output is good until I try the 8500-token text, which just outputs <PAD><PAD><PAD><PAD><PAD>...

jeanromainroy avatar Apr 09 '24 14:04 jeanromainroy

[repro script and description quoted from the previous comment]

Have you tried with apply_tool_use_template by chance? Curious if you see any of the oddities I see when using it.

fblissjr avatar Apr 09 '24 15:04 fblissjr

Hey guys @awni, @fblissjr and @jeanromainroy,

The Cohere team limited the context to 8k for all Command-R variants on purpose. If you check the config file for both R v01 and R+, max_position_embeddings is set to 8192.

It's a limit to avoid users experiencing OOM.

You can read more here: https://huggingface.co/CohereForAI/c4ai-command-r-v01/discussions/12

Blaizzy avatar Apr 09 '24 16:04 Blaizzy

Hey @Blaizzy, I have run the exact same test with the new llama.cpp implementation of Command-R+ and it works way above 8k tokens.

jeanromainroy avatar Apr 09 '24 16:04 jeanromainroy

Copying over the original Cohere tokenizer.json (https://huggingface.co/CohereForAI/c4ai-command-r-plus/blob/main/tokenizer.json) completely fixes this issue in my testing (output generation is slow, but so far so good!)

My guess is something is happening in the mlx_lm.convert process due to the large size of the vocab + the multilingual nature of the tokenizer + the strange tokenizer.json formatting.

edit: generation speed is also slightly faster now due to the correct tokenizer being used.

@fblissjr Indeed, the tokenizer created by the conversion is ~2 MB smaller than the original.

I updated as you suggested. Can you check it?

Blaizzy avatar Apr 09 '24 17:04 Blaizzy

Hey @Blaizzy, I have run the exact same test with the new llama.cpp implementation of Command-R+ and it works way above 8k tokens.

@jeanromainroy can you try again with the change in this branch; if it works I will make a PR.

pip install -U git+https://github.com/Blaizzy/mlx-examples.git@pc/commandR#subdirectory=llms --use-pep517 

Link: https://github.com/Blaizzy/mlx-examples/tree/pc/commandR

Blaizzy avatar Apr 09 '24 17:04 Blaizzy

You can also try increasing the default max_position_embeddings and let me know if it works.
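
Something like this, as a sketch (path and value illustrative; a larger window also means more memory use):

import json

path = "/path/to/local-model/config.json"
with open(path) as f:
    config = json.load(f)

config["max_position_embeddings"] = 131072  # 128k
with open(path, "w") as f:
    json.dump(config, f, indent=2)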

Blaizzy avatar Apr 09 '24 17:04 Blaizzy

Copying over the original Cohere tokenizer.json (https://huggingface.co/CohereForAI/c4ai-command-r-plus/blob/main/tokenizer.json) completely fixes this issue in my testing (output generation is slow, but so far so good!) My guess is something is happening in the mlx_lm.convert process due to the large size of the vocab + the multilingual nature of the tokenizer + the strange tokenizer.json formatting. edit: generation speed is also slightly faster now due to the correct tokenizer being used.

@fblissjr Indeed the tokenizer created from the conversion is slightly smaller ~2MB than the original.

I updated as you suggested. Can you check it?

Actually, I did this myself yesterday with my own quant, and output was better and faster; no idea why. Now I'm unsure if I just had a bug somewhere on my end or if it actually made a difference.

I'm planning to test out a larger CUDA machine later today or tomorrow to see how it works natively.

fblissjr avatar Apr 09 '24 17:04 fblissjr

Let me know how it goes; for now, according to your report, the issue should be fixed.

Blaizzy avatar Apr 09 '24 17:04 Blaizzy

Hey @Blaizzy, I tried your fork and the model is still outputting <PAD><PAD><PAD>... when I provide a long prompt.

jeanromainroy avatar Apr 09 '24 18:04 jeanromainroy

Hey @Blaizzy, I tried your fork and the model is still outputting <PAD><PAD><PAD>... when I provide a long prompt.

I have made a new change; can you try it again please? :)

Blaizzy avatar Apr 09 '24 18:04 Blaizzy

Wait, I think I got it!

Give me 30 min :)

Blaizzy avatar Apr 09 '24 18:04 Blaizzy

@jeanromainroy can you try this branch; the previous one had a git issue:

https://github.com/Blaizzy/mlx-examples/tree/pc/command-R

Blaizzy avatar Apr 09 '24 18:04 Blaizzy

Still outputting <PAD><PAD><PAD>... :(

jeanromainroy avatar Apr 09 '24 19:04 jeanromainroy

Only PAD? Can you share the whole output?

Blaizzy avatar Apr 09 '24 19:04 Blaizzy

It's outputting <PAD> for as long as I let it. In other words, max_tokens=256 results in 256 x <PAD>.

jeanromainroy avatar Apr 09 '24 20:04 jeanromainroy

Got it!

@awni the Cohere team set model_max_length to 128K on both Command-R models.

Is there a way of using this number with nn.RoPE? Are any deeper changes needed? If so, please point me to them and I can work on it.
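
For reference, a sketch of where that value lives (path hypothetical; model_max_length normally sits in tokenizer_config.json rather than config.json):

import json

with open("/path/to/model/tokenizer_config.json") as f:
    tok_config = json.load(f)

print(tok_config["model_max_length"])  # 128K per the Cohere repos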

Blaizzy avatar Apr 09 '24 21:04 Blaizzy