text-generation-webui

Add c4ai-command-r-v01 Support

Open · Fenfel opened this issue 10 months ago · 31 comments

Why can't I run command-r in the webui, even though support for it was added to the main llama.cpp branch?

Links: original model · GGUF model

log

06:09:44-615060 INFO     Loading "c4ai-command-r-v01-Q4_K_M.gguf"
06:09:44-797285 INFO     llama.cpp weights detected: "models/c4ai-command-r-v01-Q4_K_M.gguf"
llama_model_loader: loaded meta data with 23 key-value pairs and 322 tensors from models/c4ai-command-r-v01-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = command-r
llama_model_loader: - kv   1:                               general.name str              = 9fe64d67d13873f218cb05083b6fc2faab2d034a
llama_model_loader: - kv   2:                      command-r.block_count u32              = 40
llama_model_loader: - kv   3:                   command-r.context_length u32              = 131072
llama_model_loader: - kv   4:                 command-r.embedding_length u32              = 8192
llama_model_loader: - kv   5:              command-r.feed_forward_length u32              = 22528
llama_model_loader: - kv   6:             command-r.attention.head_count u32              = 64
llama_model_loader: - kv   7:          command-r.attention.head_count_kv u32              = 64
llama_model_loader: - kv   8:                   command-r.rope.freq_base f32              = 8000000.000000
llama_model_loader: - kv   9:     command-r.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 15
llama_model_loader: - kv  11:                      command-r.logit_scale f32              = 0.062500
llama_model_loader: - kv  12:                command-r.rope.scaling.type str              = none
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,256000]  = ["<PAD>", "<UNK>", "<CLS>", "<SEP>", ...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,253333]  = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ a...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 5
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 255001
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   41 tensors
llama_model_loader: - type q4_K:  240 tensors
llama_model_loader: - type q6_K:   41 tensors
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'command-r'
llama_load_model_from_file: failed to load model
06:09:44-885375 ERROR    Failed to load the model.
Traceback (most recent call last):
  File "/home/fenfel/text-gen-install/text-generation-webui/modules/ui_model_menu.py", line 245, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(selected_model, loader)
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fenfel/text-gen-install/text-generation-webui/modules/models.py", line 87, in load_model
    output = load_func_map[loader](model_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fenfel/text-gen-install/text-generation-webui/modules/models.py", line 250, in llamacpp_loader
    model, tokenizer = LlamaCppModel.from_pretrained(model_file)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fenfel/text-gen-install/text-generation-webui/modules/llamacpp_model.py", line 102, in from_pretrained
    result.model = Llama(**params)
                   ^^^^^^^^^^^^^^^
  File "/home/fenfel/text-gen-install/text-generation-webui/installer_files/env/lib/python3.11/site-packages/llama_cpp_cuda/llama.py", line 311, in __init__
    self._model = _LlamaModel(
                  ^^^^^^^^^^^^
  File "/home/fenfel/text-gen-install/text-generation-webui/installer_files/env/lib/python3.11/site-packages/llama_cpp_cuda/_internals.py", line 55, in __init__
    raise ValueError(f"Failed to load model from file: {path_model}")
ValueError: Failed to load model from file: models/c4ai-command-r-v01-Q4_K_M.gguf

Exception ignored in: <function LlamaCppModel.__del__ at 0x7f0afd4ae340>
Traceback (most recent call last):
  File "/home/fenfel/text-gen-install/text-generation-webui/modules/llamacpp_model.py", line 58, in __del__
    del self.model
        ^^^^^^^^^^
AttributeError: 'LlamaCppModel' object has no attribute 'model'

Fenfel avatar Mar 27 '24 03:03 Fenfel

Waiting for llama-cpp-python to be bumped to 0.2.57, I think.

zhenweiding avatar Mar 27 '24 05:03 zhenweiding

It just updated to a new version, and it looked like llama.cpp was updated too, but it still doesn't work. Strange... KoboldCpp also won't load the model.

mlsterpr0 avatar Mar 29 '24 19:03 mlsterpr0

You can run it with Ollama; check this: https://ollama.com/library/command-r

zhenweiding avatar Mar 29 '24 22:03 zhenweiding

Someone posted tips on how to run the GGUF in oobabooga (I have not tried this personally, so I don't know if it works): https://old.reddit.com/r/LocalLLaMA/comments/1bpfx92/commandr_on_textgenerationwebui/

If you have 24 GB or more of VRAM, you can run exl2. On a 4090 under Win11, I'm able to run the exl2 3.0bpw quant. A maximum context length of 7168 fits into VRAM with the 4-bit cache.

You need to manually update exllamav2 to 0.0.16 (https://github.com/turboderp/exllamav2/releases):

  • Run cmd_windows.bat
  • pip install https://github.com/turboderp/exllamav2/releases/download/v0.0.16/exllamav2-0.0.16+cu121-cp311-cp311-win_amd64.whl (or whichever version is correct for you).
  • Use the correct instruction template: https://pastebin.com/FfzkX7Zm (from the Reddit post above, thanks to kiselsa!)

The quants are here:

https://huggingface.co/turboderp/command-r-v01-35B-exl2

jepjoo avatar Mar 30 '24 07:03 jepjoo

I can run it on my AMD machines but not on Intel. On all Intel (i9), I get the error: Error: exception error loading model architecture: unknown model architecture: 'command-r'

All have 64 GB RAM + a 12 GB Nvidia GPU.

UPDATE: I reinstalled Ollama. Works now for me.

danielrpfeiffer avatar Mar 30 '24 16:03 danielrpfeiffer

They now also released a larger, 104B parameter model: C4AI Command R+

EwoutH avatar Apr 04 '24 15:04 EwoutH

c4ai-command-r-v01 now loads in the Ooba dev branch since 3952560da82d383f8d6dfe8a848925802e417a20. However, it seems to randomly output different languages (or gibberish?), maybe due to a wrong/unknown chat format? Other times it seems to work fine.

TheLounger avatar Apr 04 '24 20:04 TheLounger

Unfortunately, I get a segmentation fault every time with the new llama_cpp_python...

nalf3in avatar Apr 04 '24 21:04 nalf3in

Unfortunately, I get a segmentation fault every time with the new llama_cpp_python...

Yeah, same here, and not just on Command-r, so I've reverted llama-cpp-python for now. Should be fixed according to abetlen but I guess it isn't after all.

TheLounger avatar Apr 04 '24 21:04 TheLounger

Unfortunately, I get a segmentation fault every time with the new llama_cpp_python...

Yeah, same here, and not just on Command-r, so I've reverted llama-cpp-python for now. Should be fixed according to abetlen but I guess it isn't after all.

I get a segfault with exllamav2[_hf] too when loading the command-r-plus exl2. Other exl2 models work fine (edit: both on the dev branch), so that points to a shared dependency of the command-r architecture (I still know nothing about architectures).

Edit 2: Building exllamav2 from upstream works quite well. Perhaps a hint of repetition with default instruct settings on Divine Intellect. I'll try GGUF next.

Edit 3: Got GGUF working as well. I built llama-cpp-python with the latest changes from https://github.com/ggerganov/llama.cpp/pull/6491 using CMAKE_ARGS="-DLLAMA_CUDA=ON" FORCE_CMAKE=1 pip3 install -e . (I had to copy a libmvec library to my lib directory too), loaded a Q4 quant with 45 layers offloaded, and got 1~2 t/s. However: repetition again (using Divine Intellect - I should probably read the model card).

randoentity avatar Apr 07 '24 10:04 randoentity
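
An aside for anyone following along (an editorial sketch, not the commenter's actual code): the load described above can be reproduced with llama-cpp-python directly, assuming a build with command-r support and CUDA enabled. The model path, layer count, and context size below are placeholders.

from llama_cpp import Llama

# Placeholder path to a Q4 quant; adjust to your local file.
llm = Llama(
    model_path="models/c4ai-command-r-plus-Q4_K_M.gguf",
    n_gpu_layers=45,  # offload as many layers as fit in VRAM; the rest run on the CPU
    n_ctx=4096,       # keep the context modest to leave room for the KV cache
)

# Plain-text prompt to keep the example simple; real use should apply the Command R chat format.
out = llm("The Command R model family is", max_tokens=32)
print(out["choices"][0]["text"])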

c4ai-command-r-plus works if you bump exllamav2 to 0.0.18. This may also fix support for c4ai-command-r-v01.

Notably, I've not been able to get either model working inside text-generation-webui with regular transformers. While it will load, it only outputs gibberish for me (repeated words).

I'm running snapshot-2024-04-07 and changed requirements.txt as follows:

https://github.com/turboderp/exllamav2/releases/download/v0.0.18/exllamav2-0.0.18+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/turboderp/exllamav2/releases/download/v0.0.18/exllamav2-0.0.18+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
https://github.com/turboderp/exllamav2/releases/download/v0.0.18/exllamav2-0.0.18+cu121-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/turboderp/exllamav2/releases/download/v0.0.18/exllamav2-0.0.18+cu121-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
https://github.com/turboderp/exllamav2/releases/download/v0.0.18/exllamav2-0.0.18-py3-none-any.whl; platform_system == "Linux" and platform_machine != "x86_64"

Chat template:

{{- '<BOS_TOKEN>' -}}
{%- for message in messages %}
    {%- if message['role'] == 'system' -%}
        {%- if message['content'] -%}
            {{- '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + message['content'] + '<|END_OF_TURN_TOKEN|>' -}}
        {%- endif -%}
        {%- if user_bio -%}
            {{- '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + user_bio + '<|END_OF_TURN_TOKEN|>' -}}
        {%- endif -%}
    {%- else -%}
        {%- if message['role'] == 'user' -%}
            {{- '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + name1 + ': ' + message['content'] + '<|END_OF_TURN_TOKEN|>' -}}
        {%- else -%}
            {{- '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' + name2 + ': ' + message['content'] + '<|END_OF_TURN_TOKEN|>' -}}
        {%- endif -%}
    {%- endif -%}
{%- endfor -%}

Update.

c4ai-command-r-v01 does not work due to a tokenizer config clash:

Traceback (most recent call last):
File "/home/app/text-generation-webui/modules/ui_model_menu.py", line 245, in load_model_wrapper
shared.model, shared.tokenizer = load_model(selected_model, loader)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/app/text-generation-webui/modules/models.py", line 95, in load_model
tokenizer = load_tokenizer(model_name, model)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/app/text-generation-webui/modules/models.py", line 117, in load_tokenizer
tokenizer = AutoTokenizer.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/app/text-generation-webui/installer_files/env/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py", line 818, in from_pretrained
tokenizer_class = get_class_from_dynamic_module(class_ref, pretrained_model_name_or_path, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/app/text-generation-webui/installer_files/env/lib/python3.11/site-packages/transformers/dynamic_module_utils.py", line 501, in get_class_from_dynamic_module
return get_class_in_module(class_name, final_module)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/app/text-generation-webui/installer_files/env/lib/python3.11/site-packages/transformers/dynamic_module_utils.py", line 201, in get_class_in_module
module = importlib.machinery.SourceFileLoader(name, module_path).load_module()
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "", line 605, in _check_name_wrapper
File "", line 1120, in load_module
File "", line 945, in load_module
File "", line 290, in _load_module_shim
File "", line 721, in _load
File "", line 690, in _load_unlocked
File "", line 940, in exec_module
File "", line 241, in _call_with_frames_removed
File "/home/app/text-generation-webui/.cache/huggingface/modules/transformers_modules/command-r-v01-35B-exl2/tokenization_cohere_fast.py", line 31, in
from .configuration_cohere import CohereConfig
File "/home/app/text-generation-webui/.cache/huggingface/modules/transformers_modules/command-r-v01-35B-exl2/configuration_cohere.py", line 159, in
AutoConfig.register("cohere", CohereConfig)
File "/home/app/text-generation-webui/installer_files/env/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 1191, in register
CONFIG_MAPPING.register(model_type, config, exist_ok=exist_ok)
File "/home/app/text-generation-webui/installer_files/env/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 885, in register
raise ValueError(f"'{key}' is already used by a Transformers config, pick another name.")
ValueError: 'cohere' is already used by a Transformers config, pick another name.

Also, I suspect there is something funky going on with the context memory for c4ai-command-r-plus in text-generation-webui. I was only able to achieve a context of 9k with the 5.0bpw exl2 quant.

divine-taco avatar Apr 08 '24 00:04 divine-taco
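
An editorial note on the 'cohere' clash above: recent transformers releases (4.39 and later) already ship a native Cohere implementation, so the remote-code files bundled with the quant try to register a config name that is already taken. A minimal sketch of one possible workaround, assuming such a transformers version; the local path is a placeholder:

from transformers import AutoTokenizer

# Assumption: transformers >= 4.39, which includes CohereConfig and CohereTokenizerFast natively.
tokenizer = AutoTokenizer.from_pretrained(
    "models/command-r-v01-35B-exl2",  # placeholder local path to the downloaded quant
    trust_remote_code=False,          # fall back to the built-in Cohere classes instead of the bundled .py files
)
print(tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello!"}],
    tokenize=False,
    add_generation_prompt=True,
))

Deleting the bundled configuration_cohere.py and tokenization_cohere_fast.py from the model folder may also sidestep the duplicate registration, though that is untested here.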

@divine-taco Same here: it repeats gibberish using transformers in 8-bit and 4-bit. I have tried a lot of different settings and parameter changes. I can load it via transformers, but it will only output gibberish.

RandomInternetPreson avatar Apr 08 '24 22:04 RandomInternetPreson

Command R+ support was just added to llama.cpp: https://github.com/ggerganov/llama.cpp/pull/6491

christiandaley avatar Apr 09 '24 20:04 christiandaley

Has anyone got the 35B to run with oobabooga on a Mac?

Madd0g avatar Apr 09 '24 20:04 Madd0g

PSA: the dev branch now has a fix for this. For the gibberish, make sure to use the (now default) min_p preset.

randoentity avatar Apr 12 '24 17:04 randoentity
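
For readers unfamiliar with min_p: it keeps only the tokens whose probability is at least min_p times the top token's probability, then renormalizes, which tends to prune the stray low-probability tokens behind the gibberish. A rough editorial sketch of the idea (not the webui's sampler code; the logits and settings are illustrative):

import numpy as np

def min_p_sample(logits, min_p=0.05, temperature=1.0, rng=None):
    rng = rng or np.random.default_rng()
    # Convert logits to probabilities at the given temperature.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # Keep only tokens at least min_p times as likely as the most likely token.
    probs = np.where(probs >= min_p * probs.max(), probs, 0.0)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Toy logits over a five-token vocabulary.
print(min_p_sample(np.array([2.0, 1.5, 0.1, -1.0, -3.0]), min_p=0.1))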

Hmm, I guess the fix was just for the 35B version, not the Plus version? I grabbed the dev version and tried it out, without any change in the output for the c4ai-command-r-plus model.

https://github.com/oobabooga/text-generation-webui/issues/5838

RandomInternetPreson avatar Apr 12 '24 21:04 RandomInternetPreson

@RandomInternetPreson Sorry, I was probably too eager. I only retested the llama.cpp quants (exl2 was already working fine, especially since the min_p update). You are using the full model quantized on-the-fly with bitsandbytes, right? I'll try to reproduce the issue, but I'm not sure if I have enough memory. (Edit: I have only been testing Command R+; I haven't gotten around to the 35B model or the new Mistral models yet.)

randoentity avatar Apr 13 '24 03:04 randoentity

@randoentity No problem, I figured out my issue: https://github.com/oobabooga/text-generation-webui/issues/5838#issuecomment-2053670500

Perhaps it will help others encountering this in the future.

RandomInternetPreson avatar Apr 13 '24 15:04 RandomInternetPreson

It's a little annoying that I still can't run the SOTA 35B model in the most popular webUI.

Fenfel avatar Apr 15 '24 03:04 Fenfel

I don't think it is an issue with textgen; it's an issue with transformers. You can update to the dev version like I did and see if that fixes your issue.

RandomInternetPreson avatar Apr 15 '24 11:04 RandomInternetPreson

It's a little annoying that I still can't run the SOTA 35B model in the most popular webUI.

I'm running it just fine. Let me know if you need any help. The only thing I can't figure out is how to increase rope_freq_base to 8,000,000 in the GUI. It still runs fine though

oldgithubman avatar Apr 15 '24 18:04 oldgithubman
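
An editorial aside on the rope_freq_base question: the GGUF metadata in the first log already stores command-r.rope.freq_base = 8000000, and in llama.cpp a rope_freq_base of 0 means "use the model's default", so if the GUI passes 0 through, nothing should need to be changed. When loading the file directly with llama-cpp-python, the value can also be set explicitly; a minimal sketch with a placeholder path:

from llama_cpp import Llama

llm = Llama(
    model_path="models/c4ai-command-r-v01-Q4_K_M.gguf",  # placeholder path
    n_ctx=8192,
    n_gpu_layers=40,
    rope_freq_base=8_000_000.0,  # matches the value baked into the GGUF metadata above
)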

On my Mac, I recently pulled the latest oobabooga version and tried running this model. This is the only model that has made the entire laptop freeze, and I had to forcefully restart it.

Is it working for other Mac users? I tried the 35B. I could run 70B models on this laptop with the same CLI arguments, but maybe this model requires different CLI flags or something?

Madd0g avatar Apr 17 '24 10:04 Madd0g

I couldn't get it running on Linux with a 7900 XTX; I tried both transformers and llama.cpp.

hchasens avatar Apr 18 '24 03:04 hchasens

I couldn't get it running on Linux with a 7900 XTX; I tried both transformers and llama.cpp.

I have it running on Linux with a 4090, using llama.cpp through ooba. Good luck with AMD, though.

oldgithubman avatar Apr 18 '24 04:04 oldgithubman

It's a little annoying that I still can't run the SOTA 35B model in the most popular webUI.

I'm running it just fine. Let me know if you need any help. The only thing I can't figure out is how to increase rope_freq_base to 8,000,000 in the GUI. It still runs fine though

Yeah I can't get it to run for some reason. (even in dev branch)

Fenfel avatar Apr 18 '24 20:04 Fenfel

It's a little annoying that I still can't run the SOTA 35B model in the most popular webUI.

I'm running it just fine. Let me know if you need any help. The only thing I can't figure out is how to increase rope_freq_base to 8,000,000 in the GUI. It still runs fine though

Yeah I can't get it to run for some reason. (even in dev branch)

I can answer any specific questions you might have

oldgithubman avatar Apr 19 '24 02:04 oldgithubman

It's a little annoying that I still can't run the SOTA 35B model in the most popular webUI.

I'm running it just fine. Let me know if you need any help. The only thing I can't figure out is how to increase rope_freq_base to 8,000,000 in the GUI. It still runs fine though

Yeah I can't get it to run for some reason. (even in dev branch)

I can answer any specific questions you might have

Are you using GGUF or exllama?

Fenfel avatar Apr 19 '24 14:04 Fenfel

It's a little annoying that I still can't run the SOTA 35B model in the most popular webUI.

I'm running it just fine. Let me know if you need any help. The only thing I can't figure out is how to increase rope_freq_base to 8,000,000 in the GUI. It still runs fine though

Yeah I can't get it to run for some reason. (even in dev branch)

I can answer any specific questions you might have

Are you using GGUF or exllama?

GGUF

oldgithubman avatar Apr 19 '24 17:04 oldgithubman

It's a little annoying that I still can't run the SOTA 35B model in the most popular webUI.

I'm running it just fine. Let me know if you need any help. The only thing I can't figure out is how to increase rope_freq_base to 8,000,000 in the GUI. It still runs fine though

Yeah I can't get it to run for some reason. (even in dev branch)

I can answer any specific questions you might have

Are you using GGUF or exllama?

GGUF

I reinstalled the webui and still get the same error. I downloaded a new GGUF and the result is the same:

21:25:07-533993 ERROR    Failed to load the model.
Traceback (most recent call last):
  File "/home/fenfel/text-gen-install/text-generation-webui/modules/ui_model_menu.py", line 248, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(selected_model, loader)
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fenfel/text-gen-install/text-generation-webui/modules/models.py", line 94, in load_model
    output = load_func_map[loader](model_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fenfel/text-gen-install/text-generation-webui/modules/models.py", line 271, in llamacpp_loader
    model, tokenizer = LlamaCppModel.from_pretrained(model_file)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fenfel/text-gen-install/text-generation-webui/modules/llamacpp_model.py", line 102, in from_pretrained
    result.model = Llama(**params)
                   ^^^^^^^^^^^^^^^
  File "/home/fenfel/text-gen-install/text-generation-webui/installer_files/env/lib/python3.11/site-packages/llama_cpp_cuda/llama.py", line 337, in __init__
    self._ctx = _LlamaContext(
                ^^^^^^^^^^^^^^
  File "/home/fenfel/text-gen-install/text-generation-webui/installer_files/env/lib/python3.11/site-packages/llama_cpp_cuda/_internals.py", line 265, in __init__
    raise ValueError("Failed to create llama_context")
ValueError: Failed to create llama_context

Exception ignored in: <function LlamaCppModel.__del__ at 0x7f0114e07b00>
Traceback (most recent call last):
  File "/home/fenfel/text-gen-install/text-generation-webui/modules/llamacpp_model.py", line 58, in __del__
    del self.model
        ^^^^^^^^^^
AttributeError: 'LlamaCppModel' object has no attribute 'model'

Fenfel avatar Apr 20 '24 19:04 Fenfel

I asked llama-3-70b-instruct and it basically said it's a common, generic error. It suggested trying to run it on the CPU, or checking whether you have enough memory.

oldgithubman avatar Apr 20 '24 20:04 oldgithubman
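
A closing editorial guess, not a confirmed diagnosis: "Failed to create llama_context" is commonly the KV cache failing to allocate, and with a native context of 131072 tokens that is easy to hit. Loading with a much smaller n_ctx and fewer offloaded layers is a cheap way to test that theory; a sketch with placeholder numbers:

from llama_cpp import Llama

llm = Llama(
    model_path="models/c4ai-command-r-v01-Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,       # far below the 131072 advertised in the GGUF metadata
    n_gpu_layers=20,  # reduce further if VRAM is still exhausted
)

In the webui, the equivalent knobs should be the n_ctx and n-gpu-layers settings on the Model tab before loading.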