text-generation-webui
Add c4ai-command-r-v01 Support
Why can I still not run command-r in the webui, even though support for it was added to the main llama.cpp branch?
Log:
06:09:44-615060 INFO Loading "c4ai-command-r-v01-Q4_K_M.gguf"
06:09:44-797285 INFO llama.cpp weights detected: "models/c4ai-command-r-v01-Q4_K_M.gguf"
llama_model_loader: loaded meta data with 23 key-value pairs and 322 tensors from models/c4ai-command-r-v01-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = command-r
llama_model_loader: - kv 1: general.name str = 9fe64d67d13873f218cb05083b6fc2faab2d034a
llama_model_loader: - kv 2: command-r.block_count u32 = 40
llama_model_loader: - kv 3: command-r.context_length u32 = 131072
llama_model_loader: - kv 4: command-r.embedding_length u32 = 8192
llama_model_loader: - kv 5: command-r.feed_forward_length u32 = 22528
llama_model_loader: - kv 6: command-r.attention.head_count u32 = 64
llama_model_loader: - kv 7: command-r.attention.head_count_kv u32 = 64
llama_model_loader: - kv 8: command-r.rope.freq_base f32 = 8000000.000000
llama_model_loader: - kv 9: command-r.attention.layer_norm_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 15
llama_model_loader: - kv 11: command-r.logit_scale f32 = 0.062500
llama_model_loader: - kv 12: command-r.rope.scaling.type str = none
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,256000] = ["<PAD>", "<UNK>", "<CLS>", "<SEP>", ...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,253333] = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ a...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 5
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 255001
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 41 tensors
llama_model_loader: - type q4_K: 240 tensors
llama_model_loader: - type q6_K: 41 tensors
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'command-r'
llama_load_model_from_file: failed to load model
06:09:44-885375 ERROR Failed to load the model.
Traceback (most recent call last):
File "/home/fenfel/text-gen-install/text-generation-webui/modules/ui_model_menu.py", line 245, in load_model_wrapper
shared.model, shared.tokenizer = load_model(selected_model, loader)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fenfel/text-gen-install/text-generation-webui/modules/models.py", line 87, in load_model
output = load_func_map[loader](model_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fenfel/text-gen-install/text-generation-webui/modules/models.py", line 250, in llamacpp_loader
model, tokenizer = LlamaCppModel.from_pretrained(model_file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fenfel/text-gen-install/text-generation-webui/modules/llamacpp_model.py", line 102, in from_pretrained
result.model = Llama(**params)
^^^^^^^^^^^^^^^
File "/home/fenfel/text-gen-install/text-generation-webui/installer_files/env/lib/python3.11/site-packages/llama_cpp_cuda/llama.py", line 311, in __init__
self._model = _LlamaModel(
^^^^^^^^^^^^
File "/home/fenfel/text-gen-install/text-generation-webui/installer_files/env/lib/python3.11/site-packages/llama_cpp_cuda/_internals.py", line 55, in __init__
raise ValueError(f"Failed to load model from file: {path_model}")
ValueError: Failed to load model from file: models/c4ai-command-r-v01-Q4_K_M.gguf
Exception ignored in: <function LlamaCppModel.__del__ at 0x7f0afd4ae340>
Traceback (most recent call last):
File "/home/fenfel/text-gen-install/text-generation-webui/modules/llamacpp_model.py", line 58, in __del__
del self.model
^^^^^^^^^^
AttributeError: 'LlamaCppModel' object has no attribute 'model'
Waiting for the llama-cpp-python bump to 0.2.57, I think.
It just updated to a new version. It looked like llama.cpp was updated too, but it still doesn't work. Strange... KoboldCpp also won't load the model.
You can run it with Ollama; check this: https://ollama.com/library/command-r
Someone posted tips on how to run the GGUF in oobabooga (I have not tried this personally, so I don't know if it works): https://old.reddit.com/r/LocalLLaMA/comments/1bpfx92/commandr_on_textgenerationwebui/
If you have 24 GB or more of VRAM, you can run the exl2 quant. On a 4090 under Win11, I'm able to run exl2 3.0bpw. A maximum context length of 7168 fits into VRAM with the 4-bit cache.
You need to manually update to exllamav2 0.0.16 (https://github.com/turboderp/exllamav2/releases):
- Run cmd_windows.bat
- pip install https://github.com/turboderp/exllamav2/releases/download/v0.0.16/exllamav2-0.0.16+cu121-cp311-cp311-win_amd64.whl (or whichever wheel is correct for your setup)
- Use the correct instruction template: https://pastebin.com/FfzkX7Zm (from the reddit post above, thanks to kiselsa!)
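If you want to confirm the manual upgrade actually took effect, here is a minimal sketch of a version check, run inside the webui's Python environment after cmd_windows.bat (nothing webui-specific is assumed):
# Sanity check that the wheel install replaced the bundled exllamav2.
from importlib.metadata import version
print(version("exllamav2"))  # should report 0.0.16 after the manual install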
The quants are here:
https://huggingface.co/turboderp/command-r-v01-35B-exl2
I can run it on my AMD machines but not on Intel. On all Intel (i9) machines I get the error: Error: exception error loading model architecture: unknown model architecture: 'command-r'
All of them have 64 GB RAM + a 12 GB Nvidia GPU.
UPDATE: I reinstalled Ollama. It works for me now.
They now also released a larger, 104B parameter model: C4AI Command R+
c4ai-command-r-v01 now loads in the Ooba dev branch since 3952560da82d383f8d6dfe8a848925802e417a20.
However, it seems to randomly output different languages (or gibberish?), maybe due to a wrong/unknown chat format? Other times it seems to work fine.
I unfortunately get a segmentation fault every time with the new llama_cpp_python...
Yeah, same here, and not just on Command-r, so I've reverted llama-cpp-python for now. Should be fixed according to abetlen, but I guess it isn't after all.
I get a segfault on exllama2[_hf] too when loading the command-r-plus exl2. Other exl2 models work fine. (Edit: both on the dev branch.) So that points to a shared dependency of the command-r architecture (I still know nothing about architectures).
Edit 2: Building exllamav2 from upstream works quite well. A hint of repetition with default instruct settings on Divine Intellect, perhaps. I'll try GGUF next.
Edit 3: Got GGUF working as well. I built llama-cpp-python with the latest changes from https://github.com/ggerganov/llama.cpp/pull/6491 using CMAKE_ARGS="-DLLAMA_CUDA=ON" FORCE_CMAKE=1 pip3 install -e . (I had to copy some libmvec library to my lib directory too). With Q4 and 45 layers offloaded I got 1~2 t/s. However: repetition (again using Divine Intellect; I should probably read the model card).
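For reference, a minimal standalone sketch of loading a GGUF with a locally built llama-cpp-python; the filename, layer count and context size below are placeholders rather than the exact setup from this comment:
# Quick standalone test of a locally built llama-cpp-python with a command-r family GGUF.
from llama_cpp import Llama
llm = Llama(
    model_path="models/c4ai-command-r-plus-Q4_K_M.gguf",  # hypothetical filename
    n_gpu_layers=45,  # partial offload, as mentioned above
    n_ctx=8192,       # keep well below the model's 131072 maximum to fit in memory
)
print(llm("Hello, how are you?", max_tokens=32)["choices"][0]["text"])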
c4ai-command-r-plus works if you bump exllamav2 up to 0.0.18. That may also fix support for c4ai-command-r-v01.
Notably, I've not been able to get either model working inside text-generation-webui with regular transformers. While it will load, it only outputs gibberish for me (repeated words).
I'm running snapshot-2024-04-07 and changed requirements.txt as follows:
https://github.com/turboderp/exllamav2/releases/download/v0.0.18/exllamav2-0.0.18+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/turboderp/exllamav2/releases/download/v0.0.18/exllamav2-0.0.18+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
https://github.com/turboderp/exllamav2/releases/download/v0.0.18/exllamav2-0.0.18+cu121-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/turboderp/exllamav2/releases/download/v0.0.18/exllamav2-0.0.18+cu121-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
https://github.com/turboderp/exllamav2/releases/download/v0.0.18/exllamav2-0.0.18-py3-none-any.whl; platform_system == "Linux" and platform_machine != "x86_64"
Chat template:
{{- '<BOS_TOKEN>' -}}
{%- for message in messages %}
{%- if message['role'] == 'system' -%}
{%- if message['content'] -%}
{{- '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + message['content'] + '<|END_OF_TURN_TOKEN|>' -}}
{%- endif -%}
{%- if user_bio -%}
{{- '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + user_bio + '<|END_OF_TURN_TOKEN|>' -}}
{%- endif -%}
{%- else -%}
{%- if message['role'] == 'user' -%}
{{- '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + name1 + ': ' + message['content'] + '<|END_OF_TURN_TOKEN|>' -}}
{%- else -%}
{{- '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' + name2 + ': ' + message['content'] + '<|END_OF_TURN_TOKEN|>' -}}
{%- endif -%}
{%- endif -%}
{%- endfor -%}
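You can sanity-check the template above outside the webui by rendering it with plain jinja2; name1, name2 and user_bio are the variables it expects, and the filename below is just a placeholder for wherever you saved the template:
# Render the chat template with jinja2 to eyeball the token layout before using it in the webui.
from jinja2 import Template
template_str = open("command_r_chat_template.jinja").read()  # placeholder path to the template above
rendered = Template(template_str).render(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    name1="User",
    name2="Assistant",
    user_bio="",
)
print(rendered)  # expect <BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>...<|END_OF_TURN_TOKEN|>...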
Update: c4ai-command-r-v01 does not work due to a tokenizer config clash:
Traceback (most recent call last):
File "/home/app/text-generation-webui/modules/ui_model_menu.py", line 245, in load_model_wrapper
shared.model, shared.tokenizer = load_model(selected_model, loader)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/app/text-generation-webui/modules/models.py", line 95, in load_model
tokenizer = load_tokenizer(model_name, model)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/app/text-generation-webui/modules/models.py", line 117, in load_tokenizer
tokenizer = AutoTokenizer.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/app/text-generation-webui/installer_files/env/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py", line 818, in from_pretrained
tokenizer_class = get_class_from_dynamic_module(class_ref, pretrained_model_name_or_path, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/app/text-generation-webui/installer_files/env/lib/python3.11/site-packages/transformers/dynamic_module_utils.py", line 501, in get_class_from_dynamic_module
return get_class_in_module(class_name, final_module)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/app/text-generation-webui/installer_files/env/lib/python3.11/site-packages/transformers/dynamic_module_utils.py", line 201, in get_class_in_module
module = importlib.machinery.SourceFileLoader(name, module_path).load_module()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "", line 605, in _check_name_wrapper
File "", line 1120, in load_module
File "", line 945, in load_module
File "", line 290, in _load_module_shim
File "", line 721, in _load
File "", line 690, in _load_unlocked
File "", line 940, in exec_module
File "", line 241, in _call_with_frames_removed
File "/home/app/text-generation-webui/.cache/huggingface/modules/transformers_modules/command-r-v01-35B-exl2/tokenization_cohere_fast.py", line 31, in
from .configuration_cohere import CohereConfig
File "/home/app/text-generation-webui/.cache/huggingface/modules/transformers_modules/command-r-v01-35B-exl2/configuration_cohere.py", line 159, in
AutoConfig.register("cohere", CohereConfig)
File "/home/app/text-generation-webui/installer_files/env/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 1191, in register
CONFIG_MAPPING.register(model_type, config, exist_ok=exist_ok)
File "/home/app/text-generation-webui/installer_files/env/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 885, in register
raise ValueError(f"'{key}' is already used by a Transformers config, pick another name.")
ValueError: 'cohere' is already used by a Transformers config, pick another name.
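An untested guess at a workaround: recent transformers releases already ship native Cohere support (added around 4.39), so the tokenization_cohere_fast.py / configuration_cohere.py bundled with the quant repo re-register a 'cohere' config that now exists built-in, hence the clash. Deleting those two files from the model folder, or loading the tokenizer without trust_remote_code so the built-in classes are used, might get around it:
# Untested sketch: skip the repo's custom tokenizer code and rely on the
# Cohere classes built into recent transformers versions.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "models/command-r-v01-35B-exl2",  # local path to the exl2 quant
    trust_remote_code=False,          # don't execute tokenization_cohere_fast.py
)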
Also, I suspect there is something funky going on with the context memory for c4ai-command-r-plus in text-generation-webui. I was only able to achieve a context of 9k with the 5.0bpw exl2 quant.
@divine-taco Same here: it repeats gibberish using transformers 8-bit and 4-bit, and I have tried a lot of different settings and parameter changes. I can load it via transformers, but it will only output gibberish.
Command R+ support was just added to llama.cpp: https://github.com/ggerganov/llama.cpp/pull/6491
Has anyone got the 35B to run with oobabooga on a Mac?
PSA: the dev branch now has a fix for this. For the gibberish, make sure to use the (now default) min_p preset.
Hmm, I guess the fix was just for the 35B version, not the Plus version? I grabbed the dev version and tried it out without any change to the output for the c4ai-command-r-plus model.
https://github.com/oobabooga/text-generation-webui/issues/5838
@RandomInternetPreson Sorry, I was probably too eager. I only retested the llama.cpp quants (exl2 was already working fine, especially since the min_p update). You are using the full model quantized on-the-fly with bitsandbytes, right? I'll try to reproduce the issue, but I'm not sure if I have enough memory. (Edit: I have only been testing command-r-plus; I haven't gotten around to the 35B model or the new Mistral models yet.)
@randoentity No problem, I figured out my issue: https://github.com/oobabooga/text-generation-webui/issues/5838#issuecomment-2053670500
Perhaps it will help others who encounter this in the future.
It's a little annoying that I still can't run the SOTA 35B model in the most popular webUI.
I don't think it is an issue with textgen; it's an issue with transformers. You can update to the dev version like I did and see if that fixes your issue.
It's a little annoying that I still can't run the SOTA 35B model in the most popular webUI.
I'm running it just fine. Let me know if you need any help. The only thing I can't figure out is how to increase rope_freq_base to 8,000,000 in the GUI. It still runs fine though
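Regarding rope_freq_base: the GGUF metadata in the log above already carries command-r.rope.freq_base = 8000000, so llama.cpp should pick it up by itself. If you still want to force it, llama-cpp-python exposes it as a constructor argument (the values below are only illustrative):
# Forcing the RoPE base explicitly when loading through llama-cpp-python;
# normally unnecessary, since the GGUF metadata already sets it to 8e6.
from llama_cpp import Llama
llm = Llama(
    model_path="models/c4ai-command-r-v01-Q4_K_M.gguf",
    rope_freq_base=8_000_000.0,  # matches command-r.rope.freq_base in the GGUF
    n_ctx=8192,
    n_gpu_layers=-1,  # offload everything; adjust for your VRAM
)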
On my Mac, I recently pulled the latest oobabooga version and tried running this model. This is the only model that made the entire laptop freeze, and I had to forcefully restart it.
Is it working for other Mac users? I tried the 35B; I could run 70B models on this laptop with the same CLI arguments, but maybe this model requires different CLI flags or something?
I couldn't get it running on Linux with a 7900 XTX; I tried both transformers and llama.cpp.
I have it running on Linux with a 4090, using llama.cpp through ooba. Good luck with AMD, though.
It's a little annoying that I still can't run the SOTA 35B model in the most popular webUI.
I'm running it just fine. Let me know if you need any help. The only thing I can't figure out is how to increase rope_freq_base to 8,000,000 in the GUI. It still runs fine though
Yeah, I can't get it to run for some reason (even on the dev branch).
I can answer any specific questions you might have
Are you using GGUF or exllama?
GGUF
I reinstalled the webui and still get the same error. I downloaded a new GGUF and the result is the same:
21:25:07-533993 ERROR Failed to load the model.
Traceback (most recent call last):
File "/home/fenfel/text-gen-install/text-generation-webui/modules/ui_model_menu.py", line 248, in load_model_wrapper
shared.model, shared.tokenizer = load_model(selected_model, loader)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fenfel/text-gen-install/text-generation-webui/modules/models.py", line 94, in load_model
output = load_func_map[loader](model_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fenfel/text-gen-install/text-generation-webui/modules/models.py", line 271, in llamacpp_loader
model, tokenizer = LlamaCppModel.from_pretrained(model_file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fenfel/text-gen-install/text-generation-webui/modules/llamacpp_model.py", line 102, in from_pretrained
result.model = Llama(**params)
^^^^^^^^^^^^^^^
File "/home/fenfel/text-gen-install/text-generation-webui/installer_files/env/lib/python3.11/site-packages/llama_cpp_cuda/llama.py", line 337, in __init__
self._ctx = _LlamaContext(
^^^^^^^^^^^^^^
File "/home/fenfel/text-gen-install/text-generation-webui/installer_files/env/lib/python3.11/site-packages/llama_cpp_cuda/_internals.py", line 265, in __init__
raise ValueError("Failed to create llama_context")
ValueError: Failed to create llama_context
Exception ignored in: <function LlamaCppModel.__del__ at 0x7f0114e07b00>
Traceback (most recent call last):
File "/home/fenfel/text-gen-install/text-generation-webui/modules/llamacpp_model.py", line 58, in __del__
del self.model
^^^^^^^^^^
AttributeError: 'LlamaCppModel' object has no attribute 'model'
I asked llama-3-70b-instruct and it basically said it's a common, generic error. It suggested trying to run it on the CPU and asked whether I have enough memory.
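"Failed to create llama_context" often means the KV cache for the requested context did not fit in memory, and this model advertises a 131072-token context in its metadata. A hedged sketch for narrowing that down outside the webui (the values are examples only); lowering n_ctx in the llama.cpp loader settings should have the same effect inside the webui:
# Try loading the same GGUF directly with a small context to check whether the failure is memory-related.
from llama_cpp import Llama
llm = Llama(
    model_path="models/c4ai-command-r-v01-Q4_K_M.gguf",
    n_ctx=4096,       # far below the 131072 maximum; shrinks the KV cache
    n_gpu_layers=20,  # reduce further if VRAM is still exhausted
)
print(llm("Test", max_tokens=8)["choices"][0]["text"])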