Problem loading llama.cpp models (tokenizer, I think)
Describe the bug
I'm no expert, but it's telling me there's a problem in C:\Users****\OneDrive\Desktop\chatgpts\oobabooga_windows\installer_files\env\lib\site-packages\llama_cpp\llama.py
Here's the key part of the error (the full traceback is in the Logs section below):

Loading vicuna fast...
llama.cpp weights detected: models\vicuna fast\ggml-vicuna-13b-1.1-q4_2.bin
llama.cpp: loading model from models\vicuna fast\ggml-vicuna-13b-1.1-q4_2.bin
error loading model: unrecognized tensor type 4
llama_init_from_file: failed to load model
File "...\site-packages\llama_cpp\llama.py", line 126, in tokenize
assert self.ctx is not None
AssertionError
P.S. The weights are fully downloaded, and older models still work. The model in the error is eachadea/ggml-vicuna-7b-1.1 (I think all the models in that repo are affected); I'm currently using eachadea/legacy-ggml-vicuna-13b-4bit. I also updated CUDA, though I don't know if that matters — I rarely use CUDA models because my GPU is so weak.
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
Clone eachadea/ggml-vicuna-7b-1.1 from HF into models
Attempt to load it
Get the error
Screenshot
https://user-images.githubusercontent.com/99485225/233832798-64b710c9-2fab-4953-b1ff-8b4b3e5b2f1c.mp4
Logs
Warning: --cai-chat is deprecated. Use --chat instead.
🔴 xformers not found! Please install it before trying to use it.
bin C:\Users\orijp\OneDrive\Desktop\chatgpts\oobabooga_windows\installer_files\env\lib\site-packages\bitsandbytes\libbitsandbytes_cuda117.dll
The following models are available:
1. eachadea_legacy-ggml-vicuna-13b-4bit
2. elinas_vicuna-13b-4bit
3. vicuna detail
4. vicuna fast
5. vicuna fast.zip
6. vicuna unscenscored
Which one do you want to load? 1-6
1
Loading eachadea_legacy-ggml-vicuna-13b-4bit...
llama.cpp weights detected: models\eachadea_legacy-ggml-vicuna-13b-4bit\ggml-vicuna-13b-4bit-rev1.bin
llama.cpp: loading model from models\eachadea_legacy-ggml-vicuna-13b-4bit\ggml-vicuna-13b-4bit-rev1.bin
llama_model_load_internal: format = ggjt v1 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 73.73 KB
llama_model_load_internal: mem required = 9807.47 MB (+ 1608.00 MB per state)
llama_init_from_file: kv self size = 1600.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
Loading the extension "gallery"... Ok.
Running on local URL: http://127.0.0.1:7861
To create a public link, set `share=True` in `launch()`.
Loading elinas_vicuna-13b-4bit...
Loading vicuna detail...
Loading vicuna fast...
llama.cpp weights detected: models\vicuna fast\ggml-vicuna-13b-1.1-q4_2.bin
llama.cpp: loading model from models\vicuna fast\ggml-vicuna-13b-1.1-q4_2.bin
error loading model: unrecognized tensor type 4
llama_init_from_file: failed to load model
AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
Traceback (most recent call last):
File "C:\Users\orijp\OneDrive\Desktop\chatgpts\oobabooga_windows\installer_files\env\lib\site-packages\gradio\routes.py", line 395, in run_predict
output = await app.get_blocks().process_api(
File "C:\Users\orijp\OneDrive\Desktop\chatgpts\oobabooga_windows\installer_files\env\lib\site-packages\gradio\blocks.py", line 1193, in process_api
result = await self.call_function(
File "C:\Users\orijp\OneDrive\Desktop\chatgpts\oobabooga_windows\installer_files\env\lib\site-packages\gradio\blocks.py", line 930, in call_function
prediction = await anyio.to_thread.run_sync(
File "C:\Users\orijp\OneDrive\Desktop\chatgpts\oobabooga_windows\installer_files\env\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "C:\Users\orijp\OneDrive\Desktop\chatgpts\oobabooga_windows\installer_files\env\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "C:\Users\orijp\OneDrive\Desktop\chatgpts\oobabooga_windows\installer_files\env\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
result = context.run(func, *args)
File "C:\Users\orijp\OneDrive\Desktop\chatgpts\oobabooga_windows\installer_files\env\lib\site-packages\gradio\utils.py", line 491, in async_iteration
return next(iterator)
File "C:\Users\orijp\OneDrive\Desktop\chatgpts\oobabooga_windows\text-generation-webui\modules\chat.py", line 229, in cai_chatbot_wrapper
for history in chatbot_wrapper(text, state):
File "C:\Users\orijp\OneDrive\Desktop\chatgpts\oobabooga_windows\text-generation-webui\modules\chat.py", line 150, in chatbot_wrapper
prompt = generate_chat_prompt(text, state, **kwargs)
File "C:\Users\orijp\OneDrive\Desktop\chatgpts\oobabooga_windows\text-generation-webui\modules\chat.py", line 43, in generate_chat_prompt
while i >= 0 and len(encode(''.join(rows))[0]) < max_length:
File "C:\Users\orijp\OneDrive\Desktop\chatgpts\oobabooga_windows\text-generation-webui\modules\text_generation.py", line 28, in encode
input_ids = shared.tokenizer.encode(str(prompt))
File "C:\Users\orijp\OneDrive\Desktop\chatgpts\oobabooga_windows\text-generation-webui\modules\llamacpp_model_alternative.py", line 38, in encode
return self.model.tokenize(string)
File "C:\Users\orijp\OneDrive\Desktop\chatgpts\oobabooga_windows\installer_files\env\lib\site-packages\llama_cpp\llama.py", line 126, in tokenize
assert self.ctx is not None
AssertionError
Loading vicuna fast...
llama.cpp weights detected: models\vicuna fast\ggml-vicuna-13b-1.1-q4_2.bin
llama.cpp: loading model from models\vicuna fast\ggml-vicuna-13b-1.1-q4_2.bin
error loading model: unrecognized tensor type 4
llama_init_from_file: failed to load model
[The same traceback as above repeats for each subsequent load attempt.]
System Info
GPU: NVIDIA 3050 Ti (mobile) with 4 GB of VRAM 😭
(for this reason I use CPU models)
I have the same problem :(
I could not make eachadea_legacy-ggml-vicuna-13b-4bit work, but eachadea_ggml-vicuna-13b-1.1 does. I have an old Precision M6700 with an 8-core Intel CPU and an NVIDIA Quadro K3000M with 2 GB of VRAM, on CUDA 11 and Ubuntu 22.04 LTS. Here is how I made it work:
- Install the NVIDIA (vendor) drivers. To check they are loaded, run lsmod | grep nvidia (several lines mentioning nvidia should appear). Then:
sudo apt install build-essential
sudo apt install nvidia-cuda-toolkit
nvcc --version  # check CUDA works
sudo apt install linux-headers-$(uname -r) -y
- Download and unzip oobabooga_linux.zip
- chmod 755 start_linux.sh and launch it
- I selected CPU when asked for my GPU
- I picked L) Manually specify a Hugging Face model
- At the Input> prompt I entered eachadea/ggml-vicuna-13b-1.1 to be downloaded; it takes a long time
- From the text-generation-webui subfolder, run: pip install -r requirements.txt
- Open http://127.0.0.1:7860/ in a browser and go to the "Model" tab. The settings should look like this: model type: gptj, only autodevices checked, model: ggml-vicuna-13b-1.1 (sorry, no screenshots allowed)
- No errors or warnings appear in the console:
llama.cpp: loading model from models/eachadea_ggml-vicuna-13b-1.1/ggml-vicuna-13b-1.1-q4_2.bin
llama_model_load_internal: format = ggjt v1 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 5 (mostly Q4_2)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 73.73 KB
llama_model_load_internal: mem required = 9807.47 MB (+ 1608.00 MB per state)
llama_init_from_file: kv self size = 1600.00 MB
AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
Output generated in 97.04 seconds (0.09 tokens/s, 9 tokens, context 35, seed 1714019742)
generate cache hit
Output generated in 69.96 seconds (0.17 tokens/s, 12 tokens, context 60, seed 128444163)
generate cache hit
Output generated in 183.96 seconds (0.24 tokens/s, 44 tokens, context 88, seed 195261206)
generate cache hit
Output generated in 107.61 seconds (0.16 tokens/s, 17 tokens, context 150, seed 1609617739)
generate cache hit
Output generated in 105.41 seconds (0.23 tokens/s, 24 tokens, context 184, seed 913912848)
generate cache hit
Output generated in 202.03 seconds (0.29 tokens/s, 58 tokens, context 223, seed 1784058273)
generate cache hit
Good luck!
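As an aside, you can tell in advance which quantization a GGML file uses by reading its header, instead of waiting for the loader to fail. Below is a minimal sketch that parses the ggjt v1 header fields in the order llama.cpp reads them; the ftype→name entries for 2 and 5 come straight from the "mostly Q4_0" / "mostly Q4_2" lines in the logs above, the other values and the helper name read_ggjt_ftype are my own assumptions about that era's format:

```python
import struct

# ftype values as printed by llama.cpp at the time; 2 and 5 appear in the
# logs above ("mostly Q4_0", "mostly Q4_2") — the rest are assumptions.
FTYPE_NAMES = {
    0: "all F32",
    1: "mostly F16",
    2: "mostly Q4_0",
    3: "mostly Q4_1",
    5: "mostly Q4_2",
}

GGJT_MAGIC = 0x67676A74  # the fourcc "ggjt" stored as a little-endian uint32


def read_ggjt_ftype(path):
    """Return (ftype, name) from a ggjt-format GGML model file (hypothetical helper)."""
    with open(path, "rb") as f:
        magic, version = struct.unpack("<2I", f.read(8))
        if magic != GGJT_MAGIC:
            raise ValueError("not a ggjt file (older GGML formats lay out the header differently)")
        # hparams in file order: n_vocab, n_embd, n_mult, n_head, n_layer, n_rot, ftype
        n_vocab, n_embd, n_mult, n_head, n_layer, n_rot, ftype = struct.unpack("<7i", f.read(28))
        return ftype, FTYPE_NAMES.get(ftype, f"unknown ({ftype})")
```

A q4_2 file should report ftype 5, which tells you up front that an old llama-cpp-python (0.1.36) will reject it.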
What CPU? Please add your CPU — AMD? Intel? ARM?
I have an AMD CPU, a Ryzen 7 4800H.
Try a more recent version of llama-cpp-python. Per https://github.com/oobabooga/text-generation-webui/blob/main/requirements.txt#L19, the pinned version is 0.1.36, which may not support q4_2 — that is why you get "error loading model: unrecognized tensor type 4".
Install a more recent version that supports q4_2:
pip freeze | grep llama
pip cache purge
pip install llama-cpp-python==0.1.39
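To confirm which version actually ended up in the environment (the webui installer keeps its own env, so it's easy to upgrade the wrong one), here is a quick stdlib-only check; the helper names are my own, and the 0.1.39 minimum is taken from the comment above:

```python
from importlib.metadata import PackageNotFoundError, version


def version_at_least(have: str, minimum: str) -> bool:
    """True if dotted numeric version `have` >= `minimum` (fine for 0.1.x-style versions)."""
    return tuple(int(x) for x in have.split(".")) >= tuple(int(x) for x in minimum.split("."))


def check_llama_cpp(minimum: str = "0.1.39") -> None:
    """Report whether the installed llama-cpp-python is new enough for q4_2 models."""
    try:
        have = version("llama-cpp-python")
    except PackageNotFoundError:
        print("llama-cpp-python is not installed in this environment")
        return
    ok = version_at_least(have, minimum)
    print(f"llama-cpp-python {have}: {'supports' if ok else 'predates'} q4_2 (needs >= {minimum})")
```

Run it inside the installer's env (e.g. via cmd_windows.bat), not your system Python, or you will check the wrong installation.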
Alternatively, install the development version from source:
cd ~/
rm -rf llama-cpp-python
git clone https://github.com/abetlen/llama-cpp-python
cd llama-cpp-python
sed -i 's/git@github.com:/https:\/\/github.com\//g' .gitmodules  # rewrite the submodule URL from SSH to HTTPS
git submodule update --init --recursive
pip3 uninstall -y llama-cpp-python
pip install scikit-build  # required by the build; without it setup.py develop fails
python3 setup.py develop
pip freeze | grep llama  # output:
-e git+https://github.com/abetlen/llama-cpp-python@9339929f56ca71adb97930679c710a2458f877bd#egg=llama_cpp_python
This worked for me for q5 support.
If you open .\oobabooga_windows\text-generation-webui\requirements.txt and replace the last two lines with
llama-cpp-python==0.1.39; platform_system != "Windows"
https://github.com/abetlen/llama-cpp-python/releases/download/v0.1.39/llama_cpp_python-0.1.39-cp310-cp310-win_amd64.whl; platform_system == "Windows"
then run the update script, it fixes the problem as well (at least it did for me).
https://github.com/oobabooga/text-generation-webui/pull/1651