text-generation-webui
Says, "Is Typing..." But doesn't and resets
Describe the bug
It tries to chat with me but can't get out a single word; it then clears the screen and starts over. The first attempt sometimes takes a while, but subsequent attempts are really fast (as if generating were actually the Clear History button).
My machine is old, but I was hoping I could get by with slow performance. I'm not sure what the reset behavior means.
I've tried 3 or 4 models with various settings, but this behavior is consistent.
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
Install. Set flags:
run_cmd("python server.py --chat --auto-devices --gpu-memory 3300MiB", environment=True) # put your flags here!
Run with various models.
Screenshot
Logs
INFO:Gradio HTTP request redirected to localhost :)
bin D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\bitsandbytes\libbitsandbytes_cuda117_nocublaslt.dll
INFO:Loading the extension "gallery"...
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
ERROR:No model is loaded! Select one in the Model tab.
ERROR:No model is loaded! Select one in the Model tab.
INFO:Loading 4bit_WizardLM-7B-uncensored-GPTQ...
INFO:Found the following quantized model: models\4bit_WizardLM-7B-uncensored-GPTQ\WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors
INFO:Using the following device map for the quantized model:
INFO:Loaded the model in 86.58 seconds.
Traceback (most recent call last):
File "D:\AI2\oobabooga_windows\text-generation-webui\modules\callbacks.py", line 73, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "D:\AI2\oobabooga_windows\text-generation-webui\modules\text_generation.py", line 259, in generate_with_callback
shared.model.generate(**kwargs)
File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
return self.sample(
File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2524, in sample
outputs = self(
File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 687, in forward
outputs = self.model(
File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 577, in forward
layer_outputs = decoder_layer(
File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 292, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 196, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\accelerate\hooks.py", line 160, in new_forward
args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\accelerate\hooks.py", line 280, in pre_forward
set_module_tensor_to_device(module, name, self.execution_device, value=self.weights_map[name])
File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\accelerate\utils\offload.py", line 123, in __getitem__
return self.dataset[f"{self.prefix}{key}"]
File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\accelerate\utils\offload.py", line 170, in __getitem__
weight_info = self.index[key]
KeyError: 'model.layers.26.self_attn.q_proj.wf1'
Output generated in 1.69 seconds (0.00 tokens/s, 0 tokens, context 12, seed 986507526)
[The identical traceback repeats for the next two attempts, each ending in KeyError: 'model.layers.26.self_attn.q_proj.wf1']
Output generated in 0.55 seconds (0.00 tokens/s, 0 tokens, context 48, seed 825094012)
Output generated in 0.72 seconds (0.00 tokens/s, 0 tokens, context 48, seed 1019664866)
System Info
Processor Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz 3.41 GHz
Installed RAM 16.0 GB
Product ID 00330-50255-55524-AAOEM
System type 64-bit operating system, x64-based processor
Pen and touch No pen or touch input is available for this display
Edition Windows 10 Pro
Version 21H2
Installed on 7/24/2020
OS build 19044.2846
Experience Windows Feature Experience Pack 120.2212.4190.0
NVIDIA GeForce GTX 750 Ti
That's a really small amount of VRAM. Try --pre_layer instead.
@Ph0rk0z I tried run_cmd("python server.py --chat --auto-devices --gpu-memory 3300MiB --pre_layer 3", environment=True)
and am getting the same behavior. Any other ideas? Should I try a different --pre_layer value?
I also tried --cpu and got a different error in the log:
bin D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\bitsandbytes\libbitsandbytes_cuda117_nocublaslt.dll
INFO:Loading the extension "gallery"...
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
ERROR:No model is loaded! Select one in the Model tab.
INFO:Loading 4bit_WizardLM-7B-uncensored-GPTQ...
INFO:Found the following quantized model: models\4bit_WizardLM-7B-uncensored-GPTQ\WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors
INFO:Loaded the model in 2.82 seconds.
Traceback (most recent call last):
File "D:\AI2\oobabooga_windows\text-generation-webui\modules\callbacks.py", line 73, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "D:\AI2\oobabooga_windows\text-generation-webui\modules\text_generation.py", line 259, in generate_with_callback
shared.model.generate(**kwargs)
File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
return self.sample(
File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2524, in sample
outputs = self(
File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 687, in forward
outputs = self.model(
File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 577, in forward
layer_outputs = decoder_layer(
File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 292, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 196, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\AI2\oobabooga_windows\text-generation-webui\repositories\GPTQ-for-LLaMa\quant.py", line 426, in forward
quant_cuda.vecquant4matmul(x, self.qweight, y, self.scales, self.qzeros, self.groupsize)
RuntimeError: t == DeviceType::CUDA INTERNAL ASSERT FAILED at "G:\\F\\Projects\\AI\\text-generation-webui\\GPTQ\\venv\\env\\lib\\site-packages\\torch\\include\\c10/cuda/impl/CUDAGuardImpl.h":25, please report a bug to PyTorch.
Output generated in 1.06 seconds (0.00 tokens/s, 0 tokens, context 8, seed 843130022)
Don't set --gpu-memory together with --pre_layer. I'm not sure that GPTQ can run on the CPU.
Use GGML for CPU inference. Try the WizardLM-7B-uncensored.ggml.q4_0 model: create a folder called WizardLM-7B-uncensored-GGML inside your models folder, download
https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GGML/resolve/previous_llama/WizardLM-7B-uncensored.ggml.q4_0.bin
into it, then start the server with --cpu --chat --model-menu and select the new model. The first question is always the slowest because the character's context has to be passed to the model; subsequent questions will be faster.
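For concreteness, a sketch of those steps as a small Python snippet run from the text-generation-webui folder (the folder name and URL are the ones given above; urllib is just one way to fetch the file):

import os
import urllib.request

# Create the model folder that the webui will list in its model menu.
folder = os.path.join("models", "WizardLM-7B-uncensored-GGML")
os.makedirs(folder, exist_ok=True)

# Fetch the q4_0 GGML file from the previous_llama branch.
url = ("https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GGML/"
       "resolve/previous_llama/WizardLM-7B-uncensored.ggml.q4_0.bin")
urllib.request.urlretrieve(url, os.path.join(folder, "WizardLM-7B-uncensored.ggml.q4_0.bin"))

After that, python server.py --cpu --chat --model-menu should offer the new folder.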
Don't set --gpu-memory together with --pre_layer. I'm not sure that GPTQ can run on the CPU.
I removed --gpu-memory and ended up with the same result.
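(Presumably the adjusted webui.py line then looked something like the following; the exact flag set is a guess based on the earlier command.)

run_cmd("python server.py --chat --auto-devices --pre_layer 3", environment=True)  # hypothetical: --gpu-memory dropped, --pre_layer kept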
Try the WizardLM-7B-uncensored.ggml.q4_0 model.
I downloaded the model and tried to load it. Here is the result:
bin D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\bitsandbytes\libbitsandbytes_cuda117_nocublaslt.dll
INFO:Loading the extension "gallery"...
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
Downloading the model to models\TheBloke_WizardLM-7B-uncensored-GGML
100%|██████████████████████████████████████████████████████████████████████████████████████████| 3.86k /3.86k 966kiB/s
100%|█████████████████████████████████████████████████████████████████████████████████████████| 4.21G /4.21G 28.6MiB/s
100%|█████████████████████████████████████████████████████████████████████████████████████████| 4.63G /4.63G 29.2MiB/s
100%|█████████████████████████████████████████████████████████████████████████████████████████| 5.06G /5.06G 29.1MiB/s
100%|█████████████████████████████████████████████████████████████████████████████████████████| 7.58G /7.58G 29.3MiB/s
INFO:Loading TheBloke_WizardLM-7B-uncensored-GGML...
INFO:llama.cpp weights detected: models\TheBloke_WizardLM-7B-uncensored-GGML\WizardLM-7B-uncensored.ggml.q4_0.bin
llama.cpp: loading model from models\TheBloke_WizardLM-7B-uncensored-GGML\WizardLM-7B-uncensored.ggml.q4_0.bin
error loading model: unknown (magic, version) combination: 67676a74, 00000002; is this really a GGML file?
llama_init_from_file: failed to load model
In the WebUI I see:
Traceback (most recent call last):
File "D:\AI2\oobabooga_windows\text-generation-webui\server.py", line 67, in load_model_wrapper
shared.model, shared.tokenizer = load_model(shared.model_name)
File "D:\AI2\oobabooga_windows\text-generation-webui\modules\models.py", line 142, in load_model
model, tokenizer = LlamaCppModel.from_pretrained(model_file)
File "D:\AI2\oobabooga_windows\text-generation-webui\modules\llamacpp_model.py", line 32, in from_pretrained
self.model = Llama(**params)
File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\llama_cpp\llama.py", line 159, in __init__
assert self.ctx is not None
AssertionError
On another front, I keep trying to run update_windows.bat and keep seeing CUDA errors. I've been feeding the messages to ChatGPT for suggestions; based on them I installed CUDA 11.7 and Visual Studio and updated my NVIDIA drivers, but the update still fails with CUDA errors.
I guess I'm going to keep experimenting. All guidance is welcome.
Did you make sure to download only the model I suggested, the one in the previous_llama branch of the repo? The error you're getting suggests you downloaded the files from the main branch. The newer GGML models require a much more recent version of llama.cpp, one which isn't part of the web-ui yet.
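For anyone hitting the same error: the magic 67676a74 in the log spells "ggjt", the newer versioned GGML container, while old-style files start with the magic "ggml" (0x67676d6c). A minimal sketch for checking which format a downloaded .bin file is, assuming the standard llama.cpp magic values (the helper name is made up):

import struct

GGML_MAGIC = 0x67676D6C  # "ggml": old unversioned format
GGJT_MAGIC = 0x67676A74  # "ggjt": newer versioned format, seen in the error above

def check_ggml_format(path):
    # The magic is written as a little-endian uint32 at the start of the file.
    with open(path, "rb") as f:
        magic = struct.unpack("<I", f.read(4))[0]
        if magic == GGJT_MAGIC:
            # Per the log above, the webui's bundled llama.cpp rejects (ggjt, 2);
            # the previous_llama-branch file is an earlier version it accepts.
            version = struct.unpack("<I", f.read(4))[0]
            return f"ggjt v{version}"
        if magic == GGML_MAGIC:
            return "old unversioned ggml"
        return f"unknown magic 0x{magic:08x}"

print(check_ggml_format(r"models\TheBloke_WizardLM-7B-uncensored-GGML\WizardLM-7B-uncensored.ggml.q4_0.bin"))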
Ahhhhh... I see. I'll give that a go. I didn't realize the difference. Thanks! :)
That model worked for me! Thanks! So I need to focus on GGML files built for the previous llama.cpp format. I guess I'm in need of a guide that helps me pick models compatible with the current web UI.
That's great! Glad you could get it working.
Here are more models that work with the current version of llama.cpp:
Also see: https://github.com/oobabooga/text-generation-webui/issues/2020#issuecomment-1546656696
This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.