
Says, "Is Typing..." But doesn't and resets

MikhaelLoo opened this issue 1 year ago · 10 comments

Describe the bug

It tries to chat with me but can't get out a single word; then it clears the screen and starts over. The first attempt sometimes takes a while, but subsequent attempts are really fast (as if pressing Generate were actually the Clear History button).

My machine is old, but I was hoping I could get by with slow performance. I'm not sure what the reset behavior means.

I've tried 3 or 4 models with various settings, but this behavior is consistent.

Is there an existing issue for this?

  • [X] I have searched the existing issues

Reproduction

Install. Set the flags: run_cmd("python server.py --chat --auto-devices --gpu-memory 3300MiB", environment=True)  # put your flags here!

Run with various Models.
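
For reference, in the one-click installer this flags line is edited in webui.py; a minimal sketch of the edit (run_cmd is the installer's own helper for running commands inside its conda environment):

    # webui.py (oobabooga one-click installer) -- the edited launch line from this report.
    run_cmd(
        "python server.py --chat --auto-devices --gpu-memory 3300MiB",
        environment=True,  # run inside the installer's conda environment
    )  # put your flags here!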

Screenshot

(Screenshot of the chat UI attached in the original issue.)

Logs

INFO:Gradio HTTP request redirected to localhost :)
bin D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\bitsandbytes\libbitsandbytes_cuda117_nocublaslt.dll
INFO:Loading the extension "gallery"...
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
ERROR:No model is loaded! Select one in the Model tab.
ERROR:No model is loaded! Select one in the Model tab.
INFO:Loading 4bit_WizardLM-7B-uncensored-GPTQ...
INFO:Found the following quantized model: models\4bit_WizardLM-7B-uncensored-GPTQ\WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors
INFO:Using the following device map for the quantized model:
INFO:Loaded the model in 86.58 seconds.

Traceback (most recent call last):
  File "D:\AI2\oobabooga_windows\text-generation-webui\modules\callbacks.py", line 73, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "D:\AI2\oobabooga_windows\text-generation-webui\modules\text_generation.py", line 259, in generate_with_callback
    shared.model.generate(**kwargs)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
    return self.sample(
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2524, in sample
    outputs = self(
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 687, in forward
    outputs = self.model(
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 577, in forward
    layer_outputs = decoder_layer(
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 196, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\accelerate\hooks.py", line 160, in new_forward
    args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\accelerate\hooks.py", line 280, in pre_forward
    set_module_tensor_to_device(module, name, self.execution_device, value=self.weights_map[name])
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\accelerate\utils\offload.py", line 123, in __getitem__
    return self.dataset[f"{self.prefix}{key}"]
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\accelerate\utils\offload.py", line 170, in __getitem__
    weight_info = self.index[key]
KeyError: 'model.layers.26.self_attn.q_proj.wf1'
Output generated in 1.69 seconds (0.00 tokens/s, 0 tokens, context 12, seed 986507526)
(The same traceback and KeyError repeat verbatim on each subsequent generation attempt.)
Output generated in 0.55 seconds (0.00 tokens/s, 0 tokens, context 48, seed 825094012)
Output generated in 0.72 seconds (0.00 tokens/s, 0 tokens, context 48, seed 1019664866)

System Info

Processor	Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz   3.41 GHz
Installed RAM	16.0 GB
Product ID	00330-50255-55524-AAOEM
System type	64-bit operating system, x64-based processor
Pen and touch	No pen or touch input is available for this display

Edition	Windows 10 Pro
Version	21H2
Installed on	‎7/‎24/‎2020
OS build	19044.2846
Experience	Windows Feature Experience Pack 120.2212.4190.0

NVIDIA GeForce GTX 750 Ti

MikhaelLoo avatar May 11 '23 13:05 MikhaelLoo

You have a really small amount of VRAM; try with --pre_layer instead.

Ph0rk0z avatar May 11 '23 15:05 Ph0rk0z

@Ph0rk0z I tried run_cmd("python server.py --chat --auto-devices --gpu-memory 3300MiB --pre_layer 3", environment=True)

and am getting the same behavior. Any other ideas? Should I try a different number of pre layers?

MikhaelLoo avatar May 11 '23 15:05 MikhaelLoo

I tried --cpu and got a different log error.

bin D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\bitsandbytes\libbitsandbytes_cuda117_nocublaslt.dll
INFO:Loading the extension "gallery"...
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
ERROR:No model is loaded! Select one in the Model tab.
INFO:Loading 4bit_WizardLM-7B-uncensored-GPTQ...
INFO:Found the following quantized model: models\4bit_WizardLM-7B-uncensored-GPTQ\WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors
INFO:Loaded the model in 2.82 seconds.

Traceback (most recent call last):
  File "D:\AI2\oobabooga_windows\text-generation-webui\modules\callbacks.py", line 73, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "D:\AI2\oobabooga_windows\text-generation-webui\modules\text_generation.py", line 259, in generate_with_callback
    shared.model.generate(**kwargs)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
    return self.sample(
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2524, in sample
    outputs = self(
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 687, in forward
    outputs = self.model(
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 577, in forward
    layer_outputs = decoder_layer(
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 196, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI2\oobabooga_windows\text-generation-webui\repositories\GPTQ-for-LLaMa\quant.py", line 426, in forward
    quant_cuda.vecquant4matmul(x, self.qweight, y, self.scales, self.qzeros, self.groupsize)
RuntimeError: t == DeviceType::CUDA INTERNAL ASSERT FAILED at "G:\\F\\Projects\\AI\\text-generation-webui\\GPTQ\\venv\\env\\lib\\site-packages\\torch\\include\\c10/cuda/impl/CUDAGuardImpl.h":25, please report a bug to PyTorch.
Output generated in 1.06 seconds (0.00 tokens/s, 0 tokens, context 8, seed 843130022)

MikhaelLoo avatar May 11 '23 23:05 MikhaelLoo

Don't set GPU memory with pre-layer. I'm not sure that GPTQ can run with CPU.

Ph0rk0z avatar May 12 '23 15:05 Ph0rk0z
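
A sketch of the corrected webui.py line per that advice: --gpu-memory is dropped and --pre_layer is used instead. The value 10 is purely illustrative, and --auto-devices is omitted on the assumption it isn't needed once --pre_layer controls the split:

    # webui.py -- sketch only; 10 is a guess, not a recommendation.
    # --pre_layer N keeps the first N transformer layers on the GPU
    # and offloads the rest to the CPU, so tune N to what the card holds.
    run_cmd(
        "python server.py --chat --pre_layer 10",
        environment=True,
    )  # put your flags here!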

Use GGML for CPU inference. Try the WizardLM-7B-uncensored.ggml.q4_0 model: create a folder in your models folder called WizardLM-7B-uncensored-GGML and download https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GGML/resolve/previous_llama/WizardLM-7B-uncensored.ggml.q4_0.bin into it. Then start the server with --cpu --chat --model-menu and select the new model. The first question is always the slowest because the character's context has to be passed to the model; subsequent questions will be faster. (A scripted version of these steps is sketched below.)

m-spangenberg avatar May 13 '23 11:05 m-spangenberg
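
A minimal sketch of those steps as a stand-alone Python script, using only the standard library (folder name and URL are taken from the comment above; run it from the text-generation-webui directory):

    import urllib.request
    from pathlib import Path

    # Create the model folder inside the webui's models directory.
    model_dir = Path("models") / "WizardLM-7B-uncensored-GGML"
    model_dir.mkdir(parents=True, exist_ok=True)

    # Fetch the previous_llama-branch GGML file into it.
    url = ("https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GGML"
           "/resolve/previous_llama/WizardLM-7B-uncensored.ggml.q4_0.bin")
    urllib.request.urlretrieve(url, model_dir / "WizardLM-7B-uncensored.ggml.q4_0.bin")

    # Then launch: python server.py --cpu --chat --model-menu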

Don't set GPU memory with pre-layer. I'm not sure that GPTQ can run with CPU.

I removed GPU memory and ended up with the same result.

Try the WizardLM-7B-uncensored.ggml.q4_0 model.

I downloaded the model and tried to load it. Here is the result:

bin D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\bitsandbytes\libbitsandbytes_cuda117_nocublaslt.dll
INFO:Loading the extension "gallery"...
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Downloading the model to models\TheBloke_WizardLM-7B-uncensored-GGML
100%|██████████████████████████████████████████████████████████████████████████████████████████| 3.86k /3.86k  966kiB/s
100%|█████████████████████████████████████████████████████████████████████████████████████████| 4.21G /4.21G  28.6MiB/s
100%|█████████████████████████████████████████████████████████████████████████████████████████| 4.63G /4.63G  29.2MiB/s
100%|█████████████████████████████████████████████████████████████████████████████████████████| 5.06G /5.06G  29.1MiB/s
100%|█████████████████████████████████████████████████████████████████████████████████████████| 7.58G /7.58G  29.3MiB/s
INFO:Loading TheBloke_WizardLM-7B-uncensored-GGML...
INFO:llama.cpp weights detected: models\TheBloke_WizardLM-7B-uncensored-GGML\WizardLM-7B-uncensored.ggml.q4_0.bin

llama.cpp: loading model from models\TheBloke_WizardLM-7B-uncensored-GGML\WizardLM-7B-uncensored.ggml.q4_0.bin
error loading model: unknown (magic, version) combination: 67676a74, 00000002; is this really a GGML file?
llama_init_from_file: failed to load model

In the WebUI I see:

Traceback (most recent call last):
  File "D:\AI2\oobabooga_windows\text-generation-webui\server.py", line 67, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "D:\AI2\oobabooga_windows\text-generation-webui\modules\models.py", line 142, in load_model
    model, tokenizer = LlamaCppModel.from_pretrained(model_file)
  File "D:\AI2\oobabooga_windows\text-generation-webui\modules\llamacpp_model.py", line 32, in from_pretrained
    self.model = Llama(**params)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\llama_cpp\llama.py", line 159, in __init__
    assert self.ctx is not None
AssertionError

On another front, I keep trying to run update_windows.bat and keep seeing CUDA errors, which I've been feeding to ChatGPT for suggestions. Based on those, I installed CUDA 11.7 and Visual Studio and updated my NVIDIA drivers, but the CUDA update still fails.

I guess I'm going to keep experimenting. All guidance welcome.

MikhaelLoo avatar May 13 '23 15:05 MikhaelLoo

Did you make sure to download only the model I suggested, the one in the previous_llama branch of the repo? The error you're getting suggests you downloaded the files from the main branch. The newer GGML models require a much more recent version of llama.cpp, one which isn't part of the web-ui yet.

m-spangenberg avatar May 13 '23 16:05 m-spangenberg
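
The "(magic, version)" pair in the earlier error is just the first eight bytes of the file: llama.cpp reads a 4-byte magic and a 4-byte version and refuses formats it doesn't know. A hedged sketch for checking which flavor a downloaded .bin is (header layout assumed from llama.cpp loaders of that era; 0x67676a74 spells "ggjt", the newer format that tripped the error above):

    import struct
    import sys

    # Read the (magic, version) header that llama.cpp validates on load.
    with open(sys.argv[1], "rb") as f:
        magic, version = struct.unpack("<II", f.read(8))

    # The big-endian bytes of the little-endian-read magic spell the
    # ASCII tag, e.g. 0x67676a74 -> "ggjt".
    tag = magic.to_bytes(4, "big").decode("ascii", "replace")
    print(f"magic=0x{magic:08x} ({tag}), version={version}")

Run against the file from the main branch, this should print magic=0x67676a74 (ggjt), version=2, matching the error message above.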

Ahhhhh... I see. I'll give that a go. I didn't realize the difference. Thanks! :)

MikhaelLoo avatar May 13 '23 16:05 MikhaelLoo

That model worked for me! Thanks! So I need to stick to GGML models built for the previous llama.cpp format. I guess I need a guide that helps me pick models compatible with the current web-ui.

(Screenshot of the working chat attached in the original comment.)

MikhaelLoo avatar May 13 '23 16:05 MikhaelLoo

That's great! Glad you could get it working.

For more models that work with the version of llama.cpp currently bundled with the web UI, see: https://github.com/oobabooga/text-generation-webui/issues/2020#issuecomment-1546656696

m-spangenberg avatar May 13 '23 16:05 m-spangenberg

This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.

github-actions[bot] avatar Jun 12 '23 23:06 github-actions[bot]