
Says, "Is Typing..." But doesn't and resets

MikhaelLoo opened this issue 1 year ago · 10 comments

Describe the bug

It tries to chat with me but can't get out a single word; then it clears the screen and starts over. The first attempt sometimes takes a while, but subsequent attempts are really fast (as if pressing Generate were actually the Clear History button).

My machine is old, but I was hoping I could get by with slow performance. I'm not sure what the reset behavior means.

I've tried 3 or 4 models with various settings, but this behavior is consistent.

Is there an existing issue for this?

  • [X] I have searched the existing issues

Reproduction

Install. Set the flags: run_cmd("python server.py --chat --auto-devices --gpu-memory 3300MiB", environment=True)  # put your flags here!

Run with various Models.
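
For reference, in the one-click installer this flags line is edited in webui.py; a minimal sketch of the edit (run_cmd is the installer's own helper for running commands inside its conda environment):

    # webui.py (oobabooga one-click installer) -- the edited launch line from this report.
    run_cmd(
        "python server.py --chat --auto-devices --gpu-memory 3300MiB",
        environment=True,  # run inside the installer's conda environment
    )  # put your flags here!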

Screenshot

(Screenshot of the chat UI attached in the original issue.)

Logs

INFO:Gradio HTTP request redirected to localhost :)
bin D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\bitsandbytes\libbitsandbytes_cuda117_nocublaslt.dll
INFO:Loading the extension "gallery"...
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
ERROR:No model is loaded! Select one in the Model tab.
ERROR:No model is loaded! Select one in the Model tab.
INFO:Loading 4bit_WizardLM-7B-uncensored-GPTQ...
INFO:Found the following quantized model: models\4bit_WizardLM-7B-uncensored-GPTQ\WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors
INFO:Using the following device map for the quantized model:
INFO:Loaded the model in 86.58 seconds.

Traceback (most recent call last):
  File "D:\AI2\oobabooga_windows\text-generation-webui\modules\callbacks.py", line 73, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "D:\AI2\oobabooga_windows\text-generation-webui\modules\text_generation.py", line 259, in generate_with_callback
    shared.model.generate(**kwargs)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
    return self.sample(
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2524, in sample
    outputs = self(
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 687, in forward
    outputs = self.model(
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 577, in forward
    layer_outputs = decoder_layer(
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 196, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\accelerate\hooks.py", line 160, in new_forward
    args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\accelerate\hooks.py", line 280, in pre_forward
    set_module_tensor_to_device(module, name, self.execution_device, value=self.weights_map[name])
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\accelerate\utils\offload.py", line 123, in __getitem__
    return self.dataset[f"{self.prefix}{key}"]
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\accelerate\utils\offload.py", line 170, in __getitem__
    weight_info = self.index[key]
KeyError: 'model.layers.26.self_attn.q_proj.wf1'
Output generated in 1.69 seconds (0.00 tokens/s, 0 tokens, context 12, seed 986507526)
(The same traceback and KeyError repeat verbatim on each subsequent generation attempt.)
Output generated in 0.55 seconds (0.00 tokens/s, 0 tokens, context 48, seed 825094012)
Output generated in 0.72 seconds (0.00 tokens/s, 0 tokens, context 48, seed 1019664866)

System Info

Processor	Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz   3.41 GHz
Installed RAM	16.0 GB
Product ID	00330-50255-55524-AAOEM
System type	64-bit operating system, x64-based processor
Pen and touch	No pen or touch input is available for this display

Edition	Windows 10 Pro
Version	21H2
Installed on	‎7/‎24/‎2020
OS build	19044.2846
Experience	Windows Feature Experience Pack 120.2212.4190.0

NVIDIA GeForce GTX 750 Ti

MikhaelLoo avatar May 11 '23 13:05 MikhaelLoo

You have a really small amount of VRAM; try with --pre_layer instead.

Ph0rk0z avatar May 11 '23 15:05 Ph0rk0z

@Ph0rk0z I tried run_cmd("python server.py --chat --auto-devices --gpu-memory 3300MiB --pre_layer 3", environment=True)

and am getting the same behavior. Any other ideas? Should I try a different number of pre layers?

MikhaelLoo avatar May 11 '23 15:05 MikhaelLoo

I tried --cpu and got a different log error.

bin D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\bitsandbytes\libbitsandbytes_cuda117_nocublaslt.dll
INFO:Loading the extension "gallery"...
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
ERROR:No model is loaded! Select one in the Model tab.
INFO:Loading 4bit_WizardLM-7B-uncensored-GPTQ...
INFO:Found the following quantized model: models\4bit_WizardLM-7B-uncensored-GPTQ\WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors
INFO:Loaded the model in 2.82 seconds.

Traceback (most recent call last):
  File "D:\AI2\oobabooga_windows\text-generation-webui\modules\callbacks.py", line 73, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "D:\AI2\oobabooga_windows\text-generation-webui\modules\text_generation.py", line 259, in generate_with_callback
    shared.model.generate(**kwargs)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
    return self.sample(
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2524, in sample
    outputs = self(
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 687, in forward
    outputs = self.model(
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 577, in forward
    layer_outputs = decoder_layer(
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 196, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI2\oobabooga_windows\text-generation-webui\repositories\GPTQ-for-LLaMa\quant.py", line 426, in forward
    quant_cuda.vecquant4matmul(x, self.qweight, y, self.scales, self.qzeros, self.groupsize)
RuntimeError: t == DeviceType::CUDA INTERNAL ASSERT FAILED at "G:\\F\\Projects\\AI\\text-generation-webui\\GPTQ\\venv\\env\\lib\\site-packages\\torch\\include\\c10/cuda/impl/CUDAGuardImpl.h":25, please report a bug to PyTorch.
Output generated in 1.06 seconds (0.00 tokens/s, 0 tokens, context 8, seed 843130022)

MikhaelLoo avatar May 11 '23 23:05 MikhaelLoo

Don't set GPU memory with pre-layer. I'm not sure that GPTQ can run with CPU.

Ph0rk0z avatar May 12 '23 15:05 Ph0rk0z
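
A sketch of the corrected webui.py line per that advice: --gpu-memory is dropped and --pre_layer is used instead. The value 10 is purely illustrative, and --auto-devices is omitted on the assumption it isn't needed once --pre_layer controls the split:

    # webui.py -- sketch only; 10 is a guess, not a recommendation.
    # --pre_layer N keeps the first N transformer layers on the GPU
    # and offloads the rest to the CPU, so tune N to what the card holds.
    run_cmd(
        "python server.py --chat --pre_layer 10",
        environment=True,
    )  # put your flags here!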

Use GGML for CPU inference. Try the WizardLM-7B-uncensored.ggml.q4_0 model: create a folder in your models folder called WizardLM-7B-uncensored-GGML and download https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GGML/resolve/previous_llama/WizardLM-7B-uncensored.ggml.q4_0.bin into it. Then start the server with --cpu --chat --model-menu and select the new model. The first question is always the slowest because the character's context has to be passed to the model; subsequent questions will be faster. (A scripted version of these steps is sketched below.)

m-spangenberg avatar May 13 '23 11:05 m-spangenberg
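
A minimal sketch of those steps as a stand-alone Python script, using only the standard library (folder name and URL are taken from the comment above; run it from the text-generation-webui directory):

    import urllib.request
    from pathlib import Path

    # Create the model folder inside the webui's models directory.
    model_dir = Path("models") / "WizardLM-7B-uncensored-GGML"
    model_dir.mkdir(parents=True, exist_ok=True)

    # Fetch the previous_llama-branch GGML file into it.
    url = ("https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GGML"
           "/resolve/previous_llama/WizardLM-7B-uncensored.ggml.q4_0.bin")
    urllib.request.urlretrieve(url, model_dir / "WizardLM-7B-uncensored.ggml.q4_0.bin")

    # Then launch: python server.py --cpu --chat --model-menu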

Don't set GPU memory with pre-layer. I'm not sure that GPTQ can run with CPU.

I removed GPU memory and ended up with the same result.

Try the WizardLM-7B-uncensored.ggml.q4_0 model.

I downloaded the model and tried to load it. Here is the result:

bin D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\bitsandbytes\libbitsandbytes_cuda117_nocublaslt.dll
INFO:Loading the extension "gallery"...
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Downloading the model to models\TheBloke_WizardLM-7B-uncensored-GGML
100%|██████████████████████████████████████████████████████████████████████████████████████████| 3.86k /3.86k  966kiB/s
100%|█████████████████████████████████████████████████████████████████████████████████████████| 4.21G /4.21G  28.6MiB/s
100%|█████████████████████████████████████████████████████████████████████████████████████████| 4.63G /4.63G  29.2MiB/s
100%|█████████████████████████████████████████████████████████████████████████████████████████| 5.06G /5.06G  29.1MiB/s
100%|█████████████████████████████████████████████████████████████████████████████████████████| 7.58G /7.58G  29.3MiB/s
INFO:Loading TheBloke_WizardLM-7B-uncensored-GGML...
INFO:llama.cpp weights detected: models\TheBloke_WizardLM-7B-uncensored-GGML\WizardLM-7B-uncensored.ggml.q4_0.bin

llama.cpp: loading model from models\TheBloke_WizardLM-7B-uncensored-GGML\WizardLM-7B-uncensored.ggml.q4_0.bin
error loading model: unknown (magic, version) combination: 67676a74, 00000002; is this really a GGML file?
llama_init_from_file: failed to load model

In the WebUI I see:

Traceback (most recent call last):
  File "D:\AI2\oobabooga_windows\text-generation-webui\server.py", line 67, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "D:\AI2\oobabooga_windows\text-generation-webui\modules\models.py", line 142, in load_model
    model, tokenizer = LlamaCppModel.from_pretrained(model_file)
  File "D:\AI2\oobabooga_windows\text-generation-webui\modules\llamacpp_model.py", line 32, in from_pretrained
    self.model = Llama(**params)
  File "D:\AI2\oobabooga_windows\installer_files\env\lib\site-packages\llama_cpp\llama.py", line 159, in __init__
    assert self.ctx is not None
AssertionError

On another front, I keep trying to run update_windows.bat and keep seeing CUDA errors, which I've been feeding to ChatGPT for suggestions. Based on those, I installed CUDA 11.7 and Visual Studio and updated my NVIDIA drivers, but the CUDA update still fails.

I guess I'm going to keep experimenting. All guidance welcome.

MikhaelLoo avatar May 13 '23 15:05 MikhaelLoo

Did you make sure to download only the model I suggested, the one in the previous_llama branch of the repo? The error you're getting suggests you downloaded the files from the main branch. The newer GGML models require a much more recent version of llama.cpp, one which isn't part of the web-ui yet.

m-spangenberg avatar May 13 '23 16:05 m-spangenberg
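
The "(magic, version)" pair in the earlier error is just the first eight bytes of the file: llama.cpp reads a 4-byte magic and a 4-byte version and refuses formats it doesn't know. A hedged sketch for checking which flavor a downloaded .bin is (header layout assumed from llama.cpp loaders of that era; 0x67676a74 spells "ggjt", the newer format that tripped the error above):

    import struct
    import sys

    # Read the (magic, version) header that llama.cpp validates on load.
    with open(sys.argv[1], "rb") as f:
        magic, version = struct.unpack("<II", f.read(8))

    # The big-endian bytes of the little-endian-read magic spell the
    # ASCII tag, e.g. 0x67676a74 -> "ggjt".
    tag = magic.to_bytes(4, "big").decode("ascii", "replace")
    print(f"magic=0x{magic:08x} ({tag}), version={version}")

Run against the file from the main branch, this should print magic=0x67676a74 (ggjt), version=2, matching the error message above.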

Ahhhhh... I see. I'll give that a go. I didn't realize the difference. Thanks! :)

MikhaelLoo avatar May 13 '23 16:05 MikhaelLoo

That model worked for me! Thanks! So I need to stick to GGML models built for the previous llama.cpp format. I guess I need a guide that helps me pick models compatible with the current web-ui.

(Screenshot of the working chat attached in the original comment.)

MikhaelLoo avatar May 13 '23 16:05 MikhaelLoo

That's great! Glad you could get it working.

For more models that work with the version of llama.cpp currently bundled with the web UI, see: https://github.com/oobabooga/text-generation-webui/issues/2020#issuecomment-1546656696

m-spangenberg avatar May 13 '23 16:05 m-spangenberg

This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.

github-actions[bot] avatar Jun 12 '23 23:06 github-actions[bot]