text-generation-webui
KeyError: 'model.layers.21.self_attn.rotary_emb.cos_cached'
Describe the bug
Hello, every time I try to chat it says "Is typing..." and then the conversation restarts.
The model I'm trying to use is GPT-4 x Alpaca. I followed this tutorial (it is for Windows, though): https://youtu.be/nVC9D9fRyNU
I also cloned this repo into a folder called "repositories" inside the text-generation-webui folder, because I was getting ModuleNotFoundError: No module named 'llama_inference_offload': https://github.com/qwopqwop200/GPTQ-for-LLaMa
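For anyone hitting the same ModuleNotFoundError, the fix amounts to something like the following, run from inside the text-generation-webui folder (the "repositories" folder name is the webui's convention):
mkdir repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa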
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
Start with python server.py --auto-device --chat --wbits 4 --groupsize 128 --gpu-memory 4
Screenshot
No response
Logs
Traceback (most recent call last):
File "/mnt/Datos/sym-adriano/text-generation-webui/modules/callbacks.py", line 66, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "/mnt/Datos/sym-adriano/text-generation-webui/modules/text_generation.py", line 220, in generate_with_callback
shared.model.generate(**kwargs)
File "/home/adriano/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/adriano/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 1485, in generate
return self.sample(
File "/home/adriano/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 2524, in sample
outputs = self(
File "/home/adriano/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/adriano/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/adriano/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 687, in forward
outputs = self.model(
File "/home/adriano/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/adriano/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 577, in forward
layer_outputs = decoder_layer(
File "/home/adriano/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/adriano/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/adriano/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 292, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/adriano/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/adriano/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/adriano/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 203, in forward
cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
File "/home/adriano/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/adriano/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 160, in new_forward
args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
File "/home/adriano/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 280, in pre_forward
set_module_tensor_to_device(module, name, self.execution_device, value=self.weights_map[name])
File "/home/adriano/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/utils/offload.py", line 123, in __getitem__
return self.dataset[f"{self.prefix}{key}"]
File "/home/adriano/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/utils/offload.py", line 170, in __getitem__
weight_info = self.index[key]
KeyError: 'model.layers.21.self_attn.rotary_emb.cos_cached'
Output generated in 6.38 seconds (0.00 tokens/s, 0 tokens, context 43)
System Info
OS: Arch Linux
CPU: AMD Ryzen 5 2600
GPU: NVIDIA GeForce RTX 2060
RAM: 16 GB
Almost the same issue for me as well. I didn't have the clone issue, though. The model shows right up for selection and seems to load properly, but when I send it a query, it fails with this error and the UI resets, just like for the OP.
Start with python server.py --auto-device --chat --wbits 4 --groupsize 128 --gpu-memory 5500MiB --disk --no-cache
Traceback (most recent call last):
File "C:\Users\nigel\Development\ai\oobabooga-windows\text-generation-webui\modules\callbacks.py", line 66, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "C:\Users\nigel\Development\ai\oobabooga-windows\text-generation-webui\modules\text_generation.py", line 220, in generate_with_callback
shared.model.generate(**kwargs)
File "C:\Users\nigel\Development\ai\oobabooga-windows\installer_files\env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "C:\Users\nigel\Development\ai\oobabooga-windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
return self.sample(
File "C:\Users\nigel\Development\ai\oobabooga-windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2524, in sample
outputs = self(
File "C:\Users\nigel\Development\ai\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\nigel\Development\ai\oobabooga-windows\installer_files\env\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "C:\Users\nigel\Development\ai\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 687, in forward
outputs = self.model(
File "C:\Users\nigel\Development\ai\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\nigel\Development\ai\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 577, in forward
layer_outputs = decoder_layer(
File "C:\Users\nigel\Development\ai\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\nigel\Development\ai\oobabooga-windows\installer_files\env\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "C:\Users\nigel\Development\ai\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 292, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "C:\Users\nigel\Development\ai\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\nigel\Development\ai\oobabooga-windows\installer_files\env\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "C:\Users\nigel\Development\ai\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 196, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "C:\Users\nigel\Development\ai\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\nigel\Development\ai\oobabooga-windows\installer_files\env\lib\site-packages\accelerate\hooks.py", line 160, in new_forward
args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
File "C:\Users\nigel\Development\ai\oobabooga-windows\installer_files\env\lib\site-packages\accelerate\hooks.py", line 280, in pre_forward
set_module_tensor_to_device(module, name, self.execution_device, value=self.weights_map[name])
File "C:\Users\nigel\Development\ai\oobabooga-windows\installer_files\env\lib\site-packages\accelerate\utils\offload.py", line 123, in __getitem__
return self.dataset[f"{self.prefix}{key}"]
File "C:\Users\nigel\Development\ai\oobabooga-windows\installer_files\env\lib\site-packages\accelerate\utils\offload.py", line 170, in __getitem__
weight_info = self.index[key]
KeyError: 'model.layers.29.self_attn.q_proj.wf1'
Output generated in 7.45 seconds (0.00 tokens/s, 0 tokens, context 43)
OS: Windows 11
CPU: AMD Ryzen 9 4900HS
GPU: NVIDIA GeForce RTX 2060
RAM: 24 GB
I also tried with the Docker image in WSL2. Docker Compose came right up, but on the first attempt to generate some text I was hit with the error below. Note that the failing key names layer 21, which is the first layer the device map places on 'cpu'.
text-generation-webui-text-generation-webui-1 | Loading model ...
text-generation-webui-text-generation-webui-1 | Done.
text-generation-webui-text-generation-webui-1 | Using the following device map for the 4-bit model: {'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 'model.layers.9': 0, 'model.layers.10': 0, 'model.layers.11': 0, 'model.layers.12': 0, 'model.layers.13': 0, 'model.layers.14': 0, 'model.layers.15': 0, 'model.layers.16': 0, 'model.layers.17': 0, 'model.layers.18': 0, 'model.layers.19': 0, 'model.layers.20': 0, 'model.layers.21': 'cpu', 'model.layers.22': 'cpu', 'model.layers.23': 'cpu', 'model.layers.24': 'cpu', 'model.layers.25': 'cpu', 'model.layers.26': 'cpu', 'model.layers.27': 'cpu', 'model.layers.28': 'cpu', 'model.layers.29': 'cpu', 'model.layers.30': 'cpu', 'model.layers.31': 'cpu', 'model.layers.32': 'cpu', 'model.layers.33': 'cpu', 'model.layers.34': 'cpu', 'model.layers.35': 'cpu', 'model.layers.36': 'cpu', 'model.layers.37': 'cpu', 'model.layers.38': 'cpu', 'model.layers.39': 'cpu', 'model.norm': 'cpu', 'lm_head': 'cpu'}
text-generation-webui-text-generation-webui-1 | Loaded the model in 54.83 seconds.
text-generation-webui-text-generation-webui-1 | Running on local URL: http://0.0.0.0:7860
text-generation-webui-text-generation-webui-1 |
text-generation-webui-text-generation-webui-1 | To create a public link, set `share=True` in `launch()`.
text-generation-webui-text-generation-webui-1 | Traceback (most recent call last):
text-generation-webui-text-generation-webui-1 | File "/app/modules/callbacks.py", line 66, in gentask
text-generation-webui-text-generation-webui-1 | ret = self.mfunc(callback=_callback, **self.kwargs)
text-generation-webui-text-generation-webui-1 | File "/app/modules/text_generation.py", line 230, in generate_with_callback
text-generation-webui-text-generation-webui-1 | shared.model.generate(**kwargs)
text-generation-webui-text-generation-webui-1 | File "/app/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
text-generation-webui-text-generation-webui-1 | return func(*args, **kwargs)
text-generation-webui-text-generation-webui-1 | File "/app/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1485, in generate
text-generation-webui-text-generation-webui-1 | return self.sample(
text-generation-webui-text-generation-webui-1 | File "/app/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2524, in sample
text-generation-webui-text-generation-webui-1 | outputs = self(
text-generation-webui-text-generation-webui-1 | File "/app/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
text-generation-webui-text-generation-webui-1 | return forward_call(*args, **kwargs)
text-generation-webui-text-generation-webui-1 | File "/app/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
text-generation-webui-text-generation-webui-1 | output = old_forward(*args, **kwargs)
text-generation-webui-text-generation-webui-1 | File "/app/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 687, in forward
text-generation-webui-text-generation-webui-1 | outputs = self.model(
text-generation-webui-text-generation-webui-1 | File "/app/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
text-generation-webui-text-generation-webui-1 | return forward_call(*args, **kwargs)
text-generation-webui-text-generation-webui-1 | File "/app/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 577, in forward
text-generation-webui-text-generation-webui-1 | layer_outputs = decoder_layer(
text-generation-webui-text-generation-webui-1 | File "/app/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
text-generation-webui-text-generation-webui-1 | return forward_call(*args, **kwargs)
text-generation-webui-text-generation-webui-1 | File "/app/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
text-generation-webui-text-generation-webui-1 | output = old_forward(*args, **kwargs)
text-generation-webui-text-generation-webui-1 | File "/app/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 292, in forward
text-generation-webui-text-generation-webui-1 | hidden_states, self_attn_weights, present_key_value = self.self_attn(
text-generation-webui-text-generation-webui-1 | File "/app/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
text-generation-webui-text-generation-webui-1 | return forward_call(*args, **kwargs)
text-generation-webui-text-generation-webui-1 | File "/app/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
text-generation-webui-text-generation-webui-1 | output = old_forward(*args, **kwargs)
text-generation-webui-text-generation-webui-1 | File "/app/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 196, in forward
text-generation-webui-text-generation-webui-1 | query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
text-generation-webui-text-generation-webui-1 | File "/app/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
text-generation-webui-text-generation-webui-1 | return forward_call(*args, **kwargs)
text-generation-webui-text-generation-webui-1 | File "/app/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 160, in new_forward
text-generation-webui-text-generation-webui-1 | args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
text-generation-webui-text-generation-webui-1 | File "/app/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 280, in pre_forward
text-generation-webui-text-generation-webui-1 | set_module_tensor_to_device(module, name, self.execution_device, value=self.weights_map[name])
text-generation-webui-text-generation-webui-1 | File "/app/venv/lib/python3.10/site-packages/accelerate/utils/offload.py", line 123, in __getitem__
text-generation-webui-text-generation-webui-1 | return self.dataset[f"{self.prefix}{key}"]
text-generation-webui-text-generation-webui-1 | File "/app/venv/lib/python3.10/site-packages/accelerate/utils/offload.py", line 170, in __getitem__
text-generation-webui-text-generation-webui-1 | weight_info = self.index[key]
text-generation-webui-text-generation-webui-1 | KeyError: 'model.layers.21.self_attn.q_proj.wf1'
text-generation-webui-text-generation-webui-1 | Output generated in 3.61 seconds (0.00 tokens/s, 0 tokens, context 59, seed 1873277368)
Same here.
NVIDIA RTX 3070 Laptop GPU
To create a public link, set `share=True` in `launch()`.
Loading None...
Loading vicuna-13b-GPTQ-4bit-128g...
Found the following quantized model: models\vicuna-13b-GPTQ-4bit-128g\vicuna-13b-4bit-128g.safetensors
Loading model ...
Done.
Using the following device map for the quantized model: {'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 'model.layers.9': 0, 'model.layers.10': 0, 'model.layers.11': 0, 'model.layers.12': 0, 'model.layers.13': 0, 'model.layers.14': 0, 'model.layers.15': 0, 'model.layers.16': 0, 'model.layers.17': 0, 'model.layers.18': 0, 'model.layers.19': 0, 'model.layers.20': 0, 'model.layers.21': 0, 'model.layers.22': 0, 'model.layers.23': 0, 'model.layers.24': 0, 'model.layers.25': 0, 'model.layers.26': 0, 'model.layers.27': 0, 'model.layers.28': 0, 'model.layers.29': 0, 'model.layers.30': 0, 'model.layers.31': 0, 'model.layers.32': 0, 'model.layers.33': 0, 'model.layers.34': 0, 'model.layers.35': 0, 'model.layers.36': 0, 'model.layers.37': 0, 'model.layers.38': 'cpu', 'model.layers.39': 'cpu', 'model.norm': 'cpu', 'lm_head': 'cpu'}
Loaded the model in 19.53 seconds.
Traceback (most recent call last):
File "C:\Code\oobabooga-windows\oobabooga-windows\text-generation-webui\modules\callbacks.py", line 66, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "C:\Code\oobabooga-windows\oobabooga-windows\text-generation-webui\modules\text_generation.py", line 252, in generate_with_callback
shared.model.generate(**kwargs)
File "C:\Code\oobabooga-windows\oobabooga-windows\installer_files\env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "C:\Code\oobabooga-windows\oobabooga-windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
return self.sample(
File "C:\Code\oobabooga-windows\oobabooga-windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2524, in sample
outputs = self(
File "C:\Code\oobabooga-windows\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Code\oobabooga-windows\oobabooga-windows\installer_files\env\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "C:\Code\oobabooga-windows\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 687, in forward
outputs = self.model(
File "C:\Code\oobabooga-windows\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Code\oobabooga-windows\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 577, in forward
layer_outputs = decoder_layer(
File "C:\Code\oobabooga-windows\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Code\oobabooga-windows\oobabooga-windows\installer_files\env\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "C:\Code\oobabooga-windows\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 292, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "C:\Code\oobabooga-windows\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Code\oobabooga-windows\oobabooga-windows\installer_files\env\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "C:\Code\oobabooga-windows\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 196, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "C:\Code\oobabooga-windows\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Code\oobabooga-windows\oobabooga-windows\installer_files\env\lib\site-packages\accelerate\hooks.py", line 160, in new_forward
args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
File "C:\Code\oobabooga-windows\oobabooga-windows\installer_files\env\lib\site-packages\accelerate\hooks.py", line 280, in pre_forward
set_module_tensor_to_device(module, name, self.execution_device, value=self.weights_map[name])
File "C:\Code\oobabooga-windows\oobabooga-windows\installer_files\env\lib\site-packages\accelerate\utils\offload.py", line 123, in __getitem__
return self.dataset[f"{self.prefix}{key}"]
File "C:\Code\oobabooga-windows\oobabooga-windows\installer_files\env\lib\site-packages\accelerate\utils\offload.py", line 170, in __getitem__
weight_info = self.index[key]
KeyError: 'model.layers.38.self_attn.q_proj.wf1'
Output generated in 3.69 seconds (0.00 tokens/s, 0 tokens, context 38, seed 186524734)
Solved by raising the pre_layer GPTQ parameter to 35 and selecting llama as the model type. Generation is pretty slow, though: a word or two per second.
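For reference, a full invocation along these lines should work (the model folder name is taken from my log above; adjust it to your own model):
python server.py --chat --model vicuna-13b-GPTQ-4bit-128g --wbits 4 --groupsize 128 --model_type llama --pre_layer 35
As I understand it, pre_layer is the number of transformer layers kept on the GPU; the remaining layers are offloaded to the CPU, which is why generation slows down as the value drops.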
I did the same thing, but I wasn't able to use 35; I had to use 28, and it was painfully slow, 0.03 tokens/s, using the Alpaca model.
I'm getting the same issue with an RTX 3060 and LLaMa-Storytelling-30B-4Bit-128g.
KeyError: 'model.layers.37.self_attn.rotary_emb.cos_cached'
How did you guess the number of pre_layer-ed layers?
Trial and error. I started at 35 and reduced it until it started working.
Same problem trying to run the llama-65b-4bit-128g model (I have 1 TB of RAM). At first the GPU ran out of memory, so I reduced the GPU memory limit, and then I ran into this problem. pre_layer hasn't worked for me so far either; the autotune cache phase throws an error, too.
This is what I am getting, no matter what pre_layer I use:
Traceback (most recent call last):
File "G:\AI\oobabooga_windows\text-generation-webui\server.py", line 102, in load_model_wrapper
shared.model, shared.tokenizer = load_model(shared.model_name)
File "G:\AI\oobabooga_windows\text-generation-webui\modules\models.py", line 158, in load_model
model = load_quantized(model_name)
File "G:\AI\oobabooga_windows\text-generation-webui\modules\GPTQ_loader.py", line 173, in load_quantized
model = load_quant(str(path_to_model), str(pt_path), shared.args.wbits, shared.args.groupsize, shared.args.pre_layer)
File "G:\AI\oobabooga_windows\text-generation-webui\repositories\GPTQ-for-LLaMa\llama_inference_offload.py", line 228, in load_quant
model.load_state_dict(torch.load(checkpoint))
File "G:\AI\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 2041, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
Missing key(s) in state_dict: "model.layers.0.self_attn.k_proj.bias", "model.layers.0.self_attn.o_proj.bias", "model.layers.0.self_attn.q_proj.bias", "model.layers.0.self_attn.v_proj.bias", "model.layers.0.mlp.down_proj.bias", ...
Does anyone have an idea what's going on here?
I can't even get the webui to load when I use pre_layer with any value mentioned here, or even lower values. I'm back to the CUDA out-of-memory error: torch.cuda.OutOfMemoryError: CUDA out of memory.
My run command, after trying to fix multiple issues: python server.py --wbits 4 --groupsize 128 --model_type Llama --auto-devices --gpu-memory 4000MiB --pre_layer 28 --chat --disk --no-cache. (I also had --load-in-8bit at one point.)
I read that one person outright said that trying to run TheBloke_stable-vicuna-13B-GPTQ on an RTX 3060 (laptop) with 6 GB of VRAM is a no-go; I would need a 7B model or so.
Without --pre_layer x I can get it to boot, but no replies come back; it just sits on "..." and then nothing.
How did you guess the number of pre_layer-ed layers?
GPU 6 GB: ~20-25 pre_layer
GPU 8 GB: ~30-35 pre_layer
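A minimal sketch of that rule of thumb in Python (the per-layer cost and fixed overhead are my assumptions, not numbers from this thread; tune them by trial and error as described above):

def estimate_pre_layer(vram_mib, per_layer_mib=200, overhead_mib=1536):
    # Assumption: each 4-bit 13B layer kept on the GPU costs roughly 200 MiB,
    # with about 1.5 GiB reserved for the CUDA context, activations and cache.
    return max(0, (vram_mib - overhead_mib) // per_layer_mib)

print(estimate_pre_layer(6144))  # 23 -> inside the 20-25 range for a 6 GB card
print(estimate_pre_layer(8192))  # 33 -> inside the 30-35 range for an 8 GB card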
This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.