text-generation-webui
server.py not starting with GPTQ latest git 534edc7
Describe the bug
Launching the latest text-generation-webui code with the latest qwopqwop200/GPTQ-for-LLaMa throws a Python error:
Loading settings from /home/alex/oobabooga/settings.json...
Loading llama-30b...
Traceback (most recent call last):
File "/home/alex/oobabooga/text-generation-webui/server.py", line 241, in <module>
shared.model, shared.tokenizer = load_model(shared.model_name)
File "/home/alex/oobabooga/text-generation-webui/modules/models.py", line 98, in load_model
from modules.GPTQ_loader import load_quantized
File "/home/alex/oobabooga/text-generation-webui/modules/GPTQ_loader.py", line 11, in <module>
import opt
File "/home/alex/oobabooga/text-generation-webui/repositories/GPTQ-for-LLaMa/opt.py", line 424
model = load_quant(args.model, args.load, args.wbits, args.groupsize))
^
SyntaxError: unmatched ')'
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
cd text-generation-webui/repositories/GPTQ-for-LLaMa
git pull
pip install -r requirements.txt
python setup_cuda.py install
cd ../..
python server.py --auto-devices --gpu-memory 16 --gptq-bits 4 --cai-chat --listen --extensions gallery llama_prompts --model llama-30b --settings ~/oobabooga/settings.json
Screenshot
No response
Logs
Loading settings from /home/alex/oobabooga/settings.json...
Loading llama-30b...
Traceback (most recent call last):
File "/home/alex/oobabooga/text-generation-webui/server.py", line 241, in <module>
shared.model, shared.tokenizer = load_model(shared.model_name)
File "/home/alex/oobabooga/text-generation-webui/modules/models.py", line 98, in load_model
from modules.GPTQ_loader import load_quantized
File "/home/alex/oobabooga/text-generation-webui/modules/GPTQ_loader.py", line 11, in <module>
import opt
File "/home/alex/oobabooga/text-generation-webui/repositories/GPTQ-for-LLaMa/opt.py", line 424
model = load_quant(args.model, args.load, args.wbits, args.groupsize))
^
SyntaxError: unmatched ')'
System Info
```shell
Ryzen 7700X
RTX 4090
Ubuntu 22.10 amd64
micromamba environment
python 3.10.9
pytorch 1.13.1
torchaudio 0.13.1
torchvision 0.14.1
```
It seems that the `load_quant()` call in `modules/GPTQ_loader.py` needs to pass one more (new) positional argument to qwopqwop200/GPTQ-for-LLaMa: `groupsize`.
After correcting the SyntaxError, here's the trace:
Loading settings from /home/alex/oobabooga/settings.json...
Loading llama-30b...
Traceback (most recent call last):
File "/home/alex/oobabooga/text-generation-webui/server.py", line 241, in <module>
shared.model, shared.tokenizer = load_model(shared.model_name)
File "/home/alex/oobabooga/text-generation-webui/modules/models.py", line 100, in load_model
model = load_quantized(model_name)
File "/home/alex/oobabooga/text-generation-webui/modules/GPTQ_loader.py", line 55, in load_quantized
model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits)
TypeError: load_quant() missing 1 required positional argument: 'groupsize'
Change
`model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits)`
to
`model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits, -1)`
According to the args documentation, -1 selects the default group size.
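For anyone patching by hand, here is a minimal sketch of how the corrected call sits inside `modules/GPTQ_loader.py`. The imports and function signature below are paraphrased from the tracebacks above, not copied from the repo; only the trailing `-1` argument is the actual change being discussed.

```python
# Paraphrased sketch of the patched call in modules/GPTQ_loader.py -- not the
# repo's exact code. load_quant() lives in repositories/GPTQ-for-LLaMa/llama.py
# (see the traceback above) and shared.args carries the web UI's CLI options.
from llama import load_quant  # assumes repositories/GPTQ-for-LLaMa is on sys.path
from modules import shared


def load_quantized_sketch(path_to_model, pt_path):
    # Old call: fails with "TypeError: load_quant() missing 1 required
    # positional argument: 'groupsize'" against current GPTQ-for-LLaMa:
    #   model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits)

    # New call: pass groupsize explicitly; -1 selects the default (no grouping)
    # according to the GPTQ-for-LLaMa argument help:
    model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits, -1)
    return model
```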
Thanks, passing the value triggers another exception:
Loading settings from /home/alex/oobabooga/settings.json...
Loading llama-30b...
Loading model ...
Traceback (most recent call last):
File "/home/alex/oobabooga/text-generation-webui/server.py", line 241, in <module>
shared.model, shared.tokenizer = load_model(shared.model_name)
File "/home/alex/oobabooga/text-generation-webui/modules/models.py", line 100, in load_model
model = load_quantized(model_name)
File "/home/alex/oobabooga/text-generation-webui/modules/GPTQ_loader.py", line 55, in load_quantized
model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits, -1)
File "/home/alex/oobabooga/text-generation-webui/repositories/GPTQ-for-LLaMa/llama.py", line 246, in load_quant
model.load_state_dict(torch.load(checkpoint))
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
Missing key(s) in state_dict: "model.layers.0.self_attn.q_proj.qzeros", ...
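The missing `qzeros` keys suggest the checkpoint on disk was produced by an older GPTQ-for-LLaMa than the loading code now expects: the new code looks for `qzeros` tensors that old .pt files simply don't contain. A quick, hedged way to see which layout a given checkpoint uses (the key-name convention is inferred from the error above, and the path is a placeholder):

```python
# Hedged diagnostic: list the zero-point tensor names in a GPTQ checkpoint to
# see whether it uses the old layout (".zeros" buffers, which quant.py's forward
# still references) or the new one (".qzeros", which the loader now expects).
# The path below is a placeholder for your own .pt file.
import torch

state_dict = torch.load("models/llama-30b-4bit.pt", map_location="cpu")
has_qzeros = any(key.endswith(".qzeros") for key in state_dict)
has_zeros = any(key.endswith(".zeros") for key in state_dict)
print(f"new-style qzeros keys: {has_qzeros} | old-style zeros keys: {has_zeros}")
```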
Yea, it looks like there are more issues with the GPTQ changes today than just syntax. I rolled back the GPTQ repo to yesterday's version without any of his changes today and it works fine. I saw the same error as you before the rollback.
> Yea, it looks like there are more issues with the GPTQ changes today than just syntax. I rolled back the GPTQ repo to yesterday's version without any of his changes today and it works fine.
Will do the same for now; I'd be curious to understand if re-quantizing the models with today's code would fix the loading. Thanks for helping out! :)
If anyone needs a known good hash to roll back to, you can reset here (make sure to run this in the GPTQ-for-LLaMa repo, of course)
git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4
Corresponds to this commit yesterday: https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/468c47c01b4fe370616747b6d69a2d3f48bab5e4
It's what I'm using for my container at the moment.
I actually don't know anymore... It seems like it might be more broken than I thought. I'm using the pre-quantized models from HF, so you might be right about versions, alex.
(text-generation-webui) PS text-generation-webui> python server.py --model llama-7b --load-in-4bit --auto-devices
Warning: --load-in-4bit is deprecated and will be removed. Use --gptq-bits 4 instead.
Loading llama-7b...
Loading model ...
Done.
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
Loaded the model in 2.71 seconds.
text-generation-webui\lib\site-packages\gradio\deprecation.py:40: UserWarning: The 'type' parameter has been deprecated. Use the Number component instead.
warnings.warn(value)
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
Exception in thread Thread-3 (gentask):
Traceback (most recent call last):
File "text-generation-webui\lib\threading.py", line 1016, in _bootstrap_inner
self.run()
File "text-generation-webui\lib\threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "text-generation-webui\modules\callbacks.py", line 65, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "text-generation-webui\modules\text_generation.py", line 199, in generate_with_callback
shared.model.generate(**kwargs)
File "text-generation-webui\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "text-generation-webui\lib\site-packages\transformers\generation\utils.py", line 1452, in generate
return self.sample(
File "text-generation-webui\lib\site-packages\transformers\generation\utils.py", line 2468, in sample
outputs = self(
File "text-generation-webui\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "text-generation-webui\lib\site-packages\transformers\models\llama\modeling_llama.py", line 765, in forward
outputs = self.model(
File "text-generation-webui\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "text-generation-webui\lib\site-packages\transformers\models\llama\modeling_llama.py", line 614, in forward
layer_outputs = decoder_layer(
File "text-generation-webui\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "text-generation-webui\lib\site-packages\transformers\models\llama\modeling_llama.py", line 309, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "text-generation-webui\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "text-generation-webui\lib\site-packages\transformers\models\llama\modeling_llama.py", line 209, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "text-generation-webui\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "text-generation-webui\repositories\GPTQ-for-LLaMa\quant.py", line 198, in forward
quant_cuda.vecquant4matmul(x, self.qweight, y, self.scales, self.zeros)
TypeError: vecquant4matmul(): incompatible function arguments. The following argument types are supported:
1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: torch.Tensor, arg3: torch.Tensor, arg4: torch.Tensor, arg5: int) -> None
Invoked with: tensor([[ 0.0436, -0.0149, 0.0150, ..., 0.0267, 0.0112, -0.0011],
[ 0.0032, -0.0213, 0.0215, ..., 0.0320, -0.0013, -0.0199],
[-0.0021, 0.0065, -0.0123, ..., 0.0199, -0.0018, -0.0081],
...,
[ 0.0074, 0.0389, 0.0164, ..., -0.0429, -0.0018, -0.0133],
[ 0.0305, 0.0061, 0.0262, ..., 0.0096, 0.0096, 0.0033],
[-0.0431, -0.0260, 0.0012, ..., 0.0075, -0.0076, -0.0037]],
device='cuda:0'), tensor([[ 2004248423, 2020046951, 1734903431, ..., -2024113529,
-1772648858, 1988708488],
[ 2004318071, 1985447543, 1719101303, ..., 1738958728,
1734834296, 1988584549],
[-2006481289, -2038991241, 2003200134, ..., -1734780278,
-2055714936, -1401572265],
...,
[-2022213769, -2021226889, 1735947895, ..., 2002357398,
1483176039, -1215859063],
[ 2005366614, -2022148249, 1752733576, ..., 394557864,
1986418055, 1483962710],
[ 1735820935, 1988720743, -2056755593, ..., -1468438152,
1718123383, 1150911352]], device='cuda:0', dtype=torch.int32), tensor([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], device='cuda:0'), tensor([[0.0318],
[0.0154],
[0.0123],
...,
[0.0191],
[0.0206],
[0.0137]], device='cuda:0'), tensor([[0.2229],
[0.1079],
[0.0860],
...,
[0.1529],
[0.1439],
[0.0960]], device='cuda:0')
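The TypeError above points to the compiled extension and the Python code being out of sync: the freshly built `quant_cuda` kernel advertises six parameters (five tensors plus an int, presumably the new group size), while `quant.py` in the checkout still passes five. A small hedged check, assuming the extension built and imports (pybind11 stores the accepted signatures in the function's docstring, which is the same text the TypeError prints):

```python
# Hedged check: print the signature(s) the installed quant_cuda kernel accepts
# and compare them with the call in repositories/GPTQ-for-LLaMa/quant.py.
# Assumes the extension was built with `python setup_cuda.py install`.
import quant_cuda

print(quant_cuda.vecquant4matmul.__doc__)
# If this lists six arguments but quant.py passes five (or the other way round),
# the compiled kernel and the checked-out Python code come from different
# commits; rebuild the extension from the commit you are actually running.
```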
> If anyone needs a known good hash to roll back to, you can reset here (make sure to run this in the GPTQ-for-LLaMa repo, of course)
> git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4
> Corresponds to this commit yesterday: https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/468c47c01b4fe370616747b6d69a2d3f48bab5e4
> It's what I'm using for my container at the moment.
Did you get the model to output predictions in your container? Mine appears to load the model, but throws an error on prediction.
> If anyone needs a known good hash to roll back to, you can reset here (make sure to run this in the GPTQ-for-LLaMa repo, of course)
> git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4
> Corresponds to this commit yesterday: https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/468c47c01b4fe370616747b6d69a2d3f48bab5e4
> It's what I'm using for my container at the moment.
This solves it for me.
This bug report is in the wrong repository, by the way. You should tell @qwopqwop200 about it.
> Did you get the model to output predictions in your container? Mine appears to load the model, but throws an error on prediction.
Yes, it's working for me with that specific commit.
Specificially, it's set up like this right now: https://github.com/RedTopper/Text-Generation-Webui-Podman/blob/main/Containerfile#L14-L15
Awesome. Thanks
> If anyone needs a known good hash to roll back to, you can reset here (make sure to run this in the GPTQ-for-LLaMa repo, of course)
> git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4
> Corresponds to this commit yesterday: https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/468c47c01b4fe370616747b6d69a2d3f48bab5e4
> It's what I'm using for my container at the moment.

> Did you get the model to output predictions in your container? Mine appears to load the model, but throws an error on prediction.
Prediction is broken for me too with yesterday's commit:
Loading settings from /home/alex/oobabooga/settings.json...
Loading llama-30b...
Loading model ...
Done.
Loaded the model in 6.81 seconds.
Loading the extension "gallery"... Ok.
Loading the extension "llama_prompts"... Ok.
/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/gradio/deprecation.py:40: UserWarning: The 'type' parameter has been deprecated. Use the Number component instead.
warnings.warn(value)
Running on local URL: http://0.0.0.0:7860
To create a public link, set `share=True` in `launch()`.
Exception in thread Thread-3 (gentask):
Traceback (most recent call last):
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/alex/oobabooga/text-generation-webui/modules/callbacks.py", line 65, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "/home/alex/oobabooga/text-generation-webui/modules/text_generation.py", line 201, in generate_with_callback
shared.model.generate(**kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/generation/utils.py", line 1452, in generate
return self.sample(
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/generation/utils.py", line 2468, in sample
outputs = self(
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 772, in forward
outputs = self.model(
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 621, in forward
layer_outputs = decoder_layer(
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 316, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 216, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/alex/oobabooga/text-generation-webui/repositories/GPTQ-for-LLaMa/quant.py", line 198, in forward
quant_cuda.vecquant4matmul(x, self.qweight, y, self.scales, self.zeros)
TypeError: vecquant4matmul(): incompatible function arguments. The following argument types are supported:
1. (arg0: at::Tensor, arg1: at::Tensor, arg2: at::Tensor, arg3: at::Tensor, arg4: at::Tensor, arg5: int) -> None
Invoked with: tensor([[-0.0500, -0.0130, -0.0012, ..., 0.0039, -0.0046, -0.0232],
[-0.0420, 0.0025, -0.0313, ..., -0.0309, 0.0211, -0.0179],
[-0.0116, 0.0273, 0.0387, ..., 0.0043, -0.0025, 0.0179],
...,
[-0.0071, -0.0465, -0.0059, ..., 0.0018, 0.0062, -0.0076],
[-0.0218, 0.0511, -0.0048, ..., 0.0093, 0.0003, 0.0119],
[ 0.0235, -0.0288, -0.0288, ..., -0.0232, -0.0172, 0.0103]],
device='cuda:0'), tensor([[ 1719302009, 2004449128, 1234793881, ..., -2019973256,
-1502063032, 2037938296],
[ 2019915367, 2004252535, 1750500728, ..., -1736926794,
965175426, -1465341558],
[-1753778313, -2005497737, -1215805527, ..., -2005514360,
1450617205, -2020972629],
...,
[ 2005431670, 1701348758, 1790806215, ..., -1967744889,
1970501769, 2055776885],
[ 1718114184, 1970689672, 1183483512, ..., 2053671319,
-1752840856, 1570348373],
[ 1734838390, 2022205543, 1734843030, ..., -1737918327,
2002028378, -1500927849]], device='cuda:0', dtype=torch.int32), tensor([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], device='cuda:0'), tensor([[0.0111],
[0.0150],
[0.0077],
...,
[0.0194],
[0.0119],
[0.0131]], device='cuda:0'), tensor([[0.0779],
[0.1051],
[0.0613],
...,
[0.1551],
[0.0830],
[0.1045]], device='cuda:0')
I wonder if they are actually testing on a quantized model, or a non-quantized one. I don't know where to go from here haha
I 'fixed' inference by:
cd repositories/GPTQ-for-LLaMa
git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4
pip install -r requirements.txt
python install_cuda.py install
Today's changes break things, however.
> I 'fixed' inference by:
> cd repositories/GPTQ-for-LLaMa
> git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4
> pip install -r requirements.txt
> python install_cuda_py install
> Today's changes break things, however.
I also have the same issue; the last line in your reply is not working.
> I 'fixed' inference by:
> cd repositories/GPTQ-for-LLaMa
> git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4
> pip install -r requirements.txt
> python install_cuda_py install
> Today's changes break things, however.
>
> I also have the same issue; the last line in your reply is not working.
Fixed the typo:
`python install_cuda.py install`
> I 'fixed' inference by: <snip>
That would make sense: you also need to rebuild the CUDA package with the .cpp files from that commit. The container starts fresh from each build, so the compiled version always matches the Python code used in the repo.
Awesome! Worked for me too. I completely forgot to rebuild the kernel -_-
In any case, I reported https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/62 to qwopqwop200 / GPTQ-for-LLaMa
qwopqwop200 replied: as of today, LLaMA models need to be re-quantized to work with the newest code.
I'll test and report back ;-)
> qwopqwop200 replied: as of today, LLaMA models need to be re-quantized to work with the newest code.
@zoidbb help?
Sum up:
- latest GPTQ-for-LLaMa code
- re-quantized HF LLaMA model(s) to 4-bit GPTQ
- changed modules/GPTQ_loader.py from
  `model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits)`
  to
  `model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits, -1)`
Works for me, tested with LLaMA-7B and LLaMA-13B. Tomorrow I'm going to re-quantize 30B/65B.
So this is why I couldn't load the models after I fixed the ) bug.
But now we can quantize with different group sizes. Which one is best for performance and coherence? I hate that I have to re-do this, btw.
Re-quantize means running `python llama.py ..\..\models\llama-13b-hf c4 --wbits 2 --groupsize 128 --save ..\..\models\llama13b-2bit.pt` from GPTQ-for-LLaMa?
This requires a ton of VRAM; I have two 8GB cards, but it only maxes out one card's memory. How can this be done locally? I previously downloaded the decapoda-research files.
Edit: nvm, found a 13B model with the LoRA integrated that loads.
@alexl83 Would you be able to host the fixed quantized files somewhere, perhaps on Hugging Face?
When recompiling GPTQ on Windows, I accidentally forgot to use the x64 native tools cmd. It then successfully compiled using Visual Studio 2022 on its own, which is interesting considering everyone has been saying that only VS 2019 will work.
I recommend using the previous GPTQ commit for now
mkdir repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
cd GPTQ-for-LLaMa
git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4
python setup_cuda.py install
https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#installation
> When recompiling GPTQ on Windows, I accidentally forgot to use the x64 native tools cmd. It then successfully compiled using Visual Studio 2022 on its own, which is interesting considering everyone has been saying that only VS 2019 will work.
I noticed this as well. I was going off of the Reddit thread at the time, but I guess it is wrong.
I keep getting: "CUDA Extension not installed." I'm on Windows 11 native. I have used the older commit of GPTQ (git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4) and made sure to install the .whl correctly. CUDA is certainly installed: running `python`, then `import torch` and `torch.cuda.is_available()`, returns True.
This is my first time installing LLaMA, so I'm not sure if this is just a perfect storm of changes happening or what. It appears that GPTQ_loader.py was changed yesterday to `model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits, shared.args.gptq_pre_layer)` (see post), and yet it still doesn't seem to work with the current branch of GPTQ.
Something about re-quantization too? No idea what my issue is. I'm sure there is a whole lot more I am missing, since I'm just now diving in today.
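For what it's worth, the "CUDA Extension not installed." message usually means the `quant_cuda` module itself failed to import, which is separate from PyTorch seeing the GPU. A quick hedged check, run from the same environment the web UI uses:

```python
# Hedged diagnostic: PyTorch detecting the GPU and the GPTQ kernel being
# importable are two different things; check both from the web UI's environment.
import torch

print("torch CUDA available:", torch.cuda.is_available())
try:
    import quant_cuda  # built by setup_cuda.py or installed from the prebuilt wheel
    print("quant_cuda found at:", quant_cuda.__file__)
except ImportError as err:
    print("quant_cuda not importable (-> 'CUDA Extension not installed'):", err)
```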
@KnoBuddy if you delete your environment and files and roll back text-generation-webui to two days ago, these instructions I made should work for you. You might be able to replace the `python setup_cuda.py install` line with installing the .whl. If installing the .whl doesn't work, then try the `python setup_cuda.py install` line. If that returns a missing-compiler error, you need to install VS BuildTools like I mention in the instructions.
@KnoBuddy "CUDA Extension not installed." is specifically referring to GPTQ-for-LLaMa. I've had this issue before after installing an outdated wheel. I uploaded a Windows wheel yesterday, along with the batch script that I use to install everything above that:
https://github.com/oobabooga/text-generation-webui/issues/457#issuecomment-1477075495
Maybe that will work for you; if not, I can try compiling a new wheel, but that wheel should work. If you use the batch script, make sure not to run it as admin. If you have issues with permissions and need to run it as admin, add a `cd /D` command pointing to your current directory just after the first `call` line. Also, make sure to install the .whl file while it is inside the GPTQ-for-LLaMa folder. I've had issues with it not installing properly outside that folder.