text-generation-webui
server.py not starting with GPTQ latest git 534edc7
Describe the bug
Launching the latest text-generation-webui code with the latest qwopqwop200/GPTQ-for-LLaMa throws a Python error:
Loading settings from /home/alex/oobabooga/settings.json...
Loading llama-30b...
Traceback (most recent call last):
File "/home/alex/oobabooga/text-generation-webui/server.py", line 241, in <module>
shared.model, shared.tokenizer = load_model(shared.model_name)
File "/home/alex/oobabooga/text-generation-webui/modules/models.py", line 98, in load_model
from modules.GPTQ_loader import load_quantized
File "/home/alex/oobabooga/text-generation-webui/modules/GPTQ_loader.py", line 11, in <module>
import opt
File "/home/alex/oobabooga/text-generation-webui/repositories/GPTQ-for-LLaMa/opt.py", line 424
model = load_quant(args.model, args.load, args.wbits, args.groupsize))
^
SyntaxError: unmatched ')'
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
cd text-generation-webui/repositories/GPTQ-for-LLaMa
git pull
pip install -r requirements.txt
python setup_cuda.py install
cd ../..
python server.py --auto-devices --gpu-memory 16 --gptq-bits 4 --cai-chat --listen --extensions gallery llama_prompts --model llama-30b --settings ~/oobabooga/settings.json
Screenshot
No response
Logs
Loading settings from /home/alex/oobabooga/settings.json...
Loading llama-30b...
Traceback (most recent call last):
File "/home/alex/oobabooga/text-generation-webui/server.py", line 241, in <module>
shared.model, shared.tokenizer = load_model(shared.model_name)
File "/home/alex/oobabooga/text-generation-webui/modules/models.py", line 98, in load_model
from modules.GPTQ_loader import load_quantized
File "/home/alex/oobabooga/text-generation-webui/modules/GPTQ_loader.py", line 11, in <module>
import opt
File "/home/alex/oobabooga/text-generation-webui/repositories/GPTQ-for-LLaMa/opt.py", line 424
model = load_quant(args.model, args.load, args.wbits, args.groupsize))
^
SyntaxError: unmatched ')'
System Info
```shell
Ryzen 7700X
RTX 4090
Ubuntu 22.10 amd64
micromamba environment
python 3.10.9
pytorch 1.13.1
torchaudio 0.13.1
torchvision 0.14.1
```
It seems that the `load_quant()` call in `modules/GPTQ_loader.py` needs to pass one more (new) positional argument to qwopqwop200/GPTQ-for-LLaMa: `groupsize`.
After correcting the SyntaxError, here's the trace:
Loading settings from /home/alex/oobabooga/settings.json...
Loading llama-30b...
Traceback (most recent call last):
File "/home/alex/oobabooga/text-generation-webui/server.py", line 241, in <module>
shared.model, shared.tokenizer = load_model(shared.model_name)
File "/home/alex/oobabooga/text-generation-webui/modules/models.py", line 100, in load_model
model = load_quantized(model_name)
File "/home/alex/oobabooga/text-generation-webui/modules/GPTQ_loader.py", line 55, in load_quantized
model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits)
TypeError: load_quant() missing 1 required positional argument: 'groupsize'
Change
`model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits)`
to
`model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits, -1)`
According to the args documentation, -1 selects the default group size.
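For anyone patching by hand, here is a minimal sketch of how the corrected call sits inside `modules/GPTQ_loader.py`. The imports and function signature below are paraphrased from the tracebacks above, not copied from the repo; only the trailing `-1` argument is the actual change being discussed.

```python
# Paraphrased sketch of the patched call in modules/GPTQ_loader.py -- not the
# repo's exact code. load_quant() lives in repositories/GPTQ-for-LLaMa/llama.py
# (see the traceback above) and shared.args carries the web UI's CLI options.
from llama import load_quant  # assumes repositories/GPTQ-for-LLaMa is on sys.path
from modules import shared


def load_quantized_sketch(path_to_model, pt_path):
    # Old call: fails with "TypeError: load_quant() missing 1 required
    # positional argument: 'groupsize'" against current GPTQ-for-LLaMa:
    #   model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits)

    # New call: pass groupsize explicitly; -1 selects the default (no grouping)
    # according to the GPTQ-for-LLaMa argument help:
    model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits, -1)
    return model
```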
Thanks, passing the value triggers another exception:
Loading settings from /home/alex/oobabooga/settings.json...
Loading llama-30b...
Loading model ...
Traceback (most recent call last):
File "/home/alex/oobabooga/text-generation-webui/server.py", line 241, in <module>
shared.model, shared.tokenizer = load_model(shared.model_name)
File "/home/alex/oobabooga/text-generation-webui/modules/models.py", line 100, in load_model
model = load_quantized(model_name)
File "/home/alex/oobabooga/text-generation-webui/modules/GPTQ_loader.py", line 55, in load_quantized
model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits, -1)
File "/home/alex/oobabooga/text-generation-webui/repositories/GPTQ-for-LLaMa/llama.py", line 246, in load_quant
model.load_state_dict(torch.load(checkpoint))
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
Missing key(s) in state_dict: "model.layers.0.self_attn.q_proj.qzeros", ...
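The missing `qzeros` keys suggest the checkpoint on disk was produced by an older GPTQ-for-LLaMa than the loading code now expects: the new code looks for `qzeros` tensors that old .pt files simply don't contain. A quick, hedged way to see which layout a given checkpoint uses (the key-name convention is inferred from the error above, and the path is a placeholder):

```python
# Hedged diagnostic: list the zero-point tensor names in a GPTQ checkpoint to
# see whether it uses the old layout (".zeros" buffers, which quant.py's forward
# still references) or the new one (".qzeros", which the loader now expects).
# The path below is a placeholder for your own .pt file.
import torch

state_dict = torch.load("models/llama-30b-4bit.pt", map_location="cpu")
has_qzeros = any(key.endswith(".qzeros") for key in state_dict)
has_zeros = any(key.endswith(".zeros") for key in state_dict)
print(f"new-style qzeros keys: {has_qzeros} | old-style zeros keys: {has_zeros}")
```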
Yea, it looks like there are more issues with the GPTQ changes today than just syntax. I rolled back the GPTQ repo to yesterday's version without any of his changes today and it works fine. I saw the same error as you before the rollback.
> Yea, it looks like there are more issues with the GPTQ changes today than just syntax. I rolled back the GPTQ repo to yesterday's version without any of his changes today and it works fine.
Will do the same for now; I'd be curious to understand if re-quantizing the models with today's code would fix the loading. Thanks for helping out! :)
If anyone needs a known good hash to roll back to, you can reset here (make sure to run this in the GPTQ-for-LLaMa repo, of course)
git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4
Corresponds to this commit yesterday: https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/468c47c01b4fe370616747b6d69a2d3f48bab5e4
It's what I'm using for my container at the moment.
I actually don't know anymore... It seems like it might be more broken than I thought. I'm using the pre-quantized models from HF, so you might be right about versions, alex.
(text-generation-webui) PS text-generation-webui> python server.py --model llama-7b --load-in-4bit --auto-devices
Warning: --load-in-4bit is deprecated and will be removed. Use --gptq-bits 4 instead.
Loading llama-7b...
Loading model ...
Done.
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
Loaded the model in 2.71 seconds.
text-generation-webui\lib\site-packages\gradio\deprecation.py:40: UserWarning: The 'type' parameter has been deprecated. Use the Number component instead.
warnings.warn(value)
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
Exception in thread Thread-3 (gentask):
Traceback (most recent call last):
File "text-generation-webui\lib\threading.py", line 1016, in _bootstrap_inner
self.run()
File "text-generation-webui\lib\threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "text-generation-webui\modules\callbacks.py", line 65, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "text-generation-webui\modules\text_generation.py", line 199, in generate_with_callback
shared.model.generate(**kwargs)
File "text-generation-webui\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "text-generation-webui\lib\site-packages\transformers\generation\utils.py", line 1452, in generate
return self.sample(
File "text-generation-webui\lib\site-packages\transformers\generation\utils.py", line 2468, in sample
outputs = self(
File "text-generation-webui\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "text-generation-webui\lib\site-packages\transformers\models\llama\modeling_llama.py", line 765, in forward
outputs = self.model(
File "text-generation-webui\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "text-generation-webui\lib\site-packages\transformers\models\llama\modeling_llama.py", line 614, in forward
layer_outputs = decoder_layer(
File "text-generation-webui\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "text-generation-webui\lib\site-packages\transformers\models\llama\modeling_llama.py", line 309, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "text-generation-webui\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "text-generation-webui\lib\site-packages\transformers\models\llama\modeling_llama.py", line 209, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "text-generation-webui\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "text-generation-webui\repositories\GPTQ-for-LLaMa\quant.py", line 198, in forward
quant_cuda.vecquant4matmul(x, self.qweight, y, self.scales, self.zeros)
TypeError: vecquant4matmul(): incompatible function arguments. The following argument types are supported:
1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: torch.Tensor, arg3: torch.Tensor, arg4: torch.Tensor, arg5: int) -> None
Invoked with: tensor([[ 0.0436, -0.0149, 0.0150, ..., 0.0267, 0.0112, -0.0011],
[ 0.0032, -0.0213, 0.0215, ..., 0.0320, -0.0013, -0.0199],
[-0.0021, 0.0065, -0.0123, ..., 0.0199, -0.0018, -0.0081],
...,
[ 0.0074, 0.0389, 0.0164, ..., -0.0429, -0.0018, -0.0133],
[ 0.0305, 0.0061, 0.0262, ..., 0.0096, 0.0096, 0.0033],
[-0.0431, -0.0260, 0.0012, ..., 0.0075, -0.0076, -0.0037]],
device='cuda:0'), tensor([[ 2004248423, 2020046951, 1734903431, ..., -2024113529,
-1772648858, 1988708488],
[ 2004318071, 1985447543, 1719101303, ..., 1738958728,
1734834296, 1988584549],
[-2006481289, -2038991241, 2003200134, ..., -1734780278,
-2055714936, -1401572265],
...,
[-2022213769, -2021226889, 1735947895, ..., 2002357398,
1483176039, -1215859063],
[ 2005366614, -2022148249, 1752733576, ..., 394557864,
1986418055, 1483962710],
[ 1735820935, 1988720743, -2056755593, ..., -1468438152,
1718123383, 1150911352]], device='cuda:0', dtype=torch.int32), tensor([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], device='cuda:0'), tensor([[0.0318],
[0.0154],
[0.0123],
...,
[0.0191],
[0.0206],
[0.0137]], device='cuda:0'), tensor([[0.2229],
[0.1079],
[0.0860],
...,
[0.1529],
[0.1439],
[0.0960]], device='cuda:0')
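The TypeError above points to the compiled extension and the Python code being out of sync: the freshly built `quant_cuda` kernel advertises six parameters (five tensors plus an int, presumably the new group size), while `quant.py` in the checkout still passes five. A small hedged check, assuming the extension built and imports (pybind11 stores the accepted signatures in the function's docstring, which is the same text the TypeError prints):

```python
# Hedged check: print the signature(s) the installed quant_cuda kernel accepts
# and compare them with the call in repositories/GPTQ-for-LLaMa/quant.py.
# Assumes the extension was built with `python setup_cuda.py install`.
import quant_cuda

print(quant_cuda.vecquant4matmul.__doc__)
# If this lists six arguments but quant.py passes five (or the other way round),
# the compiled kernel and the checked-out Python code come from different
# commits; rebuild the extension from the commit you are actually running.
```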
> If anyone needs a known good hash to roll back to, you can reset here (make sure to run this in the GPTQ-for-LLaMa repo, of course)
> git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4
> Corresponds to this commit yesterday: https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/468c47c01b4fe370616747b6d69a2d3f48bab5e4
> It's what I'm using for my container at the moment.
Did you get the model to output predictions in your container? Mine appears to load the model, but throws an error on prediction.
> If anyone needs a known good hash to roll back to, you can reset here (make sure to run this in the GPTQ-for-LLaMa repo, of course)
> git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4
> Corresponds to this commit yesterday: https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/468c47c01b4fe370616747b6d69a2d3f48bab5e4
> It's what I'm using for my container at the moment.
This solves it for me.
This bug report is in the wrong repository, by the way. You should tell @qwopqwop200 about it.
> Did you get the model to output predictions in your container? Mine appears to load the model, but throws an error on prediction.
Yes, it's working for me with that specific commit.
Specificially, it's set up like this right now: https://github.com/RedTopper/Text-Generation-Webui-Podman/blob/main/Containerfile#L14-L15
Awesome. Thanks
> If anyone needs a known good hash to roll back to, you can reset here (make sure to run this in the GPTQ-for-LLaMa repo, of course)
> git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4
> Corresponds to this commit yesterday: https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/468c47c01b4fe370616747b6d69a2d3f48bab5e4
> It's what I'm using for my container at the moment.

> Did you get the model to output predictions in your container? Mine appears to load the model, but throws an error on prediction.
Prediction is broken for me too with yesterday's commit:
Loading settings from /home/alex/oobabooga/settings.json...
Loading llama-30b...
Loading model ...
Done.
Loaded the model in 6.81 seconds.
Loading the extension "gallery"... Ok.
Loading the extension "llama_prompts"... Ok.
/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/gradio/deprecation.py:40: UserWarning: The 'type' parameter has been deprecated. Use the Number component instead.
warnings.warn(value)
Running on local URL: http://0.0.0.0:7860
To create a public link, set `share=True` in `launch()`.
Exception in thread Thread-3 (gentask):
Traceback (most recent call last):
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/alex/oobabooga/text-generation-webui/modules/callbacks.py", line 65, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "/home/alex/oobabooga/text-generation-webui/modules/text_generation.py", line 201, in generate_with_callback
shared.model.generate(**kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/generation/utils.py", line 1452, in generate
return self.sample(
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/generation/utils.py", line 2468, in sample
outputs = self(
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 772, in forward
outputs = self.model(
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 621, in forward
layer_outputs = decoder_layer(
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 316, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 216, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/alex/oobabooga/text-generation-webui/repositories/GPTQ-for-LLaMa/quant.py", line 198, in forward
quant_cuda.vecquant4matmul(x, self.qweight, y, self.scales, self.zeros)
TypeError: vecquant4matmul(): incompatible function arguments. The following argument types are supported:
1. (arg0: at::Tensor, arg1: at::Tensor, arg2: at::Tensor, arg3: at::Tensor, arg4: at::Tensor, arg5: int) -> None
Invoked with: tensor([[-0.0500, -0.0130, -0.0012, ..., 0.0039, -0.0046, -0.0232],
[-0.0420, 0.0025, -0.0313, ..., -0.0309, 0.0211, -0.0179],
[-0.0116, 0.0273, 0.0387, ..., 0.0043, -0.0025, 0.0179],
...,
[-0.0071, -0.0465, -0.0059, ..., 0.0018, 0.0062, -0.0076],
[-0.0218, 0.0511, -0.0048, ..., 0.0093, 0.0003, 0.0119],
[ 0.0235, -0.0288, -0.0288, ..., -0.0232, -0.0172, 0.0103]],
device='cuda:0'), tensor([[ 1719302009, 2004449128, 1234793881, ..., -2019973256,
-1502063032, 2037938296],
[ 2019915367, 2004252535, 1750500728, ..., -1736926794,
965175426, -1465341558],
[-1753778313, -2005497737, -1215805527, ..., -2005514360,
1450617205, -2020972629],
...,
[ 2005431670, 1701348758, 1790806215, ..., -1967744889,
1970501769, 2055776885],
[ 1718114184, 1970689672, 1183483512, ..., 2053671319,
-1752840856, 1570348373],
[ 1734838390, 2022205543, 1734843030, ..., -1737918327,
2002028378, -1500927849]], device='cuda:0', dtype=torch.int32), tensor([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], device='cuda:0'), tensor([[0.0111],
[0.0150],
[0.0077],
...,
[0.0194],
[0.0119],
[0.0131]], device='cuda:0'), tensor([[0.0779],
[0.1051],
[0.0613],
...,
[0.1551],
[0.0830],
[0.1045]], device='cuda:0')
I wonder if they are actually testing on a quantized model, or a non-quantized one. I don't know where to go from here haha
I 'fixed' inference by:
cd repositories/GPTQ-for-LLaMa
git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4
pip install -r requirements.txt
python install_cuda.py install
Today's changes break things, however.
> I 'fixed' inference by:
> cd repositories/GPTQ-for-LLaMa
> git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4
> pip install -r requirements.txt
> python install_cuda_py install
> Today's changes break things, however.
I also have the same issue; the last line in your reply is not working.
> I 'fixed' inference by:
> cd repositories/GPTQ-for-LLaMa
> git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4
> pip install -r requirements.txt
> python install_cuda_py install
> Today's changes break things, however.
>
> I also have the same issue; the last line in your reply is not working.
Fixed the typo:
`python install_cuda.py install`
> I 'fixed' inference by: <snip>
That would make sense: you also need to rebuild the CUDA package with the .cpp files from that commit. The container starts fresh from each build, so the compiled version always matches the Python code used in the repo.
Awesome! Worked for me too. I completely forgot to rebuild the kernel -_-
In any case, I reported https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/62 to qwopqwop200 / GPTQ-for-LLaMa
qwopqwop200 replied: as of today, LLaMA models need to be re-quantized to work with the newest code.
I'll test and report back ;-)
> qwopqwop200 replied: as of today, LLaMA models need to be re-quantized to work with the newest code.
@zoidbb help?
Sum up:
- latest GPTQ-for-LLaMa code
- re-quantized HF LLaMA model(s) to 4-bit GPTQ
- changed modules/GPTQ_loader.py from
  `model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits)`
  to
  `model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits, -1)`
Works for me, tested with LLaMA-7B and LLaMA-13B. Tomorrow I'm going to re-quantize 30B/65B.
So this is why I couldn't load the models after I fixed the ) bug.
But now we can quantize with different group sizes. Which one is best for performance and coherence? I hate that I have to re-do this, btw.
Re-quantize means running `python llama.py ..\..\models\llama-13b-hf c4 --wbits 2 --groupsize 128 --save ..\..\models\llama13b-2bit.pt` from GPTQ-for-LLaMa?
This requires a ton of VRAM; I have two 8GB cards, but it only maxes out one card's memory. How can this be done locally? I previously downloaded the decapoda-research files.
Edit: nvm, found a 13B model with the LoRA integrated that loads.
@alexl83 Would you be able to host the fixed quantized files somewhere, perhaps on Hugging Face?
When recompiling GPTQ on Windows, I accidentally forgot to use the x64 native tools cmd. It then successfully compiled using Visual Studio 2022 on its own, which is interesting considering everyone has been saying that only VS 2019 will work.
I recommend using the previous GPTQ commit for now
mkdir repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
cd GPTQ-for-LLaMa
git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4
python setup_cuda.py install
https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#installation
> When recompiling GPTQ on Windows, I accidentally forgot to use the x64 native tools cmd. It then successfully compiled using Visual Studio 2022 on its own, which is interesting considering everyone has been saying that only VS 2019 will work.
I noticed this as well. I was going off of the Reddit thread at the time, but I guess it is wrong.
I keep getting: "CUDA Extension not installed." I'm on Windows 11 native. I have used the older commit of GPTQ (git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4) and made sure to install the .whl correctly. CUDA is certainly installed: running `python`, then `import torch` and `torch.cuda.is_available()`, returns True.
This is my first time installing LLaMA, so I'm not sure if this is just a perfect storm of changes happening or what. It appears that GPTQ_loader.py was changed yesterday to `model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits, shared.args.gptq_pre_layer)` (see post), and yet it still doesn't seem to work with the current branch of GPTQ.
Something about re-quantization too? No idea what my issue is. I'm sure there is a whole lot more I am missing, since I'm just now diving in today.
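For what it's worth, the "CUDA Extension not installed." message usually means the `quant_cuda` module itself failed to import, which is separate from PyTorch seeing the GPU. A quick hedged check, run from the same environment the web UI uses:

```python
# Hedged diagnostic: PyTorch detecting the GPU and the GPTQ kernel being
# importable are two different things; check both from the web UI's environment.
import torch

print("torch CUDA available:", torch.cuda.is_available())
try:
    import quant_cuda  # built by setup_cuda.py or installed from the prebuilt wheel
    print("quant_cuda found at:", quant_cuda.__file__)
except ImportError as err:
    print("quant_cuda not importable (-> 'CUDA Extension not installed'):", err)
```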
@KnoBuddy if you delete your environment and files and roll back text-generation-webui to two days ago, these instructions I made should work for you. You might be able to replace the `python setup_cuda.py install` line with installing the .whl. If installing the .whl doesn't work, then try the `python setup_cuda.py install` line. If that returns a missing-compiler error, you need to install VS BuildTools like I mention in the instructions.
@KnoBuddy "CUDA Extension not installed." is specifically referring to GPTQ-for-LLaMa. I've had this issue before after installing an outdated wheel. I uploaded a Windows wheel yesterday, along with the batch script that I use to install everything above that:
https://github.com/oobabooga/text-generation-webui/issues/457#issuecomment-1477075495
Maybe that will work for you; if not, I can try compiling a new wheel, but that wheel should work. If you use the batch script, make sure not to run it as admin. If you have issues with permissions and need to run it as admin, add a `cd /D` command pointing to your current directory just after the first `call` line. Also, make sure to install the .whl file while it is inside the GPTQ-for-LLaMa folder. I've had issues with it not installing properly outside that folder.