GPTQ-for-LLaMa
Fixing Triton "Unexpected MMA layout version found" for pre-Volta GPUs raises new problems
After applying @geekypathak21's solution (see PR 1505) for working around the Triton matmul problem on pre-Volta NVIDIA GPUs, I got the same error as in issue 142: `AttributeError: module 'triton.compiler' has no attribute 'OutOfResources'`.
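The failure looks like an API mismatch rather than an actual resource problem: a patch that hard-codes the attribute lookup `triton.compiler.OutOfResources` breaks when Triton moves that exception between modules across releases. One defensive approach is to search a list of candidate modules for the exception class. The sketch below is hypothetical (the `resolve_exception` helper and the stand-in namespaces are mine, not from GPTQ-for-LLaMa), and it uses fake modules so it runs without Triton installed:

```python
import types

def resolve_exception(modules, name="OutOfResources"):
    """Return the first exception class named `name` found on `modules`.

    Triton 2.0 exposed triton.compiler.OutOfResources; later releases moved
    it (reportedly to triton.compiler.errors -- an assumption to verify), so
    a hard-coded attribute lookup raises the AttributeError seen here.
    """
    for mod in modules:
        exc = getattr(mod, name, None)
        if isinstance(exc, type) and issubclass(exc, BaseException):
            return exc
    return None

# Stand-in namespaces simulating two Triton layouts (hypothetical, so the
# sketch runs without Triton itself):
old_layout = types.SimpleNamespace(
    OutOfResources=type("OutOfResources", (Exception,), {}))
new_layout = types.SimpleNamespace()  # attribute lives elsewhere here

print(resolve_exception([new_layout, old_layout]).__name__)  # OutOfResources
print(resolve_exception([new_layout]))                       # None
```

With a real install you would pass the actual candidate modules (e.g. `triton.compiler` and whatever module the installed version uses) instead of the stand-ins.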
Can you provide steps to reproduce the issue?
I got the same issue. My log trace is:

```
INFO:Found the following quantized model: models/Aitrepreneur_stable-vicuna-13B-GPTQ-4bit-128g/stable-vicuna-13B-GPTQ-4bit.no-act-order.safetensors
INFO:Using the following device map for the quantized model:
INFO:Loaded the model in 2.55 seconds.
/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/transformers/generation/utils.py:1405: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cuda') before running `.generate()`.
  warnings.warn(
Traceback (most recent call last):
  File "/home/USER/git_projects/text-generation-webui/repositories/GPTQ-for-LLaMa/quant/custom_autotune.py", line 72, in _bench
    return triton.testing.do_bench(kernel_call, percentiles=(0.5, 0.2, 0.8), rep=40)
TypeError: do_bench() got an unexpected keyword argument 'percentiles'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/USER/git_projects/text-generation-webui/modules/callbacks.py", line 73, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "/home/USER/git_projects/text-generation-webui/modules/text_generation.py", line 251, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1485, in generate
    return self.sample(
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2524, in sample
    outputs = self(
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 687, in forward
    outputs = self.model(
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 577, in forward
    layer_outputs = decoder_layer(
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 196, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/USER/git_projects/text-generation-webui/repositories/GPTQ-for-LLaMa/quant/quant_linear.py", line 375, in forward
    out = QuantLinearFunction.apply(x.reshape(-1, x.shape[-1]), self.qweight, self.scales, self.qzeros, self.g_idx, self.bits, self.maxq)
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 106, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/home/USER/git_projects/text-generation-webui/repositories/GPTQ-for-LLaMa/quant/quant_linear.py", line 287, in forward
    output = matmul248(input, qweight, scales, qzeros, g_idx, bits, maxq)
  File "/home/USER/git_projects/text-generation-webui/repositories/GPTQ-for-LLaMa/quant/quant_linear.py", line 267, in matmul248
    matmul_248_kernel[grid](input, qweight, output, scales, qzeros, g_idx, input.shape[0], qweight.shape[1], input.shape[1], bits, maxq, input.stride(0), input.stride(1), qweight.stride(0),
  File "/home/USER/git_projects/text-generation-webui/repositories/GPTQ-for-LLaMa/quant/custom_autotune.py", line 90, in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
  File "/home/USER/git_projects/text-generation-webui/repositories/GPTQ-for-LLaMa/quant/custom_autotune.py", line 90, in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
  File "/home/USER/git_projects/text-generation-webui/repositories/GPTQ-for-LLaMa/quant/custom_autotune.py", line 73, in _bench
    except triton.compiler.OutOfResources:
AttributeError: module 'triton.compiler' has no attribute 'OutOfResources'
```
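Note that the first error in the trace is a second API change: newer Triton no longer accepts `percentiles` in `do_bench` (reportedly it was renamed to `quantiles`; treat that name as an assumption to verify against your installed version). A hedged way to keep `custom_autotune.py` working on both sides is to dispatch on the function's actual signature. `call_do_bench` and `fake_do_bench` below are illustrative names, and the stand-in lets the sketch run without Triton installed:

```python
import inspect

def call_do_bench(do_bench, kernel_call, q=(0.5, 0.2, 0.8), rep=40):
    """Invoke do_bench with whichever quantile keyword this version accepts.

    Older Triton used `percentiles`; newer releases reportedly renamed it
    to `quantiles`, which triggers the TypeError seen in the trace.
    """
    params = inspect.signature(do_bench).parameters
    kw = "quantiles" if "quantiles" in params else "percentiles"
    return do_bench(kernel_call, **{kw: q}, rep=rep)

# Stand-in mimicking a newer do_bench, so the dispatch can be demonstrated
# without Triton installed:
def fake_do_bench(fn, quantiles=None, rep=100):
    fn()
    return quantiles

print(call_do_bench(fake_do_bench, lambda: None))  # (0.5, 0.2, 0.8)
```

In `_bench` you would pass `triton.testing.do_bench` as the first argument instead of the stand-in.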
I gave the model 3584 MiB of VRAM and 32768 MiB of RAM.
Same here. I have 24 GB of VRAM and 96 GB of system RAM, so I am not actually out of memory.
Did anyone find a solution to this?
> AttributeError: module 'triton.compiler' has no attribute 'OutOfResources'

Reinstalling triton==2.0.0 solved the problem for me.
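For anyone else hitting this in text-generation-webui, that fix amounts to pinning Triton back to 2.0.0 inside the webui virtualenv, since 2.0.0 still exposes `triton.compiler.OutOfResources` and the old `do_bench(percentiles=...)` signature. The venv path comes from the log above and will differ per machine:

```shell
# Activate the webui virtualenv first (path varies per install), then
# pin Triton back to the version GPTQ-for-LLaMa's autotuner expects:
pip install --force-reinstall triton==2.0.0
# Confirm the pinned version took effect:
python -c "import triton; print(triton.__version__)"
```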