GPTQ-for-LLaMa
Fixing Triton "Unexpected MMA layout version found" for pre-Volta GPUs raises new problems
After applying @geekypathak21's solution (see PR 1505) for working around the Triton matmul problem on pre-Volta NVIDIA GPUs, I got the same error as in issue 142: `AttributeError: module 'triton.compiler' has no attribute 'OutOfResources'`.
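The failure looks like an API mismatch rather than an actual resource problem: a patch that hard-codes the attribute lookup `triton.compiler.OutOfResources` breaks when Triton moves that exception between modules across releases. One defensive approach is to search a list of candidate modules for the exception class. The sketch below is hypothetical (the `resolve_exception` helper and the stand-in namespaces are mine, not from GPTQ-for-LLaMa), and it uses fake modules so it runs without Triton installed:

```python
import types

def resolve_exception(modules, name="OutOfResources"):
    """Return the first exception class named `name` found on `modules`.

    Triton 2.0 exposed triton.compiler.OutOfResources; later releases moved
    it (reportedly to triton.compiler.errors -- an assumption to verify), so
    a hard-coded attribute lookup raises the AttributeError seen here.
    """
    for mod in modules:
        exc = getattr(mod, name, None)
        if isinstance(exc, type) and issubclass(exc, BaseException):
            return exc
    return None

# Stand-in namespaces simulating two Triton layouts (hypothetical, so the
# sketch runs without Triton itself):
old_layout = types.SimpleNamespace(
    OutOfResources=type("OutOfResources", (Exception,), {}))
new_layout = types.SimpleNamespace()  # attribute lives elsewhere here

print(resolve_exception([new_layout, old_layout]).__name__)  # OutOfResources
print(resolve_exception([new_layout]))                       # None
```

With a real install you would pass the actual candidate modules (e.g. `triton.compiler` and whatever module the installed version uses) instead of the stand-ins.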
Can you provide steps to reproduce the issue?
I got the same issue. My log trace is:

```
INFO:Found the following quantized model: models/Aitrepreneur_stable-vicuna-13B-GPTQ-4bit-128g/stable-vicuna-13B-GPTQ-4bit.no-act-order.safetensors
INFO:Using the following device map for the quantized model:
INFO:Loaded the model in 2.55 seconds.
/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/transformers/generation/utils.py:1405: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cuda') before running `.generate()`.
  warnings.warn(
Traceback (most recent call last):
  File "/home/USER/git_projects/text-generation-webui/repositories/GPTQ-for-LLaMa/quant/custom_autotune.py", line 72, in _bench
    return triton.testing.do_bench(kernel_call, percentiles=(0.5, 0.2, 0.8), rep=40)
TypeError: do_bench() got an unexpected keyword argument 'percentiles'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/USER/git_projects/text-generation-webui/modules/callbacks.py", line 73, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "/home/USER/git_projects/text-generation-webui/modules/text_generation.py", line 251, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1485, in generate
    return self.sample(
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2524, in sample
    outputs = self(
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 687, in forward
    outputs = self.model(
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 577, in forward
    layer_outputs = decoder_layer(
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 196, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/USER/git_projects/text-generation-webui/repositories/GPTQ-for-LLaMa/quant/quant_linear.py", line 375, in forward
    out = QuantLinearFunction.apply(x.reshape(-1, x.shape[-1]), self.qweight, self.scales, self.qzeros, self.g_idx, self.bits, self.maxq)
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/USER/git_projects/text-generation-webui/venv/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 106, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/home/USER/git_projects/text-generation-webui/repositories/GPTQ-for-LLaMa/quant/quant_linear.py", line 287, in forward
    output = matmul248(input, qweight, scales, qzeros, g_idx, bits, maxq)
  File "/home/USER/git_projects/text-generation-webui/repositories/GPTQ-for-LLaMa/quant/quant_linear.py", line 267, in matmul248
    matmul_248_kernel[grid](input, qweight, output, scales, qzeros, g_idx, input.shape[0], qweight.shape[1], input.shape[1], bits, maxq, input.stride(0), input.stride(1), qweight.stride(0),
  File "/home/USER/git_projects/text-generation-webui/repositories/GPTQ-for-LLaMa/quant/custom_autotune.py", line 90, in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
  File "/home/USER/git_projects/text-generation-webui/repositories/GPTQ-for-LLaMa/quant/custom_autotune.py", line 90, in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
  File "/home/USER/git_projects/text-generation-webui/repositories/GPTQ-for-LLaMa/quant/custom_autotune.py", line 73, in _bench
    except triton.compiler.OutOfResources:
AttributeError: module 'triton.compiler' has no attribute 'OutOfResources'
```
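Note that the first error in the trace is a second API change: newer Triton no longer accepts `percentiles` in `do_bench` (reportedly it was renamed to `quantiles`; treat that name as an assumption to verify against your installed version). A hedged way to keep `custom_autotune.py` working on both sides is to dispatch on the function's actual signature. `call_do_bench` and `fake_do_bench` below are illustrative names, and the stand-in lets the sketch run without Triton installed:

```python
import inspect

def call_do_bench(do_bench, kernel_call, q=(0.5, 0.2, 0.8), rep=40):
    """Invoke do_bench with whichever quantile keyword this version accepts.

    Older Triton used `percentiles`; newer releases reportedly renamed it
    to `quantiles`, which triggers the TypeError seen in the trace.
    """
    params = inspect.signature(do_bench).parameters
    kw = "quantiles" if "quantiles" in params else "percentiles"
    return do_bench(kernel_call, **{kw: q}, rep=rep)

# Stand-in mimicking a newer do_bench, so the dispatch can be demonstrated
# without Triton installed:
def fake_do_bench(fn, quantiles=None, rep=100):
    fn()
    return quantiles

print(call_do_bench(fake_do_bench, lambda: None))  # (0.5, 0.2, 0.8)
```

In `_bench` you would pass `triton.testing.do_bench` as the first argument instead of the stand-in.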
I gave the model 3584 MiB of VRAM and 32768 MiB of RAM.
Same here. I have 24 GB of VRAM and 96 GB of system RAM, so I am not actually out of memory.
Did anyone find a solution to this?
> AttributeError: module 'triton.compiler' has no attribute 'OutOfResources'

Reinstalling triton==2.0.0 solved the problem for me.
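For anyone else hitting this in text-generation-webui, that fix amounts to pinning Triton back to 2.0.0 inside the webui virtualenv, since 2.0.0 still exposes `triton.compiler.OutOfResources` and the old `do_bench(percentiles=...)` signature. The venv path comes from the log above and will differ per machine:

```shell
# Activate the webui virtualenv first (path varies per install), then
# pin Triton back to the version GPTQ-for-LLaMa's autotuner expects:
pip install --force-reinstall triton==2.0.0
# Confirm the pinned version took effect:
python -c "import triton; print(triton.__version__)"
```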