text-generation-webui
RuntimeError: context length buffer size with the Transformers loader's exllama backend requires extending.
Describe the bug
Can't generate more than 3352 tokens with the Transformers loader's exllama backend.
When I try, a RuntimeError is raised.
According to the error log, this is fixable by calling exllama_set_max_input_length(model, max_input_length=new_input_length), which increases the temp_state buffer size (see the sketch below).
This seems to only affect GPTQ models quantized with act-order set to true.
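For reference, here is a minimal sketch of the workaround the error message suggests, applied right after loading the model. The model path, device_map, and the 8192 buffer size are placeholders, not values taken from this setup:

```python
from transformers import AutoModelForCausalLM
from auto_gptq import exllama_set_max_input_length

# Placeholder path: any act-order GPTQ checkpoint loaded through Transformers.
model = AutoModelForCausalLM.from_pretrained("Qwen2-7B-GPTQ", device_map="auto")

# Grow the exllama temp_state buffer so sequences longer than the default
# limit (3352 tokens in this case) no longer raise the RuntimeError.
model = exllama_set_max_input_length(model, max_input_length=8192)
```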
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
1. Load `Qwen2-7B-GPTQ` with the Transformers loader without setting `disable_exllama` or `disable_exllama2`.
2a. Input a prompt longer than 3352 tokens, or
2b. Set `max_new_tokens` to a value larger than 3352.
3. Hit generate.
4. Check terminal.
Note: I haven't checked models other than Qwen2-7B-GPTQ. A rough script version of these steps is sketched below.
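The following is an untested sketch of the steps above as a standalone script (the model id, prompt, and token counts are placeholders); it should hit the same code path as the webui:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen2-7B-GPTQ"  # placeholder for the local GPTQ checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Build a prompt that tokenizes to more than 3352 tokens (step 2a).
prompt = "hello " * 4000
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# With the exllama kernel left enabled (steps 1 and 3), this generate call
# raises the temp_state RuntimeError shown in the logs below.
output = model.generate(**inputs, max_new_tokens=16)
```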
Screenshot
No response
Logs
Traceback (most recent call last):
File "/opt/tgwui/modules/text_generation.py", line 378, in generate_reply_HF
output = shared.model.generate(**generate_params)[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/transformers/generation/utils.py", line 1914, in generate
result = self._sample(
^^^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/transformers/generation/utils.py", line 2651, in _sample
outputs = self(
^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1221, in forward
outputs = self.model(
^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1023, in forward
layer_outputs = decoder_layer(
^^^^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 777, in forward
hidden_states = self.mlp(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 186, in forward
return self.down_proj(self.act_fn(self.gate_proj(hidden_state)) * self.up_proj(hidden_state))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/auto_gptq/nn_modules/qlinear/qlinear_exllama.py", line 185, in forward
out = ext_q4_matmul(x, self.q4, self.width)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/auto_gptq/nn_modules/qlinear/qlinear_exllama.py", line 42, in ext_q4_matmul
q4_matmul(x, q4, output)
RuntimeError: The temp_state buffer is too small in the exllama backend for GPTQ with act-order. Please call the exllama_set_max_input_length function to increase the buffer size for a sequence length >=3352:
from auto_gptq import exllama_set_max_input_length
model = exllama_set_max_input_length(model, max_input_length=3352)
System Info
quay.io/jupyter/docker-stacks-foundation:python-3.11
Intel 4th gen CPU
RTX 2080 Ti