text-generation-webui

RuntimeError: the context-length buffer size in the Transformers loader's exllama backend requires extending.

kuronekosaiko opened this issue 7 months ago

Describe the bug

Can't generate more than 3352 tokens with the Transformers loader's exllama backend.

When I try, a RuntimeError is raised.

According to the error log, this is fixable by calling exllama_set_max_input_length(model, max_input_length=new_input_length), which increases the temp_state buffer size (see the sketch below).

This seems to only affect GPTQ models quantized with act-order set to true.
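
For completeness, here is a minimal sketch of the workaround suggested by the error log. The function name and signature come straight from the error message; the model id is a stand-in for whatever act-order GPTQ checkpoint is loaded, and 8192 is just an example length:

```python
from transformers import AutoModelForCausalLM
from auto_gptq import exllama_set_max_input_length

# Stand-in id for an act-order GPTQ checkpoint; assumes the auto-gptq
# integration in transformers (and accelerate for device_map="auto").
model = AutoModelForCausalLM.from_pretrained("Qwen2-7B-GPTQ", device_map="auto")

# Enlarge the exllama temp_state buffer; 8192 is an arbitrary example that
# covers sequences up to that many tokens.
model = exllama_set_max_input_length(model, max_input_length=8192)
```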

Is there an existing issue for this?

  • [X] I have searched the existing issues

Reproduction

1. Load `Qwen2-7B-GPTQ` with the Transformers loader, without setting `disable_exllama` or `disable_exllama2`.
2a. Input a prompt longer than 3352 tokens, or
2b. set `max_new_tokens` to a value greater than 3352.
3. Hit generate.
4. Check the terminal.

Note: I haven't checked models other than Qwen2-7B-GPTQ. A minimal reproduction outside the web UI is sketched below.
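
The same trigger can be reproduced with a short script; this is only a sketch, assuming the same environment as above, and the repeated-word prompt is just a cheap way to exceed 3352 tokens:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen2-7B-GPTQ"  # stand-in for any act-order GPTQ checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# "hello " repeated 4000 times tokenizes to well over 3352 tokens.
prompt = "hello " * 4000
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Expected: RuntimeError about the temp_state buffer in the exllama backend.
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=8)
```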

Screenshot

No response

Logs

```
Traceback (most recent call last):
  File "/opt/tgwui/modules/text_generation.py", line 378, in generate_reply_HF
    output = shared.model.generate(**generate_params)[0]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/transformers/generation/utils.py", line 1914, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/transformers/generation/utils.py", line 2651, in _sample
    outputs = self(
              ^^^^^
  File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1221, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1023, in forward
    layer_outputs = decoder_layer(
                    ^^^^^^^^^^^^^^
  File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 777, in forward
    hidden_states = self.mlp(hidden_states)
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 186, in forward
    return self.down_proj(self.act_fn(self.gate_proj(hidden_state)) * self.up_proj(hidden_state))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/auto_gptq/nn_modules/qlinear/qlinear_exllama.py", line 185, in forward
    out = ext_q4_matmul(x, self.q4, self.width)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/auto_gptq/nn_modules/qlinear/qlinear_exllama.py", line 42, in ext_q4_matmul
    q4_matmul(x, q4, output)
RuntimeError: The temp_state buffer is too small in the exllama backend for GPTQ with act-order. Please call the exllama_set_max_input_length function to increase the buffer size for a sequence length >=3352:

from auto_gptq import exllama_set_max_input_length
model = exllama_set_max_input_length(model, max_input_length=3352)
```

System Info

quay.io/jupyter/docker-stacks-foundation:python-3.11
Intel 4th gen CPU
RTX 2080 Ti

kuronekosaiko · Jul 23 '24 07:07