text-generation-webui
RuntimeError: context length buffer size with the Transformers loader's exllama backend requires extending.
Describe the bug
Can't generate more than 3352 tokens with the Transformers loader's exllama backend.
When I try, a RuntimeError is raised.
According to the error log, this is fixable by calling exllama_set_max_input_length(model, max_input_length=new_input_length), which increases the temp_state buffer size (see the sketch below).
This seems to only affect GPTQ models quantized with act-order set to true.
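For reference, here is a minimal sketch of the workaround the error message suggests, applied right after loading the model. The model path, device_map, and the 8192 buffer size are placeholders, not values taken from this setup:

```python
from transformers import AutoModelForCausalLM
from auto_gptq import exllama_set_max_input_length

# Placeholder path: any act-order GPTQ checkpoint loaded through Transformers.
model = AutoModelForCausalLM.from_pretrained("Qwen2-7B-GPTQ", device_map="auto")

# Grow the exllama temp_state buffer so sequences longer than the default
# limit (3352 tokens in this case) no longer raise the RuntimeError.
model = exllama_set_max_input_length(model, max_input_length=8192)
```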
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
1. Load `Qwen2-7B-GPTQ` with the Transformers loader without setting `disable_exllama` or `disable_exllama2`.
2a. Input a prompt longer than 3352 tokens, or
2b. Set `max_new_tokens` to a value larger than 3352.
3. Hit generate.
4. Check terminal.
Note: I haven't checked models other than Qwen2-7B-GPTQ. A rough script version of these steps is sketched below.
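The following is an untested sketch of the steps above as a standalone script (the model id, prompt, and token counts are placeholders); it should hit the same code path as the webui:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen2-7B-GPTQ"  # placeholder for the local GPTQ checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Build a prompt that tokenizes to more than 3352 tokens (step 2a).
prompt = "hello " * 4000
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# With the exllama kernel left enabled (steps 1 and 3), this generate call
# raises the temp_state RuntimeError shown in the logs below.
output = model.generate(**inputs, max_new_tokens=16)
```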
Screenshot
No response
Logs
Traceback (most recent call last):
File "/opt/tgwui/modules/text_generation.py", line 378, in generate_reply_HF
output = shared.model.generate(**generate_params)[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/transformers/generation/utils.py", line 1914, in generate
result = self._sample(
^^^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/transformers/generation/utils.py", line 2651, in _sample
outputs = self(
^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1221, in forward
outputs = self.model(
^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1023, in forward
layer_outputs = decoder_layer(
^^^^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 777, in forward
hidden_states = self.mlp(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 186, in forward
return self.down_proj(self.act_fn(self.gate_proj(hidden_state)) * self.up_proj(hidden_state))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/auto_gptq/nn_modules/qlinear/qlinear_exllama.py", line 185, in forward
out = ext_q4_matmul(x, self.q4, self.width)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tgwui/installer_files/env/lib/python3.11/site-packages/auto_gptq/nn_modules/qlinear/qlinear_exllama.py", line 42, in ext_q4_matmul
q4_matmul(x, q4, output)
RuntimeError: The temp_state buffer is too small in the exllama backend for GPTQ with act-order. Please call the exllama_set_max_input_length function to increase the buffer size for a sequence length >=3352:
from auto_gptq import exllama_set_max_input_length
model = exllama_set_max_input_length(model, max_input_length=3352)
System Info
quay.io/jupyter/docker-stacks-foundation:python-3.11
Intel 4th gen CPU
RTX 2080 Ti