basaran
RuntimeError: mat1 and mat2 shapes cannot be multiplied
When I make multiple streaming completion requests at the same time, I get the error below.
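Roughly, the client side looks like this (a sketch, not my exact code; the prompt and request parameters are placeholders, while the endpoint and port match the log below):

import threading
import requests

URL = "http://127.0.0.1:8888/v1/completions"

def stream_completion(i):
    # Ask the server for a streaming completion; it answers with SSE chunks.
    payload = {
        "prompt": f"Request {i}: say something short.",
        "max_tokens": 64,
        "stream": True,
    }
    with requests.post(URL, json=payload, stream=True) as resp:
        for line in resp.iter_lines():
            if line:
                print(f"[{i}] {line.decode('utf-8')}")

# Start several requests at once; the RuntimeError appears as soon as
# two or more generations overlap.
threads = [threading.Thread(target=stream_completion, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()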
start listening on 127.0.0.1:8888
ERROR:waitress:Exception while serving /v1/completions
Traceback (most recent call last):
File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/waitress/channel.py", line 428, in service
task.service()
File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/waitress/task.py", line 168, in service
self.execute()
File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/waitress/task.py", line 456, in execute
for chunk in app_iter:
File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/werkzeug/wsgi.py", line 500, in __next__
return self._next()
File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/werkzeug/wrappers/response.py", line 50, in _iter_encoded
for item in iterable:
File "/home/chang/AI/llm/basaran/basaran/__main__.py", line 168, in stream
for choice in stream_model(**options):
File "/home/chang/AI/llm/basaran/basaran/model.py", line 73, in __call__
for (
File "/home/chang/AI/llm/basaran/basaran/model.py", line 237, in generate
outputs = self.model(
File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 662, in forward
outputs = self.gpt_neox(
File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 553, in forward
outputs = layer(
File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 335, in forward
mlp_output = self.mlp(self.post_attention_layernorm(hidden_states))
File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 297, in forward
hidden_states = self.dense_4h_to_h(hidden_states)
File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/bitsandbytes/nn/modules.py", line 320, in forward
out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py", line 500, in matmul
return MatMul8bitLt.apply(A, B, out, bias, state)
File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py", line 417, in forward
output += torch.matmul(subA, state.subB)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (238x13 and 29x5120)
ERROR:waitress:Exception while serving /v1/completions
We've run into the exact same error before: https://github.com/hyperonym/basaran/issues/5. The error is caused by https://github.com/TimDettmers/bitsandbytes/issues/162 and appears to occur at random. Currently the only workaround is to stop using INT8 quantization and use half-precision instead.
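At the transformers level, that change amounts to roughly the following. This is only a sketch: the model name is a placeholder, and in basaran the same switch is made through its model-loading configuration rather than by editing code directly.

import torch
from transformers import AutoModelForCausalLM

# Before: 8-bit quantization via bitsandbytes, which goes through the
# MatMul8bitLt path that fails under concurrent generation.
# model = AutoModelForCausalLM.from_pretrained(
#     "EleutherAI/gpt-neox-20b", device_map="auto", load_in_8bit=True
# )

# Workaround: load the weights in half precision instead.
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b",  # placeholder; use the model you actually serve
    device_map="auto",
    torch_dtype=torch.float16,
)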