Error during response generation on RTX 5090
Describe the bug
I'm getting the following error when trying to use Oobabooga on a 5090 card. All libraries have been manually updated as needed around PyTorch 2.7 on CUDA 12.8. The UI loads, I can ask a question, and I start to get a snippet of a response, but after just a few words the console crashes with the error below and the answer stops.
```
AI:
11:37:59-204916 INFO     WARPERS= ['MinPLogitsWarper']
Traceback (most recent call last):
  File "C:\AI-Content\text-generation-webui\text-generation-webui\modules\text_generation.py", line 445, in generate_reply_HF
    new_content = get_reply_from_output_ids(output, state, starting_from=starting_from)
  File "C:\AI-Content\text-generation-webui\text-generation-webui\modules\text_generation.py", line 266, in get_reply_from_output_ids
    reply = decode(output_ids[starting_from:], state['skip_special_tokens'] if state else True)
  File "C:\AI-Content\text-generation-webui\text-generation-webui\modules\text_generation.py", line 176, in decode
    return shared.tokenizer.decode(output_ids, skip_special_tokens=skip_special_tokens)
  File "C:\AI-Content\text-generation-webui\text-generation-webui\installer_files\env\Lib\site-packages\transformers\tokenization_utils_base.py", line 3860, in decode
    return self._decode(
  File "C:\AI-Content\text-generation-webui\text-generation-webui\installer_files\env\Lib\site-packages\transformers\tokenization_utils_fast.py", line 668, in _decode
    text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
OverflowError: can't convert negative int to unsigned
Output generated in 0.29 seconds (10.40 tokens/s, 3 tokens, context 479, seed 68386361)
```
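For what it's worth, the OverflowError itself is easy to reproduce outside the webui: a fast (Rust-backed) Hugging Face tokenizer only accepts unsigned token ids, so a single negative value anywhere in output_ids kills the decode. A minimal sketch of just that failure mode (gpt2 stands in purely for illustration; any model with a fast tokenizer should behave the same):

```python
# Minimal sketch of the failure mode, independent of the webui.
# Assumption: gpt2 stands in for any model with a fast (Rust-backed)
# tokenizer; they all reject negative token ids the same way.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

print(tokenizer.decode([15496, 995]))  # valid ids decode normally
print(tokenizer.decode([-1]))          # OverflowError: can't convert negative int to unsigned
```

That suggests the tokenizer is just the messenger: something upstream in the generation path is producing a negative token id on this hardware/software combination.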
Is there an existing issue for this?
- [x] I have searched the existing issues
Reproduction
Load the UI and ask a question.
Screenshot
No response
Logs
The full traceback is included in the description above.
System Info
RTX 5090 on Windows 11
Some additional information: after testing a bit further, this error only seems to happen with GPTQ-format models (my preference due to speed). When I loaded the same model in GGUF format, everything worked.
Did a little more digging: GPTQ models work when loaded with the ExLlamav2 loader. The error above only happens with ExLlamav2_HF, which worked fine before the 5090 upgrade and the associated PyTorch 2.7 / CUDA 12.8 updates needed to get everything running again.
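Until the root cause in the ExLlamav2_HF path is sorted out, the decode step itself can be guarded so one corrupt sample doesn't abort the whole reply. This is a hypothetical local workaround sketch, not an upstream fix; the function and module names mirror the traceback above (modules/text_generation.py, shared.tokenizer):

```python
# Hypothetical workaround sketch (not an upstream fix): drop negative ids
# before decoding so a single bad sample can't crash get_reply_from_output_ids.
# This mirrors decode() in modules/text_generation.py; `shared` is the
# webui's existing modules.shared.
from modules import shared


def decode(output_ids, skip_special_tokens=True):
    # Fast (Rust-backed) tokenizers require unsigned ids; anything negative
    # here is garbage from the sampler, so skip it instead of crashing.
    safe_ids = [int(i) for i in output_ids if int(i) >= 0]
    return shared.tokenizer.decode(safe_ids, skip_special_tokens=skip_special_tokens)
```

This only masks the symptom, of course; if the sampler is emitting negative ids, the generated text is probably still wrong at that point.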
Can you share what changes you made to get it working on your 5090?