text-generation-webui
Triton GPTQ on AMD/ROCm: generation very slow
Describe the bug
I have been trying to get the Triton GPTQ fork working on my AMD 6800 XT. I recently did get it working by passing --no-quant_attn --no-fused_mlp --no-warmup_autotune, but inference is extremely slow, slower than CPU. ryzentop says the GPU is being used, yet generation is still extremely slow. I'm not sure whether this is a Triton problem, a GPTQ-for-LLaMa problem, or a misconfiguration of the textgen webui.
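As a sanity check (just a minimal sketch, assuming a ROCm build of PyTorch is installed), this confirms that PyTorch actually sees the card before blaming the Triton kernels:

python - <<'EOF'
import torch
# On a ROCm build, torch.version.hip is a version string (it is None on CUDA builds)
print("HIP:", torch.version.hip)
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
EOF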
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
python server.py --no-quant_attn --no-warmup_autotune --no-fused_mlp
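For reference, a fuller invocation would look something like the line below; the model name and the --wbits/--groupsize values are placeholders for whatever 4-bit model is being loaded, not taken from my actual setup:

python server.py --model llama-7b-4bit-128g --wbits 4 --groupsize 128 \
    --no-quant_attn --no-warmup_autotune --no-fused_mlp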
Screenshot
No response
Logs
No errors unless I remove the Triton feature-disable options.
System Info
Arch Linux / EndeavourOS, AMD 6800 XT 16 GB
On Nvidia I still get: AttributeError: module 'triton.compiler' has no attribute 'OutOfResources'
I assume it is using an unaccelerated matmul on your "unsupported" card too.
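That AttributeError usually suggests a mismatch between the installed Triton version and what the GPTQ fork expects; here is a quick diagnostic (just a sketch, not a fix) to see which Triton is installed and whether the attribute exists:

python - <<'EOF'
import triton
import triton.compiler
print("Triton:", triton.__version__)
print("compiler.OutOfResources present:", hasattr(triton.compiler, "OutOfResources"))
EOF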
Currently AMD GPUs are unsupported, but I'm sure webui would love to add support. The main problem is that none of the webui maintainers seem to have an AMD GPU to test with. If you'd like to test for us and could share your installation instructions, maybe we could get everything working.
Can't wait until I can use this on my AMD card. I'll be ready when you end up supporting it!
@D-a-r-n-o-l-d For context: it is possible to run on the GPU, but the performance is not on par, and none of the maintainers has a card to test with. Follow the guide.
I think AMD support may not be the responsibility of the oobabooga webui, since it relies on Triton/GPTQ, which are what determine AMD performance. Maybe I'll open issues with them instead, if I can figure out which parts are causing the poor performance. The CUDA GPTQ-for-LLaMa fork for ROCm works excellently and is very fast; it's just a shame that AMD is always behind for machine learning applications.
This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.