
Triton GPTQ on AMD/rocm generation very slow

Open gnawzie opened this issue 1 year ago • 5 comments

Describe the bug

I have been trying to get the Triton GPTQ fork working on my AMD 6800 XT. I recently got it running using the --no-quant_attn --no-fused_mlp --no-warmup_autotune flags, but inference is extremely slow, slower than on the CPU. ryzentop shows the GPU is being used, yet generation remains extremely slow. I'm not sure whether this is a Triton problem, a GPTQ-for-LLaMa problem, or a misconfiguration of the textgen webui.
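
As a first sanity check (a minimal sketch, assuming a ROCm build of PyTorch; on ROCm builds the CUDA API is backed by HIP), you can confirm that PyTorch actually sees the card:

    import torch

    # On ROCm builds, cuda.is_available() reports the AMD GPU.
    print("GPU visible:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("Device name:", torch.cuda.get_device_name(0))
    # torch.version.hip holds the ROCm/HIP version string on ROCm builds
    # and is None on CUDA builds.
    print("HIP version:", getattr(torch.version, "hip", None))

If the device name doesn't show the 6800 XT here, the problem is below the webui entirely.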

Is there an existing issue for this?

  • [X] I have searched the existing issues

Reproduction

python server.py --no-quant_attn --no-warmup_autotune --no-fused_mlp

Screenshot

No response

Logs

No errors appear unless I remove the flags that disable the Triton features.

System Info

Arch Linux (EndeavourOS), AMD 6800 XT 16 GB

gnawzie avatar Apr 18 '23 09:04 gnawzie

On Nvidia I am still getting AttributeError: module 'triton.compiler' has no attribute 'OutOfResources'

I assume it is using unaccelerated matmul on your "unsupported" card too.
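
One quick way to check that hypothesis (a rough sketch; the matrix size and iteration count are arbitrary) is to time a plain fp16 matmul through PyTorch, which dispatches to rocBLAS/cuBLAS rather than Triton. If this runs fast while generation is slow, the quantized Triton kernels are the likely bottleneck:

    import time
    import torch

    # Plain fp16 matmul goes through rocBLAS (ROCm) or cuBLAS (CUDA),
    # bypassing Triton entirely.
    a = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
    b = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")

    # Warm up before timing.
    for _ in range(3):
        a @ b
    torch.cuda.synchronize()

    iters = 50
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    # 2 * M * N * K floating-point operations per matmul.
    tflops = 2 * 4096**3 * iters / elapsed / 1e12
    print(f"{tflops:.1f} TFLOP/s")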

Ph0rk0z avatar Apr 18 '23 12:04 Ph0rk0z

Currently, AMD GPUs are unsupported, but I'm sure the webui would love to add support. The main problem is that none of the webui maintainers seem to have an AMD GPU to test with. If you'd like to test for us and could share your installation instructions, maybe we could get everything working.

xNul avatar Apr 18 '23 13:04 xNul

Can't wait until I can use this on my AMD card. I'll be ready when you end up supporting it!

D-a-r-n-o-l-d avatar Apr 18 '23 20:04 D-a-r-n-o-l-d

@D-a-r-n-o-l-d For context: it's possible to run on an AMD GPU, but the performance is not on par, and none of the maintainers has a card to test with. Follow the guide.

BarfingLemurs avatar Apr 18 '23 23:04 BarfingLemurs

I think AMD support may not be the responsibility of the oobabooga webui, since it relies on Triton/GPTQ, which are responsible for AMD performance. Maybe I'll open issues with those projects instead, if I can figure out which parts are causing the poor performance. The CUDA GPTQ-for-LLaMa fork ported to ROCm works very well and is fast; it's just a shame that AMD is always behind for machine learning applications.
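
For anyone trying to narrow it down, a minimal standalone Triton kernel (a sketch, assuming a working Triton install alongside the ROCm PyTorch build) can show whether Triton itself compiles and runs on the card, separating a Triton problem from a GPTQ-for-LLaMa one:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
        # Each program instance handles one BLOCK-sized slice.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offsets < n
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    n = 1 << 24
    x = torch.rand(n, device="cuda")
    y = torch.rand(n, device="cuda")
    out = torch.empty_like(x)
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK=1024)
    torch.cuda.synchronize()
    # If this passes quickly, Triton works on the card; a slow
    # quantized path would then point at the GPTQ kernels instead.
    assert torch.allclose(out, x + y)
    print("Triton kernel ran OK")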

gnawzie avatar Apr 19 '23 22:04 gnawzie

This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.

github-actions[bot] avatar May 19 '23 23:05 github-actions[bot]