GPTQ-for-LLaMa

Wondering whether some of the Triton or CUDA kernels also speed up fp16?

Open · drxmy opened this issue 2 years ago · 0 comments

I am not familiar with Triton or CUDA, but it seems like some of the code (e.g. fused_attn) could also be used in fp16 to gain an inference speedup compared with Hugging Face. Is that the case?
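For context, here is a minimal sketch (not from this repo) of the kind of comparison the question is about: an unfused fp16 attention, similar to what stock Hugging Face modules compute, versus PyTorch's `scaled_dot_product_attention`, which dispatches to fused FlashAttention-style CUDA kernels on supported GPUs. The shapes and the timing helper are illustrative assumptions.

```python
import time
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # Unfused attention: materializes the full (seq x seq) score matrix,
    # roughly what a vanilla Hugging Face attention module computes.
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

def bench(fn, *args, iters=20):
    # Simple CUDA timing helper; synchronize so wall-clock times are meaningful.
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

if __name__ == "__main__":
    # Illustrative shapes: batch 1, 32 heads, 2048 tokens, head_dim 128.
    q, k, v = (torch.randn(1, 32, 2048, 128, device="cuda", dtype=torch.float16)
               for _ in range(3))
    t_naive = bench(naive_attention, q, k, v)
    # Fused kernel path (FlashAttention / memory-efficient attention).
    t_fused = bench(F.scaled_dot_product_attention, q, k, v)
    print(f"naive: {t_naive * 1e3:.2f} ms   fused: {t_fused * 1e3:.2f} ms")
```

In general, fused attention kernels of this kind target fp16/bf16 directly, so the speedup is not specific to quantized weights; at long sequence lengths the fused path is typically faster and uses less memory than the unfused one.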

drxmy · May 31 '23 09:05