GPTQ-for-LLaMa
Wondering whether some of the Triton or CUDA kernels also speed up fp16 or not?
I am not familiar with Triton or CUDA, but it seems like some of the code (e.g. fused_attn) could also be used in fp16 to gain an inference speedup compared with HuggingFace?
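To make the question concrete, here is a minimal sketch (not code from this repo) of the idea: fused attention kernels can benefit fp16 on their own, independent of quantization. It compares a naive fp16 attention, similar in spirit to the eager HuggingFace implementation, against PyTorch's built-in fused `scaled_dot_product_attention`. The shapes and the assumption of an available CUDA device are illustrative only.

```python
# Illustrative sketch only: generic PyTorch, not GPTQ-for-LLaMa's fused_attn code.
import time
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # Unfused path: materializes the full (seq, seq) score matrix in memory,
    # roughly what an eager attention implementation does.
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale
    probs = torch.softmax(scores, dim=-1)
    return probs @ v

@torch.inference_mode()
def bench(fn, *args, iters=50):
    # Simple wall-clock benchmark with CUDA synchronization.
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()
    return (time.time() - start) / iters

if __name__ == "__main__":
    # Shapes loosely modeled on one LLaMA-7B attention layer (assumed for illustration):
    # batch 1, 32 heads, sequence 2048, head_dim 128, fp16 weights/activations.
    q, k, v = (torch.randn(1, 32, 2048, 128, device="cuda", dtype=torch.float16)
               for _ in range(3))
    t_naive = bench(naive_attention, q, k, v)
    t_fused = bench(F.scaled_dot_product_attention, q, k, v)
    print(f"naive fp16 attention: {t_naive * 1e3:.2f} ms/iter")
    print(f"fused fp16 attention: {t_fused * 1e3:.2f} ms/iter")
```

Whether this repo's fused_attn / Triton kernels give a similar win on unquantized fp16 models is exactly what I'm asking; the sketch just shows why it seems plausible.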