GPTQ-for-LLaMa
Wondering whether some of the Triton or CUDA kernels also speed up fp16 or not?
I am not familiar with Triton or CUDA, but it seems like some of the code (e.g. fused_attn) could also be used in fp16 to gain an inference speedup compared with HuggingFace?
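To make the question concrete, here is a minimal sketch (not code from this repo) of the idea: fused attention kernels can benefit fp16 on their own, independent of quantization. It compares a naive fp16 attention, similar in spirit to the eager HuggingFace implementation, against PyTorch's built-in fused `scaled_dot_product_attention`. The shapes and the assumption of an available CUDA device are illustrative only.

```python
# Illustrative sketch only: generic PyTorch, not GPTQ-for-LLaMa's fused_attn code.
import time
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # Unfused path: materializes the full (seq, seq) score matrix in memory,
    # roughly what an eager attention implementation does.
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale
    probs = torch.softmax(scores, dim=-1)
    return probs @ v

@torch.inference_mode()
def bench(fn, *args, iters=50):
    # Simple wall-clock benchmark with CUDA synchronization.
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()
    return (time.time() - start) / iters

if __name__ == "__main__":
    # Shapes loosely modeled on one LLaMA-7B attention layer (assumed for illustration):
    # batch 1, 32 heads, sequence 2048, head_dim 128, fp16 weights/activations.
    q, k, v = (torch.randn(1, 32, 2048, 128, device="cuda", dtype=torch.float16)
               for _ in range(3))
    t_naive = bench(naive_attention, q, k, v)
    t_fused = bench(F.scaled_dot_product_attention, q, k, v)
    print(f"naive fp16 attention: {t_naive * 1e3:.2f} ms/iter")
    print(f"fused fp16 attention: {t_fused * 1e3:.2f} ms/iter")
```

Whether this repo's fused_attn / Triton kernels give a similar win on unquantized fp16 models is exactly what I'm asking; the sketch just shows why it seems plausible.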