Albert Tseng

15 comments of Albert Tseng

We are still working on integration, albeit very slowly.

Cool, good to hear that our fine-tuning works for AQLM too. I also observed that the e2e fine-tuning can do most of what the blockwise fine-tuning does, which is good...

We have a better method coming out soon, so QuIP# development has been superseded. We may eventually get around to HF support, but without working CUDA graphs during generation, it's...

Hi Marc, CUDA graphs are essential for fast inference since they hide much of the kernel launch overhead. Many quantization algorithms like QuIP# use multiple kernels during inference and...
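For context, this is roughly the capture/replay pattern I mean; a minimal sketch where the `torch.nn.Linear`, the shapes, and the warm-up count are all stand-ins for a real quantized decode step, not QuIP#'s actual kernels:

```python
import torch

# Stand-in for a fixed-shape decode step (placeholder, not a real model).
model = torch.nn.Linear(4096, 4096).cuda().half()
static_input = torch.randn(1, 4096, device="cuda", dtype=torch.half)

# Warm up on a side stream so capture sees already-initialized kernels.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture: every kernel launch is recorded once into the graph, so the
# per-step CPU launch overhead is paid at capture time, not at replay time.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = model(static_input)

# Replay: copy new data into the static input buffer, then replay the
# whole recorded kernel sequence with a single launch.
static_input.copy_(torch.randn(1, 4096, device="cuda", dtype=torch.half))
g.replay()
print(static_output.norm())
```

The catch for multi-kernel quantized inference is that everything inside the capture must use static shapes and static buffers, which is exactly where many integrations break.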

Is there a list of such models and a guide on how to use CUDA graphs with transformers? I just tried torch.compile(model.generate, mode='reduce-overhead') on transformers 4.42.3 with Llama 2 7B...
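For anyone reproducing this, here is roughly the snippet in question; the model ID, dtype, prompt, and generation length are illustrative, and `mode="reduce-overhead"` is the torch.compile setting that enables CUDA graphs:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
).cuda()

# Compile generate itself with CUDA graphs enabled via reduce-overhead,
# as described in the comment above.
model.generate = torch.compile(model.generate, mode="reduce-overhead")

inputs = tok("Hello", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0]))
```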