Tim Dettmers comments

Results: 106 comments of Tim Dettmers

Is it possible to enable fused op F.gemv_4bit in F.gemv_4bit backward?

This kernel is meant for 4-bit vector-matrix multiplication, which is a common use-case for token-by-token inference/generation; however, in the backward pass, a token-by-token backward is unusual. More commonly, a backward...
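To make the vector-matrix case concrete, here is a minimal sketch in plain Python of multiplying a single input vector by a 4-bit weight matrix while dequantizing each weight on the fly. It is only an illustration under stated assumptions: the linear 4-bit code, the helper names `quantize_blockwise` and `gemv_4bit_sketch`, and the tiny block size are hypothetical and are not the formats or the CUDA kernel used by `F.gemv_4bit`.

```python
# Illustrative sketch only: a linear 4-bit code in plain Python,
# not the data types or the kernel behind F.gemv_4bit.

def quantize_blockwise(values, blocksize=4):
    """Quantize a flat list of floats to signed 4-bit codes, one absmax scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), blocksize):
        block = values[start:start + blocksize]
        absmax = max(abs(v) for v in block) or 1.0  # all-zero block: avoid dividing by zero
        scales.append(absmax)
        codes.extend(round(v / absmax * 7) for v in block)  # codes lie in [-7, 7]
    return codes, scales


def gemv_4bit_sketch(x, codes, scales, n_rows, blocksize=4):
    """Compute y = W @ x for a row-major W (n_rows x len(x)), dequantizing per weight."""
    n_cols = len(x)
    y = []
    for row in range(n_rows):
        acc = 0.0
        for col in range(n_cols):
            flat = row * n_cols + col
            weight = codes[flat] / 7 * scales[flat // blocksize]  # dequantize on the fly
            acc += weight * x[col]
        y.append(acc)
    return y


# One input vector (a single token) against a 2 x 4 weight matrix.
W = [1.0, -1.0, 0.0, 1.0,
     0.0, 0.0, 0.0, 0.0]
x = [1.0, 2.0, 3.0, 4.0]
codes, scales = quantize_blockwise(W, blocksize=4)
y = gemv_4bit_sketch(x, codes, scales, n_rows=2, blocksize=4)
```

With a batch of inputs instead of one vector, the same weights would be reused across many rows of input, which is the matrix-matrix situation the comment describes as more typical for a backward pass.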

FLUTE Integration for Fast Inference

Sorry, we really messed this up. This was very close to being integrated into bitsandbytes, and we failed on the bitsandbytes side to go the last mile. This was...

FLUTE Integration for Fast Inference

We are closing this for now but will reopen if we start working on this. Again, thank you for bringing this so far and sorry for messing this up.

Communicate blocksize constraints to kernels that take blocksize as a runtime argument

Very good catch. I think this would be a good contribution. In general, the latency overhead of the dequantization operation is currently the biggest slowdown for these kernels. I think if...

Any plan to support block size 32?

Instead of replying, I quickly tried to implement it, but I failed. Despite this, it might be a good starting point for implementing this. You can find my changes on...

4-bit conversion for non-huggingface models

You can do this by using the 4-bit quantization functions on the weights of the model. One way to do this manually is to do pseudocode: ``` for module in...
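As a rough illustration of the loop-over-modules idea, the sketch below quantizes each module's weights to 4 bits and packs two codes per byte. Everything in it is an assumption made for the example: the model is a hypothetical dictionary of flat weight lists, `quantize_4bit_sketch` is a plain-Python stand-in with a simple linear code, and none of it calls the actual bitsandbytes or PyTorch APIs.

```python
# Illustrative sketch only: plain-Python stand-ins, not the bitsandbytes or PyTorch API.

def quantize_4bit_sketch(weights, blocksize=64):
    """Return (packed bytes, per-block absmax scales) for a flat list of floats."""
    codes, scales = [], []
    for start in range(0, len(weights), blocksize):
        block = weights[start:start + blocksize]
        absmax = max(abs(w) for w in block) or 1.0  # all-zero block: avoid dividing by zero
        scales.append(absmax)
        # Map [-absmax, absmax] to unsigned codes 1..15, where 8 represents zero.
        codes.extend(round(w / absmax * 7) + 8 for w in block)
    if len(codes) % 2:
        codes.append(8)  # pad with the zero code so the codes pair up
    packed = bytes((hi << 4) | lo for hi, lo in zip(codes[0::2], codes[1::2]))
    return packed, scales


# Hypothetical model: module name -> flat list of weights.
model = {
    "layer1": [0.5, -1.0, 0.25, 1.0],
    "layer2": [2.0, 0.0, -2.0],
}

# Replace every module's weights with their packed 4-bit form plus scales.
quantized = {}
for name, weights in model.items():
    quantized[name] = quantize_4bit_sketch(weights, blocksize=4)
```

In a real model the loop would visit the framework's module objects and store the quantization state alongside the packed weights so they can be dequantized later; the dictionary here only stands in for that structure.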