turboderp (180 comments)

That's helpful. I'll look into it. Probably just yet another variant of GPTQ to consider.

I pushed an update now to deal with weights without groupsize. Seems to work here at least, also with the quantized matmul to give 33 tokens/second on my setup. So...

This issue seems to have been forgotten, but yes, you should be getting better speeds than that. A lot has changed in the last three weeks, so you could try...

I'm not familiar with the format that AutoGPTQ produces LoRAs in. Whether it's supported or not depends on what the resulting tensors look like. If they're FP16 and they target...
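For context, the usual (unfused) LoRA format stores two low-rank FP16 factors per targeted weight, and applying the adapter is just a scaled low-rank update. A minimal NumPy sketch of that standard update, with illustrative names and shapes (not ExLlama's or AutoGPTQ's actual loading code):

```python
import numpy as np

# Standard LoRA update: W' = W + (alpha / r) * B @ A, where A is (r x in)
# and B is (out x r). Names, shapes, and alpha here are illustrative.
def apply_lora(W, A, B, alpha):
    r = A.shape[0]  # LoRA rank
    return W + (alpha / r) * (B @ A)

W = np.zeros((8, 4), dtype=np.float32)
A = np.ones((2, 4), dtype=np.float32)   # rank-2 down-projection
B = np.ones((8, 2), dtype=np.float32)   # rank-2 up-projection
W2 = apply_lora(W, A, B, alpha=4.0)     # every entry becomes (4/2) * 2 = 4
```

A fused format, by contrast, bakes this update directly into the (possibly quantized) base weights, which is why recovering A and B afterwards is hard.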

I'm not sure if this is really an issue or not. The performance is likely down to the way the sampler is optimized for _reasonable_ values of top-p under an...
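For reference, nucleus (top-p) sampling keeps only the smallest set of tokens whose cumulative probability reaches top-p; with p close to 1 almost nothing is filtered, so the sort over the full vocabulary dominates the cost. A minimal NumPy sketch of the technique (illustrative only, not ExLlama's optimized sampler):

```python
import numpy as np

def top_p_sample(logits, top_p=0.9, rng=None):
    """Sample a token id from logits using nucleus (top-p) filtering."""
    rng = rng or np.random.default_rng()
    # Softmax over the logits (shifted for numerical stability).
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Sort descending; keep the smallest prefix with mass >= top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()  # renormalize the kept mass
    return int(rng.choice(keep, p=kept))
```

With a sharply peaked distribution and a small top-p, only the top token survives the cutoff, so the sample is deterministic; with top-p near 1, nearly the whole vocabulary is kept and sorted.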

Yes. There isn't an easy fix for this except attempting to convert those LoRAs back to the regular non-fused format. I don't know if I'll have time for that.

Different implementations are going to perform differently in extreme cases. You could also turn up the temperature and magnify any differences that way. But chasing perfectly deterministic behavior with CUDA...
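The determinism point holds even outside CUDA: floating-point addition is not associative, so any change in reduction order (for example, across GPU thread blocks) can change the result slightly. A tiny illustration:

```python
# Floating-point addition is not associative, so sums computed in
# different orders can disagree even on identical inputs.
a, b, c = 1e16, -1e16, 1.0
left = (a + b) + c    # cancellation happens first, then add 1.0 -> 1.0
right = a + (b + c)   # 1.0 is absorbed into -1e16 first -> 0.0
```

Parallel reductions do not guarantee a fixed summation order, which is one reason bit-identical outputs across implementations are hard to chase.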

I've always assumed as much but just decided I'd look into it when they release a 33B model. I'm an elitist.

This is pretty much what the `example_flask.py` script does. Is that what you're after?

I actually thought about adding that. But I was torn because I also liked the example being really simple. I guess it would be quick enough to add a basic...