turboderp
I haven't seen this at all. What model are you using? And what settings?
And just to be clear, is this in ExLlama's web UI or in Ooba?
Okay. I really have my work cut out for me with this already, but I guess I should try installing Kobold at some point to see how they're using it. I...
I'm not sure what that slider does, but if it truncates the cache that would definitely lead to degenerate output since the position embeddings for cached entries would be wrong....
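To sketch why left-truncating the cache goes wrong: with rotary embeddings, each cached key has a rotation for its absolute position baked in, so pruned caches leave surviving entries whose rotations no longer match the slots they occupy. This is a toy illustration only; the `rope_angle` helper is a simplified stand-in, not ExLlama's actual code:

```python
DIM = 64        # head dimension (assumed, for illustration)
BASE = 10000.0  # standard RoPE base

def rope_angle(pos, pair=0):
    # Rotation angle RoPE bakes into a cached key at absolute position `pos`.
    return pos * BASE ** (-2 * pair / DIM)

# Ten keys cached at absolute positions 0..9, each rotated for its position.
cached_positions = list(range(10))

# Truncate the cache from the left: the kept keys were rotated for
# positions 4..9, but the model now treats them as filling slots 0..5,
# so every surviving entry carries the wrong position embedding.
kept = cached_positions[4:]
mismatches = [rope_angle(p) - rope_angle(slot) for slot, p in enumerate(kept)]
print(mismatches)
```

Every kept entry ends up off by the same fixed rotation, which is exactly the kind of inconsistency that degrades output.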
Yes, using the KoboldAI samplers is the obvious choice for integrating into Kobold, so that's great. There's nothing special about the logits, after all. In fact you should just be...
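To make the "nothing special about the logits" point concrete, here's a minimal temperature + top-k sampler that works on a plain list of logits. It's a generic sketch, not KoboldAI's or ExLlama's actual sampling code, and the parameter defaults are arbitrary:

```python
import math
import random

def sample(logits, temperature=0.8, top_k=3, seed=0):
    # Any external sampler only needs the raw logits; nothing here is
    # specific to how the model produced them.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)                               # for numerical stability
    weights = [math.exp(s - peak) for s in scaled]   # unnormalized softmax
    r = random.Random(seed).random() * sum(weights)
    acc = 0.0
    for token, w in zip(top, weights):
        acc += w
        if r <= acc:
            return token
    return top[-1]

print(sample([0.1, 2.0, -1.0, 3.5]))
```

Since the interface is just "list of floats in, token id out", any sampler stack can be dropped in front of the model.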
Me neither. I'm still struggling to get it to load a model. :)
Well, it's up and running. I was just using a model that didn't have any `gptq_bits` key in its config and I got stuck on why it wasn't being recognized....
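A loader that probes a few candidate keys and reports clearly when none are present would have surfaced that problem faster. The alternate key names below are assumptions for illustration, not a fixed standard across GPTQ conversions:

```python
def detect_gptq_bits(config):
    # Hypothetical sketch: different GPTQ conversions have stored the bit
    # width under different config keys, so probe a few likely candidates.
    for key in ("gptq_bits", "wbits", "bits"):
        if key in config:
            return int(config[key])
    return None  # not recognizable as a quantized model; report it, don't guess

print(detect_gptq_bits({"wbits": 4}))
```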
The fused attention step is mathematically equivalent to the regular attention, but there might be slight differences related to numerical precision. Maybe if some of the sampling methods are extremely...
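The precision point can be shown without any model at all: floating-point addition is not associative, and a fused kernel typically accumulates dot products in a different order than separate steps would. A toy sketch with deliberately exaggerated magnitudes:

```python
# The same four terms of a dot product, accumulated in two orders.
terms = [1e16, 1.0, -1e16, 1.0]

# "Regular" path: accumulate left to right.
regular = 0.0
for t in terms:
    regular += t

# "Fused" path stand-in: identical terms, different accumulation order.
fused = sum(sorted(terms))

# Mathematically both equal 2.0, but rounding makes them disagree.
print(regular, fused)
```

In practice the drift is in the last few bits rather than this dramatic, but a sampler that amplifies tiny probability differences could still end up picking different tokens.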
I'll have to try and see if I can reproduce it. One thing that stands out is the call to `gen_prune_left()` which I haven't looked at in ages. I think...
I wrote a quick little script to try and spot any difference in the output between fused and regular attention:

```python
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import...
```