Johannes Gäßler

Results 235 comments of Johannes Gäßler

Thank you for the high-quality post. I definitely agree that the hashing is suboptimal; my main concern for now is to get something that works at all, and to also...

Prior to reading the hashing function blog post, I wrote a simple implementation that just uses bit shifts and XORs, but that already results in much better performance: | Model...

I think the model and prompt will be a bigger factor than the hardware as long as the hashing is fast enough. These are some numbers I get on my...

I've added a test for asserting that lookup decoding produces correct results. The sequences are the same for temperature 0, though the results are not going to be bit-for-bit identical....

I'm not sure what you mean by overload but I'm happy to test suggested alternatives.

I adopted the Fibonacci hash implementation. For LLaMA 3 q4_K_M on an RTX 4090 it's maybe a ~1% end-to-end speedup. Results | Model | GPU | Static lookup cache...

I re-tested the performance on 1x RTX 4090 with CUDA graphs, but contrary to my expectations I am seeing virtually no performance difference compared to before: | Model | GPU |...

The numbers for the `server-ngram` branches on my repository are just the numbers I use internally to keep my branches apart. Just use the branch I'm using for this PR.

If you want any chance of getting this fixed, do a git bisect to identify the exact commit that caused the performance regression and notify the corresponding dev.
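The bisect can even be automated with `git bisect run`, which marks each candidate commit good or bad from a command's exit code. A self-contained illustration on a throwaway repo where one commit "regresses" a file; every name here is made up, and in practice the test command would be a build-and-benchmark script:

```shell
set -e
# Build a toy repo: a fast baseline, an unrelated commit, then a regression.
cd "$(mktemp -d)"
git init -q repo && cd repo
git config user.email dev@example.com
git config user.name  dev
echo fast > perf.txt
git add perf.txt && git commit -qm "baseline (fast)"
git commit -q --allow-empty -m "unrelated change"
echo slow > perf.txt
git commit -qam "regression (slow)"

git bisect start HEAD HEAD~2                    # bad = HEAD, good = HEAD~2
result=$(git bisect run grep -q fast perf.txt)  # exit 0 = good, nonzero = bad
echo "$result" | grep "first bad commit"        # bisect names the culprit
git bisect reset                                # back to the original HEAD
```

With a real regression between two release tags, the same pattern applies: `git bisect start <bad-tag> <good-tag>` and a script that builds and benchmarks, exiting nonzero when the result is slow.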

> I tried to do "git bisect" to find root reason for it, but there're huge patches added between tag b1500 and b2581.

Download the model as the original weights and...