turboderp
There isn't a fix, no, because I haven't been able to reproduce the problem yet. I'm working on a thorough perplexity test to run with all the different possible code...
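For context, a perplexity test boils down to exp of the mean negative log-likelihood over held-out tokens. A minimal sketch of that calculation, assuming a generic HuggingFace-style causal LM rather than the actual test harness (`model` and `input_ids` are placeholders, not names from this repo):

```python
import torch
import torch.nn.functional as F

def perplexity(model, input_ids: torch.Tensor) -> float:
    # input_ids: (1, seq_len) token ids of the evaluation text
    with torch.no_grad():
        logits = model(input_ids).logits  # (1, seq_len, vocab)
    # Predict token t+1 from position t, so shift by one
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    nll = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
    return torch.exp(nll).item()  # perplexity = exp(mean NLL)
```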
Kobold doesn't use ExLlama's sampling, only logits from the model. Ooba does use the native sampling, though, as well as ExLlama's tokenizer, which is just a straight SentencePiece instance reading...
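A rough illustration of what a tokenizer that is "just" SentencePiece looks like; the model filename is a placeholder and ExLlama's actual wrapper may differ:

```python
from sentencepiece import SentencePieceProcessor

# "tokenizer.model" is whatever SentencePiece model file ships with the weights
sp = SentencePieceProcessor(model_file="tokenizer.model")

ids = sp.encode("Hello, world!")  # text -> token ids
text = sp.decode(ids)             # token ids -> text
print(ids, text)
```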
I did some more really heavy tuning for the 4090 and 3090, so it's not too surprising if it's less ideal for the H100. I'm in the process of adding...
Typo is fixed. Thanks. But attention probably isn't the issue anyway. I guess I'll have to add a profiling mode to time the CUDA kernel launches, since the performance profiles...
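One way such a profiling mode could time GPU-side work is with CUDA events, which measure on the device rather than the host. A sketch only, not the repo's actual instrumentation:

```python
import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

a = torch.randn(4096, 4096, device="cuda", dtype=torch.half)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.half)

start.record()
c = a @ b                    # the launch being measured
end.record()
torch.cuda.synchronize()     # wait for the GPU before reading the timer
print(f"{start.elapsed_time(end):.3f} ms")
```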
I'll have to take some time to look this over, but I'm not a fan of this bit: >delete message function now deletes not only selected message, but also everything...
There's something screwy going on if the Torch matmul is taking CPU time. It has to be a synchronization issue, otherwise I don't know what to make of that. Could...
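The synchronization angle: CUDA launches are asynchronous, so a wall-clock timer charges the GPU wait to whichever call happens to synchronize, which can make a matmul look like CPU time. A small demonstration of the pitfall:

```python
import time
import torch

a = torch.randn(8192, 8192, device="cuda")
b = torch.randn(8192, 8192, device="cuda")

t0 = time.perf_counter()
c = a @ b                  # returns almost immediately (async launch)
t1 = time.perf_counter()
_ = c[0, 0].item()         # implicit sync: CPU now waits for the GPU
t2 = time.perf_counter()

print(f"launch: {(t1 - t0) * 1e3:.3f} ms, sync: {(t2 - t1) * 1e3:.3f} ms")
```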
There is something fishy going on for sure. SM utilization is usually a good thing. It's apparently doing extra work for some reason...? Higher GPU power consumption too. I'm very...
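For sanity-checking what the profiler reports, SM utilization and power draw can be read directly through NVML. Device index 0 is an assumption here:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # sampled over a recent window
power = pynvml.nvmlDeviceGetPowerUsage(handle)       # milliwatts

print(f"SM: {util.gpu}%  mem: {util.memory}%  power: {power / 1000:.1f} W")
pynvml.nvmlShutdown()
```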
@dvoidus : Well, I've put graphs on hold for now, because it turns out there's too much overhead per graph launch for it to be beneficial until I compile basically...
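For anyone following along, the graph approach in question is PyTorch's CUDA graph capture/replay; a graph only pays off when it bundles enough kernels to amortize the per-launch overhead. A sketch of the standard pattern, not code from this repo:

```python
import torch

static_x = torch.randn(1, 4096, device="cuda")
w = torch.randn(4096, 4096, device="cuda")

# Warm-up on a side stream so capture sees steady-state allocations
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        y = static_x @ w
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_y = static_x @ w  # captured, not executed

static_x.copy_(torch.randn(1, 4096, device="cuda"))  # update input in place
g.replay()                   # one launch for the whole captured graph
torch.cuda.synchronize()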
Well, after I discovered inference on long sequences is 2-4x faster than I thought it was, maybe evaluating every prompt from the beginning isn't such a big deal after all....
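The alternative being weighed there is keeping the K/V cache for the longest shared prefix and only re-evaluating the tail of the new prompt. A sketch of that idea with hypothetical helper names (`cache.truncate` is not a real API here):

```python
def common_prefix_len(old_ids: list[int], new_ids: list[int]) -> int:
    n = 0
    for a, b in zip(old_ids, new_ids):
        if a != b:
            break
        n += 1
    return n

def reuse_cache(old_ids, new_ids, cache):
    keep = common_prefix_len(old_ids, new_ids)
    cache.truncate(keep)      # hypothetical: drop cache entries past the prefix
    return new_ids[keep:]     # only these tokens need a forward pass
```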
So, I can't actually get this to produce any output? If I just run it as is, with a prompt of "Hello?" and a breakpoint in the stream() function, the...