turboderp

180 comments of turboderp

There's definitely some room for improvement, but you're not going to see anything on the order of the difference in cost between the A100 and the 3090. When you're memory-bound,...

ExLlama can do batch generation, but ExLlamaHF was written by ooba as a wrapper specifically for TGW. I don't know how batches are normally handled in TGW, so I can't...

There's an example in `example_batch.py`. It just calls `generate_simple` with a list of input strings rather than a single string, and then it returns a list of outputs instead of...
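The calling pattern described above can be sketched as follows. This is an illustrative stand-in, not ExLlama's actual implementation: `generate_simple` accepts either a single string or a list of strings, and the shape of the return value mirrors the input.

```python
# Hedged sketch of the example_batch.py calling convention: pass a list of
# prompts instead of one string, get a list of completions back. The body
# here is a stub; the real generator tokenizes and decodes the batch.
def generate_simple_stub(prompt, max_new_tokens=20):
    if isinstance(prompt, list):
        # Batch path: each prompt in the list yields one output string.
        return [p + " <generated>" for p in prompt]
    # Single-prompt path: one string in, one string out.
    return prompt + " <generated>"

single = generate_simple_stub("Hello")                  # returns a str
batch = generate_simple_stub(["Hello", "Goodbye"])      # returns a list of str
```

The real call site looks the same from the user's side: the only change for batching is passing a list, so no separate batch API is needed.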

Well, rotary position embeddings are supposed to be relative, i.e. they affect attention between positions *m* and *n* in a way that depends only on the difference, *n* - *m*....
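That relative-position property is easy to verify numerically in a single 2-D frequency subspace. This is a minimal NumPy sketch of the general idea, not ExLlama's code; the vectors and angle are arbitrary:

```python
# RoPE rotates the query at position m by m*theta and the key at position n
# by n*theta; the dot product then depends only on n - m, since
# (R(m*theta) q) . (R(n*theta) k) = q . R((n - m)*theta) k.
import numpy as np

def rotate(vec, pos, theta=0.1):
    # 2-D rotation by pos * theta, as RoPE applies per frequency pair.
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    return np.array([c * vec[0] - s * vec[1], s * vec[0] + c * vec[1]])

q = np.array([0.3, -1.2])  # arbitrary query vector
k = np.array([0.7, 0.5])   # arbitrary key vector

score_a = rotate(q, 5) @ rotate(k, 9)        # m=5,   n=9   -> n - m = 4
score_b = rotate(q, 100) @ rotate(k, 104)    # m=100, n=104 -> n - m = 4
assert np.allclose(score_a, score_b)         # same offset, same score
```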

>For my use case, the main metric I care about is time to first token. What does that look like for 3B?

Well, on the 4090 I'm getting about 16,500 tokens/second...

[Here's one](https://huggingface.co/iambestfeed/open_llama_3b_4bit_128g). It's the one the results in the readme are based on. Seems to work alright.

Sorry, I apparently missed this one. The cache is contained in the ExLlamaCache class, which is just a wrapper for two lists of preallocated tensors, one pair for each layer...
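The structure described above can be sketched like this, using NumPy arrays as stand-ins for the preallocated CUDA tensors; the class and attribute names are illustrative assumptions, not ExLlama's exact layout:

```python
# Minimal sketch of a K/V cache in the shape described above: two lists of
# preallocated buffers, one (key, value) pair per transformer layer.
import numpy as np

class CacheSketch:
    def __init__(self, num_layers, batch_size, max_seq_len, num_heads, head_dim):
        shape = (batch_size, num_heads, max_seq_len, head_dim)
        # Preallocate the full-length buffers up front, one pair per layer,
        # so generation never reallocates as the sequence grows.
        self.key_states = [np.zeros(shape, dtype=np.float16) for _ in range(num_layers)]
        self.value_states = [np.zeros(shape, dtype=np.float16) for _ in range(num_layers)]
        self.current_seq_len = 0  # positions filled so far

cache = CacheSketch(num_layers=2, batch_size=1, max_seq_len=16, num_heads=4, head_dim=8)
```

Preallocating to the maximum sequence length trades memory for speed: each generated token just writes into the next slot instead of concatenating tensors.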

>So, there are still some half2 problem in fused MLP.

Do they actually matter, though? Maybe you could benchmark it with and without `--silu_no_half2` to see if it's even worthwhile...

I know on Windows, Hardware-Accelerated GPU Scheduling can make a big difference to performance, so you might try enabling that. But even without that you should be seeing more t/s...

`--affinity` would only matter if for some reason the OS scheduler isn't doing its job properly and assigning the process to performance cores, which it should do automatically. The fact...