turboderp

180 comments of turboderp

There's definitely some room for improvement, but you're not going to see anything on the order of the difference in cost between the A100 and the 3090. When you're memory-bound,...

ExLlama can do batch generation, but ExLlamaHF was written by ooba as a wrapper specifically for TGW. I don't know how batches are normally handled in TGW, so I can't...

There's an example in `example_batch.py`. It just calls `generate_simple` with a list of input strings rather than a single string, and then it returns a list of outputs instead of...
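The calling pattern described above can be sketched as follows. This is an illustrative stand-in, not ExLlama's actual implementation: `generate_simple` accepts either a single string or a list of strings, and the shape of the return value mirrors the input.

```python
# Hedged sketch of the example_batch.py calling convention: pass a list of
# prompts instead of one string, get a list of completions back. The body
# here is a stub; the real generator tokenizes and decodes the batch.
def generate_simple_stub(prompt, max_new_tokens=20):
    if isinstance(prompt, list):
        # Batch path: each prompt in the list yields one output string.
        return [p + " <generated>" for p in prompt]
    # Single-prompt path: one string in, one string out.
    return prompt + " <generated>"

single = generate_simple_stub("Hello")                  # returns a str
batch = generate_simple_stub(["Hello", "Goodbye"])      # returns a list of str
```

The real call site looks the same from the user's side: the only change for batching is passing a list, so no separate batch API is needed.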

Well, rotary position embeddings are supposed to be relative, i.e. they affect attention between positions *m* and *n* in a way that depends only on the difference, *n* - *m*....
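That relative-position property is easy to verify numerically in a single 2-D frequency subspace. This is a minimal NumPy sketch of the general idea, not ExLlama's code; the vectors and angle are arbitrary:

```python
# RoPE rotates the query at position m by m*theta and the key at position n
# by n*theta; the dot product then depends only on n - m, since
# (R(m*theta) q) . (R(n*theta) k) = q . R((n - m)*theta) k.
import numpy as np

def rotate(vec, pos, theta=0.1):
    # 2-D rotation by pos * theta, as RoPE applies per frequency pair.
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    return np.array([c * vec[0] - s * vec[1], s * vec[0] + c * vec[1]])

q = np.array([0.3, -1.2])  # arbitrary query vector
k = np.array([0.7, 0.5])   # arbitrary key vector

score_a = rotate(q, 5) @ rotate(k, 9)        # m=5,   n=9   -> n - m = 4
score_b = rotate(q, 100) @ rotate(k, 104)    # m=100, n=104 -> n - m = 4
assert np.allclose(score_a, score_b)         # same offset, same score
```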

>For my use case, the main metric I care about is time to first token. What does that look like for 3B?

Well, on the 4090 I'm getting about 16,500 tokens/second...

[Here's one](https://huggingface.co/iambestfeed/open_llama_3b_4bit_128g). It's the one the results in the readme are based on. Seems to work alright.

Sorry, I apparently missed this one. The cache is contained in the ExLlamaCache class, which is just a wrapper for two lists of preallocated tensors, one pair for each layer...
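The structure described above can be sketched like this, using NumPy arrays as stand-ins for the preallocated CUDA tensors; the class and attribute names are illustrative assumptions, not ExLlama's exact layout:

```python
# Minimal sketch of a K/V cache in the shape described above: two lists of
# preallocated buffers, one (key, value) pair per transformer layer.
import numpy as np

class CacheSketch:
    def __init__(self, num_layers, batch_size, max_seq_len, num_heads, head_dim):
        shape = (batch_size, num_heads, max_seq_len, head_dim)
        # Preallocate the full-length buffers up front, one pair per layer,
        # so generation never reallocates as the sequence grows.
        self.key_states = [np.zeros(shape, dtype=np.float16) for _ in range(num_layers)]
        self.value_states = [np.zeros(shape, dtype=np.float16) for _ in range(num_layers)]
        self.current_seq_len = 0  # positions filled so far

cache = CacheSketch(num_layers=2, batch_size=1, max_seq_len=16, num_heads=4, head_dim=8)
```

Preallocating to the maximum sequence length trades memory for speed: each generated token just writes into the next slot instead of concatenating tensors.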

>So, there are still some half2 problem in fused MLP.

Do they actually matter, though? Maybe you could benchmark it with and without `--silu_no_half2` to see if it's even worthwhile...

I know on Windows, Hardware-Accelerated GPU Scheduling can make a big difference to performance, so you might try enabling that. But even without that you should be seeing more t/s...

`--affinity` would only matter if for some reason the OS scheduler isn't doing its job properly and assigning the process to performance cores, which it should do automatically. The fact...