turboderp

Results: 180 comments by turboderp

I've had a quick look at SqueezeLLM. From what I can tell it's another quantization scheme that makes big promises but isn't even fully published yet. There's just example code...

You can process in batches if you have enough VRAM to allocate the cache with a larger batch size. There's an example of how this works in `example_batch.py`, using `generate_simple`....
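To make the batching point concrete, here is a minimal sketch along the lines of `example_batch.py`: the cache is allocated with `batch_size = len(prompts)` and `generate_simple` is called with a list of prompts. The model directory path and sampler settings below are placeholders, and exact signatures may differ between ExLlama versions.

```python
import os, glob

# These modules live at the top level of the ExLlama repo
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

# Placeholder path to a quantized model directory (config.json, tokenizer.model, *.safetensors)
model_directory = "/path/to/llama-13b-4bit-128g/"

tokenizer_path = os.path.join(model_directory, "tokenizer.model")
model_config_path = os.path.join(model_directory, "config.json")
model_path = glob.glob(os.path.join(model_directory, "*.safetensors"))[0]

prompts = ["Once upon a time,",
           "The quick brown fox"]

config = ExLlamaConfig(model_config_path)       # read model hyperparameters from config.json
config.model_path = model_path                  # point at the quantized weights

model = ExLlama(config)                         # load the model
tokenizer = ExLlamaTokenizer(tokenizer_path)

# Allocate the cache for the whole batch up front; this is where the extra VRAM goes
cache = ExLlamaCache(model, batch_size = len(prompts))
generator = ExLlamaGenerator(model, tokenizer, cache)

# Each forward pass now advances every sequence in the batch by one token
output = generator.generate_simple(prompts, max_new_tokens = 200)
for completion in output:
    print(completion)
```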

Well, the implementation isn't threadsafe, and you wouldn't want two threads both trying to put a 100% load on the GPU anyway. Batching is great, though, because generating two replies...

Batch processing is always going to be way faster and use less VRAM than running multiple instances of the model, or running the same model on multiple sequences in a...

> 2 * 0.31 prompts/second = 0.62 prompts/second (two instances running at same time)

But how would you get double the speed when running two instances at the same time? With...

If you try to run two at the same time they will compete for CUDA cores and memory bandwidth. Best case scenario, they'll both be running at half speed, but...
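To put rough numbers on the half-speed point (an illustrative calculation using the 0.31 prompts/second figure from the quote above): two instances each running at roughly half speed give about 2 × 0.155 ≈ 0.31 prompts/second in aggregate, i.e. no net gain over a single instance, and that is before accounting for any overhead from the two processes contending for the GPU.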

Well, `env | grep CUDA` should tell you if the environment variable is set. If not, `export CUDA_HOME=`. As for the differences between ExLlama and GPTQ-for-LLaMa, they are numerous. ExLlama...

I'd like to see some results from finetuning before I go and add even more config options. If I built out ExLlama every time someone had an interesting idea on...

I haven't tested a 3b model, or anything OpenLlama for that matter. Would you mind sharing the quantized model on HF? I can give it a test and see what's...