turboderp

Results: 180 comments by turboderp

I've had a quick look at SqueezeLLM. From what I can tell it's another quantization scheme that makes big promises but isn't even fully published yet. There's just example code...

You can process in batches if you have enough VRAM to allocate the cache with a larger batch size. There's an example of how this works in `example_batch.py`, using `generate_simple`....
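To make the batching point concrete, here is a minimal sketch along the lines of `example_batch.py`: the cache is allocated with `batch_size = len(prompts)` and `generate_simple` is called with a list of prompts. The model directory path and sampler settings below are placeholders, and exact signatures may differ between ExLlama versions.

```python
import os, glob

# These modules live at the top level of the ExLlama repo
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

# Placeholder path to a quantized model directory (config.json, tokenizer.model, *.safetensors)
model_directory = "/path/to/llama-13b-4bit-128g/"

tokenizer_path = os.path.join(model_directory, "tokenizer.model")
model_config_path = os.path.join(model_directory, "config.json")
model_path = glob.glob(os.path.join(model_directory, "*.safetensors"))[0]

prompts = ["Once upon a time,",
           "The quick brown fox"]

config = ExLlamaConfig(model_config_path)       # read model hyperparameters from config.json
config.model_path = model_path                  # point at the quantized weights

model = ExLlama(config)                         # load the model
tokenizer = ExLlamaTokenizer(tokenizer_path)

# Allocate the cache for the whole batch up front; this is where the extra VRAM goes
cache = ExLlamaCache(model, batch_size = len(prompts))
generator = ExLlamaGenerator(model, tokenizer, cache)

# Each forward pass now advances every sequence in the batch by one token
output = generator.generate_simple(prompts, max_new_tokens = 200)
for completion in output:
    print(completion)
```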

Well, the implementation isn't threadsafe, and you wouldn't want two threads both trying to put a 100% load on the GPU anyway. Batching is great, though, because generating two replies...

Batch processing is always going to be way faster and use less VRAM than running multiple instances of the model, or running the same model on multiple sequences in a...

> 2 * 0.31 prompts/second = 0.62 prompts/second (two instances running at same time)

But how would you get double the speed when running two instances at the same time? With...

If you try to run two at the same time they will compete for CUDA cores and memory bandwidth. Best case scenario, they'll both be running at half speed, but...
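To put rough numbers on the half-speed point (an illustrative calculation using the 0.31 prompts/second figure from the quote above): two instances each running at roughly half speed give about 2 × 0.155 ≈ 0.31 prompts/second in aggregate, i.e. no net gain over a single instance, and that is before accounting for any overhead from the two processes contending for the GPU.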

Well, `env | grep CUDA` should tell you if the environment variable is set. If not, `export CUDA_HOME=`. As for the differences between ExLlama and GPTQ-for-LLaMa, they are numerous. ExLlama...

I'd like to see some results from finetuning before I go and add even more config options. If I built out ExLlama every time someone had an interesting idea on...

I haven't tested a 3b model, or anything OpenLlama for that matter. Would you mind sharing the quantized model on HF? I can give it a test and see what's...