turboderp
You need to define how weights are to be split across the GPUs. There's a bit of trial and error to it at the moment, since you're only supplying the maximum allocation...
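For reference, something along these lines (a sketch, assuming you're running from the exllama repo root; `set_auto_map` takes per-GPU VRAM budgets in GB):

```python
from model import ExLlama, ExLlamaConfig  # assumes the repo root is on the path

config = ExLlamaConfig("/path/to/model/config.json")
config.model_path = "/path/to/model/model.safetensors"

# Allow up to ~10 GB on GPU 0 and ~24 GB on GPU 1. If loading fails or one
# device runs out of memory, adjust the numbers and retry -- that's the
# trial-and-error part.
config.set_auto_map("10,24")

model = ExLlama(config)
```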
Where is this? Invite me! Edit: Never mind I found it.
Well, there's a whole discussion about the "correct" way to tokenize inputs to language models. According to the SentencePiece authors, it's not correct to pass control symbols to the encoder,...
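To illustrate with the sentencepiece Python bindings (a minimal sketch, assuming a standard `tokenizer.model`):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

# Not correct, per the SentencePiece authors: the encoder treats "<s>" as
# ordinary text and splits it into pieces like "<", "s", ">".
bad_ids = sp.encode("<s>Hello")

# Correct: encode only the raw text, then prepend the control symbol's id.
good_ids = [sp.bos_id()] + sp.encode("Hello")
```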
I don't think this is the right approach. Truncation is now a regular part of how the context window is adjusted, so it would spam hundreds of lines in the...
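To sketch why: once the context fills up, truncation happens on essentially every call, so any per-event message repeats constantly. About the only non-spammy variant would be a warn-once guard (hypothetical names, not actual exllama code):

```python
import logging

logger = logging.getLogger(__name__)
_warned_truncation = False

def truncate_context(ids, max_len):
    """Drop the oldest tokens so the sequence fits the context window."""
    global _warned_truncation
    if len(ids) > max_len:
        if not _warned_truncation:
            logger.warning("Context exceeded %d tokens; truncating from the "
                           "left (further truncations will be silent).", max_len)
            _warned_truncation = True
        ids = ids[-max_len:]
    return ids
```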
I'm sorry, I really don't know anything about Docker. @nopperl did the Docker stuff, maybe they can help?
As far as I can tell, it's very hard to use efficiently in CUDA, since you need to run every quantized element through a lookup table, or, as they've done...
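A NumPy sketch of the difference (illustration only, nothing to do with the actual kernels):

```python
import numpy as np

rng = np.random.default_rng(0)
idx = rng.integers(0, 16, size=1_000_000).astype(np.uint8)  # 4-bit indices

# Codebook quantization: every element is a gather through a lookup table.
# On a GPU that means a scattered memory read per element, which is hard
# to make fast.
codebook = rng.standard_normal(16).astype(np.float32)
deq_lut = codebook[idx]

# Linear (scale/zero-point) quantization: pure arithmetic, no table lookups,
# so it maps cleanly onto fused multiply-adds in a kernel.
scale, zero = np.float32(0.1), np.float32(8.0)
deq_linear = (idx.astype(np.float32) - zero) * scale
```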
Yes, this is planned. Stay tuned.
The kernels are very specifically optimized for matrix-vector operations (batch size = 1). They also do well on matrix-matrix operations by reconstructing full-precision matrices on the fly and relying on cuBLAS...
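Roughly, the dispatch looks like this (hypothetical helper names, not the real kernel interface):

```python
import torch

def quant_forward(x, qweight, dequantize, matvec_kernel):
    # x: (batch, in_features). `matvec_kernel` stands in for the hand-written
    # fused dequant+GEMV path; `dequantize` reconstructs the fp16 weights.
    if x.shape[0] == 1:
        # Batch size 1: the specialized matrix-vector kernel.
        return matvec_kernel(x, qweight)
    # Larger batches: reconstruct the full-precision matrix on the fly and
    # let cuBLAS (via torch.matmul) handle the matrix-matrix product.
    w = dequantize(qweight)
    return torch.matmul(x, w)
```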
There's no support for concurrency, no. You'd need a separate instance for each thread, with its own generator and cache, and some mechanism for sensibly splitting the work between threads,...
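A rough sketch of that arrangement (`make_generator` is a hypothetical per-thread factory; `generate_simple` follows the generator API, but check the signature against your version):

```python
import threading
import queue

def make_generator():
    # Hypothetical factory: each thread builds its own model instance,
    # cache and generator, since none of those can be shared.
    raise NotImplementedError

def worker(jobs, results):
    gen = make_generator()              # private instance per thread
    while True:
        job_id, prompt = jobs.get()
        if job_id is None:              # sentinel: stop this worker
            break
        results.put((job_id, gen.generate_simple(prompt, max_new_tokens=128)))

jobs, results = queue.Queue(), queue.Queue()
workers = [threading.Thread(target=worker, args=(jobs, results), daemon=True)
           for _ in range(2)]
for t in workers:
    t.start()
```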
I'm not sure how they arrive at those results. Plain HF Transformers can be mighty slow, but you have to really try to make it *that* slow, I feel. As...