Nicolas Patry
@0x1997 But the error happens when it looks for `g_idx`. This has nothing to do with `group_size=-1`, does it? (It's always defined, no?) IIUC, `groupsize=-1` simply means full...
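For readers unfamiliar with these names: in GPTQ-style checkpoints, `g_idx` maps each weight column to a quantization group, and each group carries its own scale. Here is a minimal, hypothetical sketch (not the TGI implementation) of round-to-nearest group quantization showing why `groupsize=-1` just means "one group spanning the full row", so `g_idx` is still defined:

```python
# Hypothetical sketch of GPTQ-style group quantization (round-to-nearest,
# symmetric int4). Not the actual TGI/GPTQ code; names are illustrative.

def make_g_idx(n_cols, groupsize):
    # groupsize == -1 means "full row": a single group covers every column,
    # so g_idx is all zeros -- but it still exists.
    if groupsize == -1:
        groupsize = n_cols
    return [c // groupsize for c in range(n_cols)]

def quantize_row(row, groupsize, bits=4):
    g_idx = make_g_idx(len(row), groupsize)
    n_groups = g_idx[-1] + 1
    qmax = 2 ** (bits - 1) - 1  # symmetric int4 range: [-7, 7]
    # One scale per group, taken from the group's max absolute value.
    scales = []
    for g in range(n_groups):
        vals = [abs(row[c]) for c in range(len(row)) if g_idx[c] == g]
        scales.append(max(vals) / qmax if max(vals) > 0 else 1.0)
    q = [round(row[c] / scales[g_idx[c]]) for c in range(len(row))]
    deq = [qc * scales[g] for qc, g in zip(q, g_idx)]
    return g_idx, q, deq

row = [0.1, -0.2, 0.4, 3.0, -1.5, 0.7, 0.05, -0.9]
g_idx, q, deq = quantize_row(row, groupsize=4)   # two groups of 4 columns
g_idx_full, _, _ = quantize_row(row, groupsize=-1)
# g_idx_full == [0, 0, 0, 0, 0, 0, 0, 0]: one scale for the whole row.
```

Smaller groups give finer-grained scales (better accuracy) at the cost of storing more scale values.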
> @jgcb00 yes, switching it to "auto" will fix it

`auto` doesn't fix anything, since it will cram GPU 0 full, and then there's not enough room to create the...
No. Adding flags everywhere is not good. If you know what you're doing you can edit the code yourself. For the vast majority, we need to figure out the sanest...
If you're interested, you can compile the tokenizer down to WASM, which would make it usable on the web. (It's unstable because the regexp engine has to be different; this shouldn't...
Hi @jshin49, this is working as intended; the mechanism is called backpressure. Basically, when your server is getting saturated, you *want* to refuse new requests, otherwise you will...
> If I'm using 8 A100 GPUs, can I have a bigger --max-concurrent-requests than if I'm using only 2?

Yes, if you shard across the 8 GPUs, `--num-shard 8`....
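The backpressure idea described above can be sketched in a few lines. This is a hedged, hypothetical illustration (not the TGI server code): once in-flight requests hit the configured cap, new work is shed immediately, typically with an HTTP 429, instead of queueing unboundedly:

```python
# Hypothetical sketch of request shedding / backpressure.
# Names like Server and try_accept are illustrative, not from TGI.

class Server:
    def __init__(self, max_concurrent_requests):
        self.max_concurrent = max_concurrent_requests
        self.in_flight = 0

    def try_accept(self):
        # Refuse (HTTP 429 in practice) rather than letting the queue
        # grow without bound while the server is saturated.
        if self.in_flight >= self.max_concurrent:
            return False
        self.in_flight += 1
        return True

    def finish(self):
        self.in_flight -= 1

srv = Server(max_concurrent_requests=2)
accepted = [srv.try_accept() for _ in range(3)]
# accepted == [True, True, False]: the third request is shed.
srv.finish()
# Capacity freed, so new work is accepted again.
```

Refusing early keeps latency bounded for the requests you do accept, which is why saturating a bigger deployment (more shards, more capacity) lets you raise the cap rather than remove it.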
This should help you get started: https://www.youtube.com/watch?v=jlMAX2Oaht0
We're implementing GPTQ https://github.com/huggingface/text-generation-inference/pull/438 which to the best of my knowledge has better latency than bitsandbytes.
Nope, GPTQ requires calibration data, but the final latency is the key thing we're after. And `bitsandbytes` 8-bit is really slow; not sure about the 4-bit, but I'd imagine it's the...
We'll definitely bench it. We added PagedAttention because it provided a lot of benefit.
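For context, the core PagedAttention idea can be sketched as follows. This is a hypothetical illustration (not the vLLM or TGI implementation): the KV cache is carved into fixed-size blocks, and each sequence keeps a block table mapping logical positions to physical blocks, so memory is allocated on demand instead of being reserved up front at maximum length:

```python
# Hypothetical sketch of paged KV-cache allocation.
# BLOCK_SIZE and the class names are illustrative, not from vLLM/TGI.

BLOCK_SIZE = 4  # tokens per KV-cache block

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        if not self.free:
            raise MemoryError("KV cache exhausted")
        return self.free.pop()

    def release(self, blocks):
        self.free.extend(blocks)

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.length = 0

    def append_token(self):
        # A new physical block is needed only every BLOCK_SIZE tokens.
        if self.length % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.length += 1

alloc = BlockAllocator(num_blocks=8)
seq = Sequence(alloc)
for _ in range(9):  # 9 tokens -> ceil(9 / 4) = 3 blocks
    seq.append_token()
# len(seq.block_table) == 3; the other 5 blocks stay free for other sequences.
```

Because blocks are only claimed as a sequence actually grows, many more concurrent sequences fit in the same GPU memory than with contiguous per-sequence reservations, which is where the throughput benefit comes from.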