Nicolas Patry
@0x1997 But the error happens when it looks for `g_idx`. This has nothing to do with `group_size=-1`, does it? (It's always defined, no?) IIUC, `groupsize=-1` simply means full...
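For readers unfamiliar with these names: in GPTQ-style checkpoints, `g_idx` maps each weight column to a quantization group, and each group carries its own scale. Here is a minimal, hypothetical sketch (not the TGI implementation) of round-to-nearest group quantization showing why `groupsize=-1` just means "one group spanning the full row", so `g_idx` is still defined:

```python
# Hypothetical sketch of GPTQ-style group quantization (round-to-nearest,
# symmetric int4). Not the actual TGI/GPTQ code; names are illustrative.

def make_g_idx(n_cols, groupsize):
    # groupsize == -1 means "full row": a single group covers every column,
    # so g_idx is all zeros -- but it still exists.
    if groupsize == -1:
        groupsize = n_cols
    return [c // groupsize for c in range(n_cols)]

def quantize_row(row, groupsize, bits=4):
    g_idx = make_g_idx(len(row), groupsize)
    n_groups = g_idx[-1] + 1
    qmax = 2 ** (bits - 1) - 1  # symmetric int4 range: [-7, 7]
    # One scale per group, taken from the group's max absolute value.
    scales = []
    for g in range(n_groups):
        vals = [abs(row[c]) for c in range(len(row)) if g_idx[c] == g]
        scales.append(max(vals) / qmax if max(vals) > 0 else 1.0)
    q = [round(row[c] / scales[g_idx[c]]) for c in range(len(row))]
    deq = [qc * scales[g] for qc, g in zip(q, g_idx)]
    return g_idx, q, deq

row = [0.1, -0.2, 0.4, 3.0, -1.5, 0.7, 0.05, -0.9]
g_idx, q, deq = quantize_row(row, groupsize=4)   # two groups of 4 columns
g_idx_full, _, _ = quantize_row(row, groupsize=-1)
# g_idx_full == [0, 0, 0, 0, 0, 0, 0, 0]: one scale for the whole row.
```

Smaller groups give finer-grained scales (better accuracy) at the cost of storing more scale values.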
> @jgcb00 yes, switching it to "auto" will fix it

`auto` doesn't fix anything, since it will cram GPU 0 full, and then there's not enough room to create the...
No. Adding flags everywhere is not good. If you know what you're doing you can edit the code yourself. For the vast majority, we need to figure out the sanest...
If you're interested, you can compile the tokenizer down to WASM, which would make it usable on the web. (It's unstable because the regexp engine has to be different; this shouldn't...
Hi @jshin49, this is working as intended; the mechanism is called backpressure. Basically, when your server is getting saturated, you *want* to refuse new requests, otherwise you will...
> If I'm using 8 A100 GPUs, can I have a bigger --max-concurrent-requests than if I'm using only 2?

Yes, if you shard across the 8 GPUs, `--num-shard 8`....
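The backpressure idea described above can be sketched in a few lines. This is a hedged, hypothetical illustration (not the TGI server code): once in-flight requests hit the configured cap, new work is shed immediately, typically with an HTTP 429, instead of queueing unboundedly:

```python
# Hypothetical sketch of request shedding / backpressure.
# Names like Server and try_accept are illustrative, not from TGI.

class Server:
    def __init__(self, max_concurrent_requests):
        self.max_concurrent = max_concurrent_requests
        self.in_flight = 0

    def try_accept(self):
        # Refuse (HTTP 429 in practice) rather than letting the queue
        # grow without bound while the server is saturated.
        if self.in_flight >= self.max_concurrent:
            return False
        self.in_flight += 1
        return True

    def finish(self):
        self.in_flight -= 1

srv = Server(max_concurrent_requests=2)
accepted = [srv.try_accept() for _ in range(3)]
# accepted == [True, True, False]: the third request is shed.
srv.finish()
# Capacity freed, so new work is accepted again.
```

Refusing early keeps latency bounded for the requests you do accept, which is why saturating a bigger deployment (more shards, more capacity) lets you raise the cap rather than remove it.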
This should help you get started: https://www.youtube.com/watch?v=jlMAX2Oaht0
We're implementing GPTQ https://github.com/huggingface/text-generation-inference/pull/438 which to the best of my knowledge has better latency than bitsandbytes.
Nope, GPTQ requires calibration data, but the final latency is the key thing we're after. And `bitsandbytes` 8-bit is really slow; not sure about the 4-bit, but I'd imagine it's the...
We'll definitely bench it. We added PagedAttention because it provided a lot of benefit.
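For context, the core PagedAttention idea can be sketched as follows. This is a hypothetical illustration (not the vLLM or TGI implementation): the KV cache is carved into fixed-size blocks, and each sequence keeps a block table mapping logical positions to physical blocks, so memory is allocated on demand instead of being reserved up front at maximum length:

```python
# Hypothetical sketch of paged KV-cache allocation.
# BLOCK_SIZE and the class names are illustrative, not from vLLM/TGI.

BLOCK_SIZE = 4  # tokens per KV-cache block

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        if not self.free:
            raise MemoryError("KV cache exhausted")
        return self.free.pop()

    def release(self, blocks):
        self.free.extend(blocks)

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.length = 0

    def append_token(self):
        # A new physical block is needed only every BLOCK_SIZE tokens.
        if self.length % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.length += 1

alloc = BlockAllocator(num_blocks=8)
seq = Sequence(alloc)
for _ in range(9):  # 9 tokens -> ceil(9 / 4) = 3 blocks
    seq.append_token()
# len(seq.block_table) == 3; the other 5 blocks stay free for other sequences.
```

Because blocks are only claimed as a sequence actually grows, many more concurrent sequences fit in the same GPU memory than with contiguous per-sequence reservations, which is where the throughput benefit comes from.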