Nicolas Patry


https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/utils/layers.py#L133-L164 All the code is there indeed.

No. bitsandbytes is slow because it does more computation afaik.

EETQ is missing from the docker image, my bad on this: https://github.com/huggingface/text-generation-inference/pull/1081

The error is in the protobuf version: the model you linked doesn't use a fast tokenizer (which is needed for additional checks in `text-generation-inference`), and the script fails during the conversion...

1. `pip install protobuf==3.19`
2. Check for `tokenizer.json` in the repo, that's the file used by fast tokenizers. Usually we can create a fast tokenizer from a slow one (see the sketch below), but...
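For step 2, a minimal sketch of producing a `tokenizer.json` by converting a slow tokenizer to a fast one, assuming the repo only ships slow tokenizer files and that `sentencepiece` and a compatible `protobuf` are installed (the repo id below is hypothetical):

```python
from transformers import AutoTokenizer

# Load the tokenizer and let transformers convert the slow tokenizer
# to a fast one. This conversion step is what needs protobuf/sentencepiece.
tokenizer = AutoTokenizer.from_pretrained("some-org/some-model", use_fast=True)  # hypothetical repo id

# Saving a fast tokenizer writes tokenizer.json, which
# text-generation-inference can then pick up.
tokenizer.save_pretrained("./converted-tokenizer")
```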

Wait for this to land: https://github.com/huggingface/text-generation-inference/pull/438 so you can use a lower-latency kernel (GPTQ).

GPTQ models are as fast as the non-quantized versions. I never ran bitsandbytes, so I have no clue, but iirc it is multiple times slower (~4x maybe?).

> I only need to replace --quantize "gptq" instead of --quantize "bitsandbytes". Correct? Or do I also need to replace the docker image?

Well you would need the newest docker...

Hey, I don't know for sure. The most obvious way would be to "write" the LoRA weights directly into your model, creating an entirely new, LoRA-free model. Not sure if/how...
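A minimal sketch of that approach using PEFT's `merge_and_unload`, assuming the adapter was trained with PEFT (the model and adapter ids below are hypothetical):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model and attach the LoRA adapter.
base = AutoModelForCausalLM.from_pretrained("some-org/base-model")      # hypothetical base model
model = PeftModel.from_pretrained(base, "some-org/lora-adapter")        # hypothetical adapter

# Fold the LoRA weights into the base weights, producing a plain model
# with no adapter layers left.
merged = model.merge_and_unload()

# Save it as a regular transformers model that text-generation-inference can serve.
merged.save_pretrained("./merged-model")
```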

We're going to do that automatically for you soon: https://github.com/huggingface/text-generation-inference/pull/762

In the meantime: https://github.com/huggingface/text-generation-inference/issues/482#issuecomment-1602174068

Closing this in favor of #482.