text-generation-inference
Inference support for GPTQ (llama + falcon tested) + Quantization script
Let's start discussing implementation.
- Need to expose the quantization scripts (either include them here or add docs on how to use https://github.com/qwopqwop200/GPTQ-for-LLaMa)
- Make sure GPTQ works for multiple models (priority to Falcon).
Currently this means that every place where we use get_{tensor|sharded} has to check for quantization.
My idea is to reintegrate as much as possible into utils/layers.py by expanding load_multi to be a bit more generic.
This might require some thinking, but ultimately qweight, qzeros, scales, and g_idx should live in a single place, independent of whether a bias is present.
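For illustration, a rough sketch of what that grouping could look like (the helper name and return shape are just illustrative, not the final API; it only assumes the existing Weights.get_tensor accessor, which raises RuntimeError for missing tensors):

```python
# Hypothetical helper, for illustration only: fetch all GPTQ tensors for a
# prefix in one place so layers.py doesn't have to special-case quantization.
def get_gptq_weights(weights, prefix: str):
    qweight = weights.get_tensor(f"{prefix}.qweight")
    qzeros = weights.get_tensor(f"{prefix}.qzeros")
    scales = weights.get_tensor(f"{prefix}.scales")
    g_idx = weights.get_tensor(f"{prefix}.g_idx")
    try:
        bias = weights.get_tensor(f"{prefix}.bias")  # bias is optional...
    except RuntimeError:
        bias = None  # ...and independent of the quantized tensors
    return (qweight, qzeros, scales, g_idx), bias
```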
What does this PR do?
Fixes # (issue)
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
Hi @Narsil amazing work!
I tried your branch with https://huggingface.co/huggingface/falcon-40b-gptq today and was able to get 44ms per token on 4 x A10G, as opposed to the 180ms per token I previously got with falcon-40b-instruct + bitsandbytes.
Wondering when this PR will be merged and whether you will be uploading a falcon-40b-instruct-gptq as well? I think many, including myself, don't have access to a GPU with enough VRAM to quantize a 40b model.
> Wondering when this PR will be merged and whether you will be uploading a falcon-40b-instruct-gptq as well? I think many, including myself, don't have access to a GPU with enough VRAM to quantize a 40b model.
It should technically fit even if you have a very simple GPU, as long as you can load a single layer on the GPU. I didn't go too extreme, but we're leveraging accelerate there, which should offload to CPU RAM and disk when needed, and the quantization is done layer by layer (you still need to fit the model in CPU RAM, though, plus the quantized weights, in order to be able to save them).
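For illustration, a minimal sketch of that layer-by-layer flow (not the actual quantize.py code; the GPTQ step itself is passed in as a hypothetical callback, and real decoder layers also need attention masks etc., elided here):

```python
import torch

@torch.no_grad()
def quantize_layer_by_layer(layers, inps, quantize_fn, device="cuda:0"):
    """Quantize decoder layers one at a time: only a single layer ever
    lives on the GPU, everything else stays in CPU RAM."""
    quantizers = {}
    for i in range(len(layers)):
        layer = layers[i].to(device)              # move just this layer to the GPU
        quantizers[i] = quantize_fn(layer, inps)  # hypothetical per-layer GPTQ step
        # run the calibration activations through the layer so they become
        # the inputs of the next layer
        inps = [layer(x.to(device)) for x in inps]
        layers[i] = layer.cpu()                   # free GPU memory before moving on
        torch.cuda.empty_cache()
    return quantizers
```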
Is the act-order option supported?
> Is the act-order option supported?
It exists in the code since everything is simply pulled in, but it's not exposed yet. Any good info on what act-order does and its implications? (If there's now a good default, we'll just use that.)
For now it's False by default like in https://github.com/qwopqwop200/GPTQ-for-LLaMa
@Narsil It's slower but has better accuracy. https://github.com/lm-sys/FastChat/blob/main/docs/gptq.md
@Narsil For higher speed up of LLaMA models, you can checkout the https://github.com/turboderp/exllama project. I tested it with two 13B models, both quantized with group size 128 and activation order enabled, and got 2.6x speed up over running the 16bit version with text-generation-inference.
> @Narsil For higher speed up of LLaMA models, you can checkout the https://github.com/turboderp/exllama project. I tested it with two 13B models, both quantized with group size 128 and activation order enabled, and got 2.6x speed up over running the 16bit version with text-generation-inference.
Looks pretty sweet indeed. Will have to dig a bit deeper to see how much we can reuse (everything seems super llama focused for now)
> Wondering when this PR will be merged and whether you will be uploading a falcon-40b-instruct-gptq as well? I think many, including myself, don't have access to a GPU with enough VRAM to quantize a 40b model.
>
> It should technically fit even if you have a very simple GPU, as long as you can load a single layer on the GPU. I didn't go too extreme, but we're leveraging accelerate there, which should offload to CPU RAM and disk when needed, and the quantization is done layer by layer (you still need to fit the model in CPU RAM, though, plus the quantized weights, in order to be able to save them).
Do you have any recommendation for running the quantizer on GPUs with smaller VRAM? I managed to load the 40b model with offloading, but it looks like the working memory required to run fasterquant is larger than what I could fit on my GPU 0. I even tried reducing nsamples to 64 but still got a CUDA OOM.
Try using the options here: https://huggingface.co/docs/accelerate/usage_guides/big_modeling
Notably device_map = infer_auto_device_map(my_model, max_memory={0: "10GiB", 1: "10GiB", "cpu": "30GiB"}) seems like a good option to reserve enough memory on GPU 0 (you could even say 0 for GPU 0, since we're manually sending data there).
Actually, making that the default might be better than balanced_low_0.
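For concreteness, a minimal sketch of that suggestion (the model id, dtype, and no_split_module_classes value are illustrative assumptions, not something this PR prescribes):

```python
import torch
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "tiiuae/falcon-40b"  # example checkpoint, adjust to yours

# Build an empty (meta) model just to compute the placement plan.
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

# Cap GPU 0 so there is headroom left for the quantization working buffers;
# whatever doesn't fit spills to the next GPU and then to CPU RAM.
device_map = infer_auto_device_map(
    empty_model,
    max_memory={0: "10GiB", 1: "10GiB", "cpu": "30GiB"},
    no_split_module_classes=["DecoderLayer"],  # assumed Falcon block name
    dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=device_map,
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
```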
> Try using the options here: https://huggingface.co/docs/accelerate/usage_guides/big_modeling
> Notably device_map = infer_auto_device_map(my_model, max_memory={0: "10GiB", 1: "10GiB", "cpu": "30GiB"}) seems like a good option to reserve enough memory on GPU 0 (you could even say 0 for GPU 0, since we're manually sending data there). Actually, making that the default might be better than balanced_low_0.
Thanks, I actually tried that with GPU 0's max memory set to 2GB, but still no luck. Perhaps the working memory required to run quantization is more than 22GB. Another thing I noticed is that the line
layer = layers[i].to(dev)
can lead to the error "NotImplementedError: Cannot copy out of meta tensor; no data", likely because some layers are offloaded?
That means the layer wasn't loaded at all, probably disk-offloaded. I'm not familiar enough with accelerate internals, but there must be some way to fetch the information of where the weights live, so that you could fill them in before sending them to the GPU.
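A hypothetical way to do that lookup, assuming the model was loaded via from_pretrained with a device_map (which populates hf_device_map; the transformer.h prefix is Falcon-specific and just an example):

```python
def layer_devices(model, i: int) -> dict:
    # Where did accelerate place decoder layer i? Keys look like "transformer.h.0".
    return {
        name: device
        for name, device in model.hf_device_map.items()
        if name.startswith(f"transformer.h.{i}")
    }

# Before calling layers[i].to(dev): if any entry reports "disk", the layer's
# parameters are meta tensors with no data, so they would have to be reloaded
# from the offload folder first rather than moved with .to().
```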
> That means the layer wasn't loaded at all, probably disk-offloaded. I'm not familiar enough with accelerate internals, but there must be some way to fetch the information of where the weights live, so that you could fill them in before sending them to the GPU.
I see. Thanks! Let me dig a bit more to figure that out
Trying out the branch locally and running into this when trying to query the endpoint:
...
File "/home/user/text-generation-inference/server/text_generation_server/utils/gptq/quant_linear.py", line 244, in forward
output = matmul248(input, qweight, scales, qzeros, g_idx, bits, maxq)
File "/home/user/text-generation-inference/server/text_generation_server/utils/gptq/quant_linear.py", line 216, in matmul248
matmul_248_kernel[grid](
File "/home/user/text-generation-inference/server/text_generation_server/utils/gptq/custom_autotune.py", line 110, in run
timings = {
File "/home/user/text-generation-inference/server/text_generation_server/utils/gptq/custom_autotune.py", line 111, in <dictcomp>
config: self._bench(*args, config=config, **kwargs)
File "/home/user/text-generation-inference/server/text_generation_server/utils/gptq/custom_autotune.py", line 90, in _bench
return triton.testing.do_bench(
File "/home/user/miniconda3/envs/text-generation-inference/lib/python3.9/site-packages/triton/testing.py", line 144, in do_bench
torch.cuda.synchronize()
File "/home/user/miniconda3/envs/text-generation-inference/lib/python3.9/site-packages/torch/cuda/__init__.py", line 688, in synchronize
return torch._C._cuda_synchronize()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
rank=0
2023-06-19T17:52:01.967782Z ERROR batch{batch_size=1}:prefill:prefill{id=0 size=1}:prefill{id=0 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: CUDA error: an illegal memory access was encountered
Any idea? Trying it on a falcon-7b model I quantized using the script in this PR.
Running it with:
CUDA_VISIBLE_DEVICES=0 text-generation-launcher --model-id local_model_path --port 8080 --quantize gptq --sharded false
I got the following error when loading a model quantized with auto-gptq
{"timestamp":"2023-06-20T02:07:57.613035Z","level":"ERROR","fields":{"message":"Shard 0 failed to start:\nTraceback (most recent call last):\n\n File "/opt/conda/bin/text-generation-server", line 8, in
\n sys.exit(app())\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 67, in serve\n server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve\n asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))\n\n File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run\n return loop.run_until_complete(main)\n\n File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete\n return future.result()\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 124, in serve_inner\n model = get_model(model_id, revision, sharded, quantize, trust_remote_code)\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/init.py", line 178, in get_model\n return FlashLlama(\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 64, in init\n model = FlashLlamaForCausalLM(config, weights)\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 417, in init\n self.model = FlashLlamaModel(config, weights)\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 316, in init\n [\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 317, in \n FlashLlamaLayer(\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 248, in init\n self.self_attn = FlashLlamaAttention(\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 112, in init\n self.query_key_value = TensorParallelColumnLinear.load_multi(\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 226, in load_multi\n weight = weights.get_multi_weights_col(\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 95, in get_multi_weights_col\n w = [self.get_tensor(f"{p}.g_idx") for p in prefixes]\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 95, in \n w = [self.get_tensor(f"{p}.g_idx") for p in prefixes]\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 47, in get_tensor\n filename = self.get_filename(tensor_name)\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 34, in get_filename\n raise RuntimeError(f"weight {tensor_name} does not exist")\n\nRuntimeError: weight model.layers.0.self_attn.q_proj.g_idx does not exist\n\n"},"target":"text_generation_launcher"} {"timestamp":"2023-06-20T02:07:57.613081Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
Is it possible to support these models without requantization?
@psinger
Illegal access seems like a triton bug. Which GPU are you using? If it's an old one, triton might be generating invalid kernels.
@0x1997 The absence of g_idx means the checkpoint is lacking that weight. I have no idea what they called it instead.
Please note that we also modified our copy of https://github.com/qwopqwop200/GPTQ-for-LLaMa to include bits and groupsize directly in the safetensors file, so that the user doesn't need to know them in advance and specify them at load time.
I don't think either method does it that way.
@Narsil I am using A100, so not old.
I tried running it via local install and conda environment and via docker image to make sure it is not an environment issue, but same issue happens in both.
@Narsil The model was quantized with groupsize=-1. gptq_bits and gptq_groupsize are easy to patch into the model files, but I don't know if gptq_groupsize=[-1] will be handled correctly.
Another difference is weights like model.layers.0.mlp.down_proj.bias are missing when quantized with the script in this PR but present when quantized by auto-gptq.
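For reference, a sketch of the kind of patch being described, assuming the tensor names gptq_bits / gptq_groupsize mentioned above and a single-file safetensors checkpoint (whether groupsize=-1 is then handled correctly is exactly the open question):

```python
import torch
from safetensors.torch import load_file, save_file

# Patch an existing auto-gptq checkpoint so it carries the quantization
# parameters this branch reads from the file itself. Values must match how
# the model was actually quantized.
path = "model.safetensors"
tensors = load_file(path)
tensors["gptq_bits"] = torch.tensor([4])
tensors["gptq_groupsize"] = torch.tensor([128])  # or [-1] for full-row groups
save_file(tensors, path)
```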
@0x1997 But the error happens when it looks for g_idx. This has nothing to do with groupsize=-1, does it? (It's always defined, no?) IIUC, groupsize=-1 simply means the group spans the full row. It should be relatively easy to add support back in, but I'm not sure this is the issue at hand.
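As a small illustration of that reading of groupsize (based on the GPTQ-for-LLaMa convention, not this PR's code):

```python
import torch

in_features, groupsize = 4096, 128
g_idx = torch.arange(in_features) // groupsize  # column i belongs to group i // groupsize
# groupsize = -1 is treated as a single group spanning the full row,
# i.e. the same as groupsize = in_features:
g_idx_full_row = torch.zeros(in_features, dtype=torch.long)
```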
@psinger CUDA version? Do you mind creating an issue with your full setup + stacktrace? It's not really possible to help without all the information (and without being able to reproduce).
@Narsil If I am building the docker image, my local CUDA version shouldn't matter, or will it still have an impact?
Here is approximately what I did:
install text-generation-server commands
text-generation-server quantize tiiuae/falcon-7b falcon-7b-gptq --trust-remote-code
docker build .
docker run --gpus device=1 --shm-size 1g -p 8080:80 -v /data:/data docker_image_id --model-id /data/falcon-7b-gptq --quantize gptq
@Narsil
Unfortunately I also can't get it to work on a fresh H100 system due to some mismatch.
I documented all the steps, please let me know if I am missing something
git clone https://github.com/huggingface/text-generation-inference.git
git checkout support_gptq
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
conda create -n text-generation-inference python=3.9
conda activate text-generation-inference
sudo apt-get install cargo
sudo apt-get install libssl-dev gcc -y
BUILD_EXTENSIONS=True make install
# ouch this takes some time...
cd server && make install-flash-attention
source ~/.bashrc
conda activate text-generation-inference
# somehow those were missing
pip install texttable
pip install einops
pip install datasets
conda install cudatoolkit
text-generation-server quantize tiiuae/falcon-7b ../data/falcon-7b-gptq --trust-remote-code
text-generation-launcher --model-id /home/user/models/falcon-7b-gptq/ --quantize gptq --trust-remote-code --port 8080
The error in this case is different from before:
python: /project/lib/Dialect/TritonGPU/Transforms/Combine.cpp:870: int {anonymous}::{anonymous}::computeCapabilityToMMAVersion(int): Assertion false && "computeCapability > 90 not supported" failed.
Seems to be some CUDA / package mismatch.
Hi, I wanted to try your implementation, but when I try to convert the model, I run into an issue:
text-generation-server quantize tiiuae/falcon-40b-instruct falcon-40b-instruct-gptq --trust-remote-code
Error:
File "/root/text-generation-inference/tgi_env/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/root/text-generation-inference/server/text_generation_server/cli.py", line 174, in quantize
quantize(
File "/root/text-generation-inference/server/text_generation_server/utils/gptq/quantize.py", line 782, in quantize
model = AutoModelForCausalLM.from_pretrained(
File "/root/text-generation-inference/tgi_env/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 479, in from_pretrained
return model_class.from_pretrained(
File "/root/text-generation-inference/tgi_env/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2801, in from_pretrained
max_memory = get_balanced_memory(
File "/root/text-generation-inference/tgi_env/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 490, in get_balanced_memory
per_gpu = module_sizes[""] // (num_devices - 1 if low_zero else num_devices)
ZeroDivisionError: integer division or modulo by zero
I think it is due to the fact that I only have one GPU. I will try with the change suggested by @psinger.
@jgcb00 yes, switching it to "auto" will fix it
please let me know if you manage to start the endpoint with that model afterwards
So it doesn't work after fixing it. Now I have:
Token indices sequence length is longer than the specified maximum sequence length for this model (2782307 > 2048). Running this sequence through the model will result in indexing errors
Starting ...
Traceback (most recent call last):
File "/root/text-generation-inference/tgi_env/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/root/text-generation-inference/server/text_generation_server/cli.py", line 174, in quantize
quantize(
File "/root/text-generation-inference/server/text_generation_server/utils/gptq/quantize.py", line 800, in quantize
quantizers = sequential(
File "/root/text-generation-inference/tgi_env/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/text-generation-inference/server/text_generation_server/utils/gptq/quantize.py", line 619, in sequential
inps = torch.zeros(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB (GPU 0; 79.15 GiB total capacity; 76.93 GiB already allocated; 1.26 GiB free; 76.93 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
But that's weird: do I need to fully load the model on my GPU to be able to quantize? Or is that why you used balanced_low_0, @Narsil?
This looks like A100? I managed to quantize a 7b model on a single GPU there, but 40b will be tough. That's why you probably need multiple GPUs with sharding then.
But I don't get it: we cannot convert the model, but then it will easily run on one single A100? I will try with device cpu to see if I can convert the model with that!
You need to fully load the model in float16 to quantize it. After quantization you will need way less memory. Quantizing it on CPU will be too slow I expect.
Yes, but at least it's running, 3/60 layers converted. Also, can we not specify the precision? Is it int4 by default? Can it be int8?
> @jgcb00 yes, switching it to "auto" will fix it
auto doesn't fix anything, since it will cram GPU 0 full, and then there's not enough room left to create the quantizations.
We need at least a single layer to load on the GPU for everything to work correctly. balanced_low_0 means we're keeping GPU 0 as free as possible so we can do everything we need there (and we do need the room).
The error you're seeing is imho an accelerate bug (at the very least the error message should be more descriptive), even if there might be a better balance. When using accelerate we want to load what we can, offload the rest to CPU RAM, and put whatever is left on disk.
I don't want to add support for handling disk offload atm since it requires more accelerate internals (and this PR is already huge; we need to fix that later, not here, this is not what this PR is about).
We could bypass accelerate, but that requires even more logic, which is out of scope for this PR imo.
auto definitely fixes it if the model fits on a single GPU - balanced_low_0 assumes there are multiple GPUs, so it makes sense to me that it breaks on a single-GPU machine. It will not offload to CPU afaik.
But I agree that auto is also not the best default, as larger models will require offloading. That's why I suggested this could be a parameter the user can set.
No. Adding flags everywhere is not good. If you know what you're doing you can edit the code yourself.
For the vast majority, we need to figure out the sanest defaults. In this case neither auto nor balanced_low_0 is quite right (actually maybe something like balanced_low_mem_0, if accelerate thinks that's ok to fix).