text-generation-inference
Inference support for GPTQ (llama + falcon tested) + Quantization script
Let's start discussing implementation.
- Need to expose the quantization scripts (either include them here or add docs on how to use https://github.com/qwopqwop200/GPTQ-for-LLaMa)
- Make sure GPTQ works for multiple models (priority to Falcon).
Currently this means that every place where we use get_{tensor|sharded} has to check for quantization.
My idea is to reintegrate as much as possible into utils/layers.py by expanding load_multi to be a bit more generic.
This might require some thinking, but ultimately qweight, qzeros, scales, and g_idx should live in a single place, independent of whether a bias is present.
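For illustration, a rough sketch of what that grouping could look like (the helper name and return shape are just illustrative, not the final API; it only assumes the existing Weights.get_tensor accessor, which raises RuntimeError for missing tensors):

```python
# Hypothetical helper, for illustration only: fetch all GPTQ tensors for a
# prefix in one place so layers.py doesn't have to special-case quantization.
def get_gptq_weights(weights, prefix: str):
    qweight = weights.get_tensor(f"{prefix}.qweight")
    qzeros = weights.get_tensor(f"{prefix}.qzeros")
    scales = weights.get_tensor(f"{prefix}.scales")
    g_idx = weights.get_tensor(f"{prefix}.g_idx")
    try:
        bias = weights.get_tensor(f"{prefix}.bias")  # bias is optional...
    except RuntimeError:
        bias = None  # ...and independent of the quantized tensors
    return (qweight, qzeros, scales, g_idx), bias
```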
What does this PR do?
Fixes # (issue)
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
Hi @Narsil amazing work!
I tried your branch with https://huggingface.co/huggingface/falcon-40b-gptq today and was able to get 44ms per token on 4 x A10G, as opposed to the 180ms per token I previously got with falcon-40b-instruct + bitsandbytes.
Wondering when this PR will be merged and whether you will be uploading a falcon-40b-instruct-gptq as well? I think many, including myself, don't have access to a GPU with enough VRAM to quantize a 40b model.
> Wondering when this PR will be merged and whether you will be uploading a falcon-40b-instruct-gptq as well? I think many, including myself, don't have access to a GPU with enough VRAM to quantize a 40b model.
It should technically fit even if you have a very simple GPU, as long as you can load a single layer on the GPU. I didn't go too extreme, but we're leveraging accelerate there, which should offload to CPU RAM and disk when needed, and the quantization is done layer by layer (you still need to fit the model in CPU RAM, though, plus the quantized weights, in order to be able to save them).
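For illustration, a minimal sketch of that layer-by-layer flow (not the actual quantize.py code; the GPTQ step itself is passed in as a hypothetical callback, and real decoder layers also need attention masks etc., elided here):

```python
import torch

@torch.no_grad()
def quantize_layer_by_layer(layers, inps, quantize_fn, device="cuda:0"):
    """Quantize decoder layers one at a time: only a single layer ever
    lives on the GPU, everything else stays in CPU RAM."""
    quantizers = {}
    for i in range(len(layers)):
        layer = layers[i].to(device)              # move just this layer to the GPU
        quantizers[i] = quantize_fn(layer, inps)  # hypothetical per-layer GPTQ step
        # run the calibration activations through the layer so they become
        # the inputs of the next layer
        inps = [layer(x.to(device)) for x in inps]
        layers[i] = layer.cpu()                   # free GPU memory before moving on
        torch.cuda.empty_cache()
    return quantizers
```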
Is the act-order option supported?
> Is the act-order option supported?
It exists in the code since everything is simply pulled in, but it's not exposed yet. Any good info on what act-order does and its implications? (If there's now a good default, we'll just use that.)
For now it's False by default like in https://github.com/qwopqwop200/GPTQ-for-LLaMa
@Narsil It's slower but has better accuracy. https://github.com/lm-sys/FastChat/blob/main/docs/gptq.md
@Narsil For higher speed up of LLaMA models, you can checkout the https://github.com/turboderp/exllama project. I tested it with two 13B models, both quantized with group size 128 and activation order enabled, and got 2.6x speed up over running the 16bit version with text-generation-inference.
> @Narsil For higher speed up of LLaMA models, you can checkout the https://github.com/turboderp/exllama project. I tested it with two 13B models, both quantized with group size 128 and activation order enabled, and got 2.6x speed up over running the 16bit version with text-generation-inference.
Looks pretty sweet indeed. Will have to dig a bit deeper to see how much we can reuse (everything seems super llama focused for now)
> Wondering when this PR will be merged and whether you will be uploading a falcon-40b-instruct-gptq as well? I think many, including myself, don't have access to a GPU with enough VRAM to quantize a 40b model.
>
> It should technically fit even if you have a very simple GPU, as long as you can load a single layer on the GPU. I didn't go too extreme, but we're leveraging accelerate there, which should offload to CPU RAM and disk when needed, and the quantization is done layer by layer (you still need to fit the model in CPU RAM, though, plus the quantized weights, in order to be able to save them).
Do you have any recommendation for running the quantizer on GPUs with smaller VRAM? I managed to load the 40b model with offloading, but it looks like the working memory required to run fasterquant is larger than what I could fit on my GPU 0. I even tried reducing nsamples to 64 but still got a CUDA OOM.
Try using the options here: https://huggingface.co/docs/accelerate/usage_guides/big_modeling
Notably device_map = infer_auto_device_map(my_model, max_memory={0: "10GiB", 1: "10GiB", "cpu": "30GiB"}) seems like a good option to reserve enough memory on GPU 0 (you could even say 0 for GPU 0, since we're manually sending data there).
Actually, making that the default might be better than balanced_low_0.
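For concreteness, a minimal sketch of that suggestion (the model id, dtype, and no_split_module_classes value are illustrative assumptions, not something this PR prescribes):

```python
import torch
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "tiiuae/falcon-40b"  # example checkpoint, adjust to yours

# Build an empty (meta) model just to compute the placement plan.
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

# Cap GPU 0 so there is headroom left for the quantization working buffers;
# whatever doesn't fit spills to the next GPU and then to CPU RAM.
device_map = infer_auto_device_map(
    empty_model,
    max_memory={0: "10GiB", 1: "10GiB", "cpu": "30GiB"},
    no_split_module_classes=["DecoderLayer"],  # assumed Falcon block name
    dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=device_map,
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
```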
> Try using the options here: https://huggingface.co/docs/accelerate/usage_guides/big_modeling
> Notably device_map = infer_auto_device_map(my_model, max_memory={0: "10GiB", 1: "10GiB", "cpu": "30GiB"}) seems like a good option to reserve enough memory on GPU 0 (you could even say 0 for GPU 0, since we're manually sending data there). Actually, making that the default might be better than balanced_low_0.
Thanks, I actually tried that with GPU 0's max memory set to 2GB, but still no luck. Perhaps the working memory required to run quantization is more than 22GB. Another thing I noticed is that the line
layer = layers[i].to(dev)
can lead to the error "NotImplementedError: Cannot copy out of meta tensor; no data", likely because some layers are offloaded?
That means the layer wasn't loaded at all, probably disk-offloaded. I'm not familiar enough with accelerate internals, but there must be some way to fetch the information of where the weights live, so that you could fill them in before sending them to the GPU.
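A hypothetical way to do that lookup, assuming the model was loaded via from_pretrained with a device_map (which populates hf_device_map; the transformer.h prefix is Falcon-specific and just an example):

```python
def layer_devices(model, i: int) -> dict:
    # Where did accelerate place decoder layer i? Keys look like "transformer.h.0".
    return {
        name: device
        for name, device in model.hf_device_map.items()
        if name.startswith(f"transformer.h.{i}")
    }

# Before calling layers[i].to(dev): if any entry reports "disk", the layer's
# parameters are meta tensors with no data, so they would have to be reloaded
# from the offload folder first rather than moved with .to().
```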
> That means the layer wasn't loaded at all, probably disk-offloaded. I'm not familiar enough with accelerate internals, but there must be some way to fetch the information of where the weights live, so that you could fill them in before sending them to the GPU.
I see. Thanks! Let me dig a bit more to figure that out
Trying out the branch locally and running into this when trying to query the endpoint:
...
File "/home/user/text-generation-inference/server/text_generation_server/utils/gptq/quant_linear.py", line 244, in forward
output = matmul248(input, qweight, scales, qzeros, g_idx, bits, maxq)
File "/home/user/text-generation-inference/server/text_generation_server/utils/gptq/quant_linear.py", line 216, in matmul248
matmul_248_kernel[grid](
File "/home/user/text-generation-inference/server/text_generation_server/utils/gptq/custom_autotune.py", line 110, in run
timings = {
File "/home/user/text-generation-inference/server/text_generation_server/utils/gptq/custom_autotune.py", line 111, in <dictcomp>
config: self._bench(*args, config=config, **kwargs)
File "/home/user/text-generation-inference/server/text_generation_server/utils/gptq/custom_autotune.py", line 90, in _bench
return triton.testing.do_bench(
File "/home/user/miniconda3/envs/text-generation-inference/lib/python3.9/site-packages/triton/testing.py", line 144, in do_bench
torch.cuda.synchronize()
File "/home/user/miniconda3/envs/text-generation-inference/lib/python3.9/site-packages/torch/cuda/__init__.py", line 688, in synchronize
return torch._C._cuda_synchronize()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
rank=0
2023-06-19T17:52:01.967782Z ERROR batch{batch_size=1}:prefill:prefill{id=0 size=1}:prefill{id=0 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: CUDA error: an illegal memory access was encountered
Any idea? Trying it on a falcon-7b model I quantized using the script in this PR.
Running it with:
CUDA_VISIBLE_DEVICES=0 text-generation-launcher --model-id local_model_path --port 8080 --quantize gptq --sharded false
I got the following error when loading a model quantized with auto-gptq
{"timestamp":"2023-06-20T02:07:57.613035Z","level":"ERROR","fields":{"message":"Shard 0 failed to start:\nTraceback (most recent call last):\n\n File "/opt/conda/bin/text-generation-server", line 8, in
\n sys.exit(app())\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 67, in serve\n server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve\n asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))\n\n File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run\n return loop.run_until_complete(main)\n\n File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete\n return future.result()\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 124, in serve_inner\n model = get_model(model_id, revision, sharded, quantize, trust_remote_code)\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/init.py", line 178, in get_model\n return FlashLlama(\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 64, in init\n model = FlashLlamaForCausalLM(config, weights)\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 417, in init\n self.model = FlashLlamaModel(config, weights)\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 316, in init\n [\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 317, in \n FlashLlamaLayer(\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 248, in init\n self.self_attn = FlashLlamaAttention(\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 112, in init\n self.query_key_value = TensorParallelColumnLinear.load_multi(\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 226, in load_multi\n weight = weights.get_multi_weights_col(\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 95, in get_multi_weights_col\n w = [self.get_tensor(f"{p}.g_idx") for p in prefixes]\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 95, in \n w = [self.get_tensor(f"{p}.g_idx") for p in prefixes]\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 47, in get_tensor\n filename = self.get_filename(tensor_name)\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 34, in get_filename\n raise RuntimeError(f"weight {tensor_name} does not exist")\n\nRuntimeError: weight model.layers.0.self_attn.q_proj.g_idx does not exist\n\n"},"target":"text_generation_launcher"} {"timestamp":"2023-06-20T02:07:57.613081Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
Is it possible to support these models without requantization?
@psinger
Illegal access seems like a triton bug. Which GPU are you using? If it's an old one, triton might be generating invalid kernels.
@0x1997 The absence of g_idx means the checkpoint is lacking that weight. I have no idea what they called it instead.
Please note that we also modified our copy of https://github.com/qwopqwop200/GPTQ-for-LLaMa to include bits and groupsize directly in the safetensors file, so that the user doesn't need to know them in advance and specify them at load time.
I don't think either method does it that way.
@Narsil I am using A100, so not old.
I tried running it via local install and conda environment and via docker image to make sure it is not an environment issue, but same issue happens in both.
@Narsil The model was quantized with groupsize=-1. gptq_bits and gptq_groupsize are easy to patch into the model files, but I don't know if gptq_groupsize=[-1] will be handled correctly.
Another difference is weights like model.layers.0.mlp.down_proj.bias are missing when quantized with the script in this PR but present when quantized by auto-gptq.
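For reference, a sketch of the kind of patch being described, assuming the tensor names gptq_bits / gptq_groupsize mentioned above and a single-file safetensors checkpoint (whether groupsize=-1 is then handled correctly is exactly the open question):

```python
import torch
from safetensors.torch import load_file, save_file

# Patch an existing auto-gptq checkpoint so it carries the quantization
# parameters this branch reads from the file itself. Values must match how
# the model was actually quantized.
path = "model.safetensors"
tensors = load_file(path)
tensors["gptq_bits"] = torch.tensor([4])
tensors["gptq_groupsize"] = torch.tensor([128])  # or [-1] for full-row groups
save_file(tensors, path)
```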
@0x1997 But the error happens when it looks for g_idx. This has nothing to do with groupsize=-1, does it? (It's always defined, no?) IIUC, groupsize=-1 simply means the group spans the full row. It should be relatively easy to add support back in, but I'm not sure this is the issue at hand.
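As a small illustration of that reading of groupsize (based on the GPTQ-for-LLaMa convention, not this PR's code):

```python
import torch

in_features, groupsize = 4096, 128
g_idx = torch.arange(in_features) // groupsize  # column i belongs to group i // groupsize
# groupsize = -1 is treated as a single group spanning the full row,
# i.e. the same as groupsize = in_features:
g_idx_full_row = torch.zeros(in_features, dtype=torch.long)
```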
@psinger CUDA version? Do you mind creating an issue with your full setup + stacktrace? It's not really possible to help without all the information (and without being able to reproduce).
@Narsil If I am building the docker image, my local CUDA version shouldn't matter, or will it still have an impact?
Here is approximately what I did:
install text-generation-server commands
text-generation-server quantize tiiuae/falcon-7b falcon-7b-gptq --trust-remote-code
docker build .
docker run --gpus device=1 --shm-size 1g -p 8080:80 -v /data:/data docker_image_id --model-id /data/falcon-7b-gptq --quantize gptq
@Narsil
Unfortunately I also can't get it to work on a fresh H100 system due to some mismatch.
I documented all the steps, please let me know if I am missing something
git clone https://github.com/huggingface/text-generation-inference.git
git checkout support_gptq
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
conda create -n text-generation-inference python=3.9
conda activate text-generation-inference
sudo apt-get install cargo
sudo apt-get install libssl-dev gcc -y
BUILD_EXTENSIONS=True make install
# ouch this takes some time...
cd server && make install-flash-attention
source ~/.bashrc
conda activate text-generation-inference
# somehow those were missing
pip install texttable
pip install einops
pip install datasets
conda install cudatoolkit
text-generation-server quantize tiiuae/falcon-7b ../data/falcon-7b-gptq --trust-remote-code
text-generation-launcher --model-id /home/user/models/falcon-7b-gptq/ --quantize gptq --trust-remote-code --port 8080
The error in this case is different from before:
python: /project/lib/Dialect/TritonGPU/Transforms/Combine.cpp:870: int {anonymous}::{anonymous}::computeCapabilityToMMAVersion(int): Assertion false && "computeCapability > 90 not supported" failed.
Seems to be some CUDA / package mismatch.
Hi, I wanted to try your implementation, but when I try to convert the model, I run into an issue:
text-generation-server quantize tiiuae/falcon-40b-instruct falcon-40b-instruct-gptq --trust-remote-code
Error:
File "/root/text-generation-inference/tgi_env/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/root/text-generation-inference/server/text_generation_server/cli.py", line 174, in quantize
quantize(
File "/root/text-generation-inference/server/text_generation_server/utils/gptq/quantize.py", line 782, in quantize
model = AutoModelForCausalLM.from_pretrained(
File "/root/text-generation-inference/tgi_env/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 479, in from_pretrained
return model_class.from_pretrained(
File "/root/text-generation-inference/tgi_env/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2801, in from_pretrained
max_memory = get_balanced_memory(
File "/root/text-generation-inference/tgi_env/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 490, in get_balanced_memory
per_gpu = module_sizes[""] // (num_devices - 1 if low_zero else num_devices)
ZeroDivisionError: integer division or modulo by zero
I think it is due to the fact that I only have one GPU. I will try with the change suggested by @psinger.
@jgcb00 yes, switching it to "auto" will fix it
please let me know if you manage to start the endpoint with that model afterwards
So it doesn't work after fixing it. Now I have:
Token indices sequence length is longer than the specified maximum sequence length for this model (2782307 > 2048). Running this sequence through the model will result in indexing errors
Starting ...
Traceback (most recent call last):
File "/root/text-generation-inference/tgi_env/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/root/text-generation-inference/server/text_generation_server/cli.py", line 174, in quantize
quantize(
File "/root/text-generation-inference/server/text_generation_server/utils/gptq/quantize.py", line 800, in quantize
quantizers = sequential(
File "/root/text-generation-inference/tgi_env/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/text-generation-inference/server/text_generation_server/utils/gptq/quantize.py", line 619, in sequential
inps = torch.zeros(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB (GPU 0; 79.15 GiB total capacity; 76.93 GiB already allocated; 1.26 GiB free; 76.93 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
But that's weird: do I need to fully load the model on my GPU to be able to quantize? Or is that why you used balanced_low_0, @Narsil?
This looks like A100? I managed to quantize a 7b model on a single GPU there, but 40b will be tough. That's why you probably need multiple GPUs with sharding then.
But I don't get it: we cannot convert the model, but then it will easily run on one single A100? I will try with device cpu to see if I can convert the model with that!
You need to fully load the model in float16 to quantize it. After quantization you will need way less memory. Quantizing it on CPU will be too slow I expect.
Yes, but at least it's running, 3/60 layers converted. Also, can we not specify the precision? Is it int4 by default? Can it be int8?
> @jgcb00 yes, switching it to "auto" will fix it
auto doesn't fix anything, since it will cram GPU 0 full, and then there's not enough room left to create the quantizations.
We need at least a single layer to load on the GPU for everything to work correctly. balanced_low_0 means we're keeping GPU 0 as free as possible so we can do everything we need there (and we do need the room).
The error you're seeing is imho an accelerate bug (at the very least the error message should be more descriptive), even if there might be a better balance. When using accelerate we want to load what we can, offload the rest to CPU RAM, and put whatever is left on disk.
I don't want to add support for handling disk offload atm since it requires more accelerate internals (and this PR is already huge; we need to fix that later, not here, this is not what this PR is about).
We could bypass accelerate, but that requires even more logic, which is out of scope for this PR imo.
auto definitely fixes it if the model fits on a single GPU - balanced_low_0 assumes there are multiple GPUs, so it makes sense to me that it breaks on a single-GPU machine. It will not offload to CPU afaik.
But I agree that auto is also not the best default, as larger models will require offloading. That's why I suggested this could be a parameter the user can set.
No. Adding flags everywhere is not good. If you know what you're doing you can edit the code yourself.
For the vast majority, we need to figure out the sanest defaults. In this case neither auto nor balanced_low_0 is quite right (actually maybe something like balanced_low_mem_0, if accelerate thinks that's ok to fix).