
Unable to load GPTQ weights

Open ssmi153 opened this issue 2 years ago • 5 comments

System Info

Hi, I'm using the latest version of text-generation-inference (image sha-ae466a8) on Runpod via Docker. When I try to load a GPTQ file from local disk with QUANTIZE = gptq, I get the following trace:

2023-06-28T07:58:54.412515423-04:00 {"timestamp":"2023-06-28T11:58:54.412338Z","level":"ERROR","fields":{"message":"Error when initializing model\nTraceback (most recent call last):\n File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n sys.exit(app())\n File \"/opt/conda/lib/python3.9/site-packages/typer/main.py\", line 311, in __call__\n return get_command(self)(*args, **kwargs)\n File \"/opt/conda/lib/python3.9/site-packages/click/core.py\", line 1130, in __call__\n return self.main(*args, **kwargs)\n File \"/opt/conda/lib/python3.9/site-packages/typer/core.py\", line 778, in main\n return _main(\n File \"/opt/conda/lib/python3.9/site-packages/typer/core.py\", line 216, in _main\n rv = self.invoke(ctx)\n File \"/opt/conda/lib/python3.9/site-packages/click/core.py\", line 1657, in invoke\n return _process_result(sub_ctx.command.invoke(sub_ctx))\n File \"/opt/conda/lib/python3.9/site-packages/click/core.py\", line 1404, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File \"/opt/conda/lib/python3.9/site-packages/click/core.py\", line 760, in invoke\n return __callback(*args, **kwargs)\n File \"/opt/conda/lib/python3.9/site-packages/typer/main.py\", line 683, in wrapper\n return callback(**use_params) # type: ignore\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py\", line 67, in serve\n server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 155, in serve\n asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))\n File \"/opt/conda/lib/python3.9/asyncio/runners.py\", line 44, in run\n return loop.run_until_complete(main)\n File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 634, in run_until_complete\n self.run_forever()\n File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 601, in run_forever\n self._run_once()\n File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 1905, in _run_once\n handle._run()\n File \"/opt/conda/lib/python3.9/asyncio/events.py\", line 80, in _run\n self._context.run(self._callback, *self._args)\n> File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 124, in serve_inner\n model = get_model(model_id, revision, sharded, quantize, trust_remote_code)\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py\", line 185, in get_model\n return FlashLlama(\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py\", line 61, in __init__\n weights = Weights(filenames, device, dtype, process_group=self.process_group)\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py\", line 11, in __init__\n with safe_open(filename, framework=\"pytorch\") as f:\nFileNotFoundError: No such file or directory: \"/workspace/models/TheBloke_airoboros-33B-gpt4-1-4-GPTQ/airoboros-33b-gpt4-1-4-GPTQ-4bit-1g-act-order.safetensors\"\n"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}

No such file or directory: "/workspace/models/TheBloke_airoboros-33B-gpt4-1-4-GPTQ/airoboros-33b-gpt4-1-4-GPTQ-4bit-1g-act-order.safetensors"

I can confirm that the file is in fact present at that location.

Loading non-GPTQ safetensors files (without QUANTIZE = gptq) works fine, so this seems specific to the GPTQ implementation.

The GPTQ files are from this repository: https://huggingface.co/TheBloke/airoboros-33B-gpt4-1.4-GPTQ . Note: I've tried a few other GPTQ files and they don't load either. The file name won't exactly match the one on Hugging Face because it originally contained more full stops, which I thought were causing the issue; as you can see, I renamed the file to simplify it and I'm still getting the error.

I'm running this on an A100 80GB, attached to network storage (where the gptq files are located). I've tried it with multiple instance types, and I don't think it's a Runpod issue.

Here's a screenshot of the settings I'm running with: (screenshot not reproduced here)

Information

  • [X] Docker
  • [ ] The CLI directly

Tasks

  • [X] An officially supported command
  • [ ] My own modifications

Reproduction

  1. Create a new Runpod Pod using the latest text-generation-inference docker image: ghcr.io/huggingface/text-generation-inference:latest
  2. Configure the pod to load a GPTQ repository
  3. When it tries to load, you'll see that it fails and goes into an infinite retry loop, shutting down the shard and restarting repeatedly.

Expected behavior

The GPTQ file should load and the server should start.

ssmi153 avatar Jun 28 '23 12:06 ssmi153

First of all, this file might fail to load regardless, because this repo pushes gptq_bits and gptq_groupsize into the file itself so it can know what kind of quantization took place. I'm not sure TheBloke's files have that.
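As a quick sanity check, here is a minimal sketch (my own illustration, assuming the safetensors and torch Python packages are installed; the file path is a placeholder) that inspects a checkpoint for those two tensors:

```python
# Minimal sketch: check whether a GPTQ safetensors file carries the
# gptq_bits / gptq_groupsize tensors that TGI expects for GPTQ loading.
# The path is a placeholder; point it at your actual checkpoint.
from safetensors import safe_open

path = "/workspace/models/some-gptq-model/model.safetensors"  # placeholder

with safe_open(path, framework="pt") as f:
    keys = set(f.keys())
    for name in ("gptq_bits", "gptq_groupsize"):
        if name in keys:
            print(name, "=", f.get_tensor(name))
        else:
            print(name, "is missing, so TGI will not be able to load this file as GPTQ")
```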

And I highly doubt the FileNotFound error is wrong; it must be the correct error, and the file really isn't being found. If the file is there, maybe Docker isn't executing in the same environment? Maybe there's a tiny typo.

Narsil avatar Jun 28 '23 12:06 Narsil

To clarify, I'm just giving it the repo folder name, and it's determining the file name airoboros-33b-gpt4-1-4-GPTQ-4bit-1g-act-order.safetensors by itself (and correctly doing so), so it clearly has access to the disk and to the file. Can you suggest some alternative GPTQ files or repos that work for you that I could test with? Seeing whether those load for me might narrow down the problem.

ssmi153 avatar Jun 28 '23 12:06 ssmi153

This should work: https://huggingface.co/huggingface/falcon-40b-gptq

Narsil avatar Jun 28 '23 14:06 Narsil

Oh dear, now I get the dreaded "You are using a model of type RefinedWeb to instantiate a model of type ." error, AND the No such file or directory error...

2023-06-28T11:03:17.918954322-04:00 {"timestamp":"2023-06-28T15:03:17.918720Z","level":"ERROR","fields":{"message":"Shard 0 failed to start:\nYou are using a model of type RefinedWeb to instantiate a model of type . This is not supported for all configurations of models and can yield errors.\nTraceback (most recent call last):\n\n File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n sys.exit(app())\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py\", line 67, in serve\n server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 155, in serve\n asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))\n\n File \"/opt/conda/lib/python3.9/asyncio/runners.py\", line 44, in run\n return loop.run_until_complete(main)\n\n File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 647, in run_until_complete\n return future.result()\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 124, in serve_inner\n model = get_model(model_id, revision, sharded, quantize, trust_remote_code)\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py\", line 220, in get_model\n return FlashRWSharded(\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_rw.py\", line 51, in __init__\n weights = Weights(filenames, device, dtype, process_group=self.process_group)\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py\", line 11, in __init__\n with safe_open(filename, framework=\"pytorch\") as f:\n\nFileNotFoundError: No such file or directory: "/workspace/models/huggingface_falcon-40b-gptq/model-00001-of-00003.safetensors"\n\n"},"target":"text_generation_launcher"}`

(Once again, loading from a networked disk.)

ssmi153 avatar Jun 28 '23 15:06 ssmi153

Try a non-network disk? "No such file or directory" means your network disk somehow said the file didn't exist...

Narsil avatar Jun 30 '23 06:06 Narsil

A non-network disk worked, thanks. The whole thing is very strange... It definitely has access to the disk and to the model. Anyway, for the moment I can just use directly attached disks. Thanks for your help!

(My next issue now is that I can't get TGI to download from private HuggingFace repos to get the models onto those directly attached disks, but I'll open that as a separate issue.)

ssmi153 avatar Jul 01 '23 16:07 ssmi153

HUGGING_FACE_HUB_TOKEN needs to be set to a valid token.

Narsil avatar Jul 02 '23 16:07 Narsil

Thanks @Narsil. Your comment helped me figure out what I was doing wrong: I was using HUGGINGFACE_HUB_TOKEN rather than HUGGING_FACE_HUB_TOKEN. Switching to HUGGING_FACE_HUB_TOKEN works. I'll also close the second issue with this outcome.
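For anyone hitting the same thing, here is a hedged sketch (my own illustration, not part of TGI; the repo id and target directory are placeholders) of pre-downloading a private repo onto a directly attached disk using that same token, so TGI can then load it from local files:

```python
# Sketch: pre-download a private Hugging Face repo to a local directory using
# the token from HUGGING_FACE_HUB_TOKEN (note the env var name).
# repo_id and local_dir are placeholders.
import os
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="my-org/my-private-gptq-model",               # placeholder
    token=os.environ["HUGGING_FACE_HUB_TOKEN"],
    local_dir="/workspace/models/my-private-gptq-model",  # placeholder
)
```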

ssmi153 avatar Jul 03 '23 01:07 ssmi153

I have tried loading multiple quantized models (generated using GPTQ-for-LLaMa and AutoGPTQ) from different developers (including TheBloke).

This is the error I received most often:

RuntimeError: weight gptq_bits does not exist

With some models, I received:

RuntimeError: weight model.layers.0.self_attn.q_proj.g_idx does not exist

@ssmi153 Have you managed to run GPTQ models?

GemsFord avatar Jul 10 '23 11:07 GemsFord

@GemsFord, TGI doesn't support any of those quantized models because it uses its own quantization script, which injects additional metadata into the model files. I've got a lot of respect for the developers, but I'm not a fan of this design choice. To quantize your own files you theoretically just run text-generation-server quantize [source-model-id] [target-folder] from the CLI. However, I recently tried this and ran into a number of issues: https://github.com/huggingface/text-generation-inference/issues/576 . The devs are very responsive, so hopefully they'll work out a better plan here.

Out of interest, I think this is the line in the code which causes the model loading problems: https://github.com/huggingface/text-generation-inference/blob/b4024edd4549ab647b02b8619b4072e33e64f1f9/server/text_generation_server/utils/weights.py#L123
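For context, that lookup is roughly of this shape (a paraphrase for illustration, not the exact TGI source): the Weights helper maps every tensor name to the shard that contains it, and asking for a name that no shard provides (e.g. gptq_bits) raises immediately:

```python
# Rough paraphrase for illustration (not the exact TGI source): tensor names are
# routed to the safetensors shard that contains them; a missing name raises the
# "weight ... does not exist" error seen above.
from safetensors import safe_open

class Weights:
    def __init__(self, filenames, framework="pt"):
        self.routing = {}
        for filename in filenames:
            with safe_open(filename, framework=framework) as f:
                for name in f.keys():
                    self.routing[name] = filename

    def get_tensor(self, tensor_name):
        filename = self.routing.get(tensor_name)
        if filename is None:
            raise RuntimeError(f"weight {tensor_name} does not exist")
        with safe_open(filename, framework="pt") as f:
            return f.get_tensor(tensor_name)
```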

ssmi153 avatar Jul 10 '23 13:07 ssmi153

Thanks @ssmi153, that's very clear.

GemsFord avatar Jul 10 '23 16:07 GemsFord

which injects additional metadata into the model files.

Do you have a way to detect the number of bits and the groupsize at inference time that doesn't require users to know this information ahead of time?

The idea was to NOT require users to guess things (and to avoid having to pass a bunch of flags everywhere).

Adding gptq_bits and gptq_groupsize to existing checkpoints should be rather easy.
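As an illustration of what that could look like (my own sketch, not an official script; the file path and the 4-bit / groupsize-128 values are assumptions, and the exact dtype and shape TGI expects may differ), something along these lines adds the two tensors to an existing checkpoint:

```python
# Sketch: add gptq_bits / gptq_groupsize tensors to an existing GPTQ safetensors
# checkpoint. The path and the values (4 bits, groupsize 128) are assumptions;
# use whatever the checkpoint was actually quantized with.
import torch
from safetensors.torch import load_file, save_file

path = "/workspace/models/some-gptq-model/model.safetensors"  # placeholder

tensors = load_file(path)
tensors["gptq_bits"] = torch.tensor([4], dtype=torch.int64)
tensors["gptq_groupsize"] = torch.tensor([128], dtype=torch.int64)
save_file(tensors, path)
```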

Narsil avatar Jul 10 '23 18:07 Narsil

We could add flags again to allow reusing those checkpoints, but I honestly don't like it long term (every user has to remember to specify the flags, go by the model weights' name, and hope the actual values are listed somewhere discoverable).

@OlivierDehaene Wdyt ?

Narsil avatar Jul 10 '23 18:07 Narsil

Please also allow passing this info through flags. Users like me just want to use quantized models that are already available; we can look up the actual values in the example code the model authors provide.

GemsFord avatar Jul 10 '23 19:07 GemsFord

Just created a PR for it.

I don't really like maintaining weird, out-of-flow things in general. A lot of places might now have to be careful about this, and it isn't easy to know this can happen if you're not familiar with the codebase. It doesn't show right now because everything just got refactored, so there are only a few places where this shows up, but given all the new quantization techniques popping up, I'm afraid it won't stay so clean for long.

But here it's an exception, since there are really a lot of models out there that could benefit from this by becoming more easily usable (no need to modify the weights).

Note: the TGI quantize script exists because it aims to support every model we support out of the box. There are definitely some quirks with it at the moment; I'll try and fix them.

Narsil avatar Jul 11 '23 08:07 Narsil

Fantastic, thanks @Narsil !

ssmi153 avatar Jul 11 '23 09:07 ssmi153