
[Feature Request] GPTQ support

Open the-xentropy opened this issue 2 years ago • 2 comments

Is your feature request related to a problem? Please describe. VRAM is a major limitation for running most models locally, and guidance by design requires running models locally to get the most value out of the library. Hence, guidance is in a position to really benefit from supporting quantization approaches - GPTQ is the one in question for this feature request.

Describe the solution you'd like There are numerous very interesting quantization libraries. For example:

  • GPTQ
  • GGML / ctransformers
  • (new) SPQR

We already have a pull request for GGML through llama-cpp-python, but we lack equivalent support for GPTQ.
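
For context, the GGML route through llama-cpp-python already looks roughly like this on its own (a minimal, untested sketch; the model path and generation arguments are placeholders, and the guidance wiring will depend on what that PR exposes):

from llama_cpp import Llama

# Load a locally downloaded GGML-quantized model (path is a placeholder).
llm = Llama(model_path="./wizard-vicuna-13b.ggmlv3.q4_0.bin", n_ctx=2048, n_gpu_layers=32)

out = llm("Q: Why does quantization reduce VRAM usage? A:", max_tokens=64)
print(out["choices"][0]["text"])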

GPTQ is actually very easy to get working locally already. For example:

!git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git --depth=1 gptq
import os.path, sys
import transformers
sys.path.append(os.path.realpath("./gptq") + "/")
import llama_inference
# the oobabooga fork has a bug in llama_inference.py that references transformers before importing it; dumb but easy fix
llama_inference.transformers = transformers
model = llama_inference.load_quant(
    "TheBloke/Wizard-Vicuna-13B-Uncensored-GPTQ",
    "Wizard-Vicuna-13B-Uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors",
    4, 128,  # wbits, group size
).to("cuda:0")
tokenizer = transformers.LlamaTokenizer.from_pretrained("TheBloke/Wizard-Vicuna-13B-Uncensored-GPTQ")

Then, as usual:

import guidance
guidance.llm = guidance.llms.transformers.Vicuna(model=model, tokenizer=tokenizer, device="cuda:0")

However, the above is obviously a very hack-y proof of concept, and we might benefit from a more standardized/robust means of supporting GPTQ. Given the obvious value of being able to load larger models with less VRAM, I think there's a good case to be made for offering some helpers to load GPTQ models more conveniently.

Describe alternatives you've considered We have the option of using oobabooga's fork, or the more directly upstream repo at https://github.com/qwopqwop200/GPTQ-for-LLaMa . Personally, I would opt for oobabooga's fork because it sees less churn and fewer experimental changes, and because it is a key dependency of the extremely popular oobabooga text webui, it is likely to prioritize staying functional.

the-xentropy avatar Jun 07 '23 14:06 the-xentropy

Yes please, this would be great. This would allow loading up to 30B models on a consumer GPU with 24 GB of VRAM.

Also, instead of GPTQ-for-LLaMa, please use AutoGPTQ ( https://github.com/PanQiWei/AutoGPTQ ), as it is a higher-level library and is literally just one line to import:

from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    model_basename=model_basename,
    device="cuda:0",
    use_safetensors=True,
    use_triton=False,
    use_cuda_fp16=False,
)
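
From there it should just be the usual two lines to hook it into guidance (untested sketch; assumes the AutoGPTQ model is accepted by the transformers adapter and that the tokenizer lives in the same repo/directory):

from transformers import AutoTokenizer
import guidance

# model and quantized_model_dir as in the block above
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=True)
guidance.llm = guidance.llms.Transformers(model=model, tokenizer=tokenizer, device="cuda:0")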

LoopControl avatar Jun 08 '23 02:06 LoopControl

@the-xentropy AutoGPTQ and GPTQ-for-LLaMa are already supported through the transformers adapter. I can't see why using a dependency is a "hack-y proof of concept". If there's something to be done, it's on the GPTQ-for-LLaMa side: ask them to properly package it as a library.

knoopx avatar Jun 08 '23 13:06 knoopx

Hack-y as in "I dynamically altered the imports of a CLI tool to fix a bug in it and make it usable as a library" :)

They're definitely compatible API-wise if you're willing to spend time on it and hunt down weird behaviors, but whether it's convenient or a supported use case is a different matter. Unfortunately, the current state of compatibility and ease of use leaves a bit to be desired.

Examples:

For both AutoGPTQ and GPTQ-for-LLaMa, the TheBloke/guanaco-33B-GPTQ model will only ever generate one token when used with guidance, and turning off acceleration, streaming, token healing, etc. seems to make no difference. Other models seem to work fine - but why? No clue.
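
For reference, the failing call is nothing exotic - roughly this pattern (untested sketch; model and tokenizer loaded for TheBloke/guanaco-33B-GPTQ as in the earlier snippets):

import guidance

# model and tokenizer loaded via AutoGPTQ or GPTQ-for-LLaMa as shown above
guidance.llm = guidance.llms.Transformers(model=model, tokenizer=tokenizer, device="cuda:0")

program = guidance("Write a short poem about VRAM: {{gen 'poem' max_tokens=64}}")
print(program())  # with guanaco-33B-GPTQ this stops after a single token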

Clearly this has to be fixed somewhere in guidance, since the problem only shows up with this library, but until there's a decision on how to support GPTQ (and whether that's even desirable), reports like this are really tricky to triage and (imo) make more sense to think of as edge cases or milestones toward proper GPTQ support.

When loading TheBloke/guanaco-33B-GPTQ, AutoGPTQ uses 22 GB of VRAM for a 33B model in 4-bit mode.

GPTQ-for-LLaMa takes much longer to load but uses only 16 GB of VRAM, and it requires you to manually download the model since it can't load from the Hugging Face Hub directly.
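
One small mitigation for the manual-download part (a sketch using huggingface_hub, reusing the repo and filename from the first snippet in this thread as an example):

from huggingface_hub import hf_hub_download

# Fetch just the quantized checkpoint; the returned local path can then be
# handed to GPTQ-for-LLaMa's load_quant() instead of a manually downloaded file.
checkpoint_path = hf_hub_download(
    repo_id="TheBloke/Wizard-Vicuna-13B-Uncensored-GPTQ",
    filename="Wizard-Vicuna-13B-Uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors",
)
print(checkpoint_path)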

If they're both otherwise mostly compatible, do we want to document which GPTQ 'functionality provider' is better suited for use with guidance? This isn't really a technical decision, but GPTQ is a critical feature for the hobbyist crowd, who are extra reliant on ease-of-use features and approachable onboarding since most of them aren't ML experts, so guidance could benefit from taking a stance and documenting how to do it in order to boost popularity.

In either case, GPTQ support is neither at zero nor at a hundred percent, so until there's a decision on the direction to take it in, I think the most helpful thing we can do is keep documenting what works and what doesn't, both for each other's sake and as potential bug fixes to track long-term.

the-xentropy avatar Jun 17 '23 19:06 the-xentropy

I'd love to contribute to adding GPTQ support, although I may need some guidance on where to get started (has someone already tried, and is there a WIP branch I could continue?), any known gotchas to avoid, acceptance criteria, etc.

Is there a reason @LoopControl 's suggestion wouldn't work as follows?

  from transformers import AutoTokenizer, AutoModelForCausalLM
+ from auto_gptq import AutoGPTQForCausalLM

  tokenizer = AutoTokenizer.from_pretrained(model_name)
- model = AutoModelForCausalLM.from_pretrained(model_name)
+ model = AutoGPTQForCausalLM.from_quantized(
+     quantized_model_dir,
+     model_basename=model_basename,
+     device="cuda:0",
+     use_safetensors=True,
+     use_triton=False,
+     use_cuda_fp16=False,
+ )

  llm = guidance.llms.Transformers(model, tokenizer)

(Normally I'd test instead of asking, but I'm unable to test tonight. I might be able to try later this week.)

Or is the acceptance criterion actually to support GPTQ-for-LLaMa specifically?

Or is the acceptance criterion the following?

guidance.llm = guidance.llms.AutoGPTQ(
    quantized_model_dir,
    model_basename=model_basename,
    device="cuda:0",
    use_safetensors=True,
    use_triton=False,
    use_cuda_fp16=False,
)

If the latter, should I extend the Transformers class? If so, I'm not sure how to support the specialized transformer classes such as guidance.llms.transformers.LLaMA(...) without duplicating their files -- any ideas?
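
For what it's worth, here's a rough, untested sketch of what the subclass route might look like (it assumes guidance.llms.Transformers accepts a preloaded model and tokenizer, as in the examples above, and mirrors the AutoGPTQ arguments from earlier in the thread):

import transformers
import guidance
from auto_gptq import AutoGPTQForCausalLM

class AutoGPTQ(guidance.llms.Transformers):
    # Load a GPTQ-quantized model via AutoGPTQ, then defer everything else
    # to the existing Transformers adapter.
    def __init__(self, quantized_model_dir, model_basename=None, device="cuda:0", **kwargs):
        model = AutoGPTQForCausalLM.from_quantized(
            quantized_model_dir,
            model_basename=model_basename,
            device=device,
            use_safetensors=True,
        )
        tokenizer = transformers.AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=True)
        super().__init__(model=model, tokenizer=tokenizer, device=device, **kwargs)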

Disclaimer: I'm not very familiar with Python, Hugging Face transformers, etc., so apologies in advance for any silly questions.

Glavin001 avatar Jul 11 '23 05:07 Glavin001

So, it doesn't seem like much is technically required to implement this, but it is kinda hacky right now, at least when I try with a Mixtral GPTQ model.

If I just do:

models.Transformers("mixtral-path")

I get an error that bf16 isn't supported on CPU, of course. But I don't want it to ever go to my CPU.

So I can do this instead:

from transformers import AutoModelForCausalLM, AutoTokenizer
from guidance import models

model = AutoModelForCausalLM.from_pretrained(MODEL_PATH,
                                             device_map="auto",
                                             trust_remote_code=False,
                                             revision=REV)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, use_fast=True)
gmodel = models.transformers.LlamaChat(model=model, tokenizer=tokenizer)

That works, but using models.transformers.LlamaChat feels hacky, and then, how do I get control over the chat templating? Or do I need to use models.transformers.Llama, which I assume doesn't impose a chat template, and then choose my own chat template? These are questions I'll need to dig into the guidance code to answer, and that's a slight rough edge of this library.
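
One possible workaround, as an untested sketch building on the model and tokenizer loaded above (it assumes models.transformers.Llama accepts the same model/tokenizer keyword arguments as LlamaChat, and leans on transformers' apply_chat_template for the templating):

from guidance import models, gen

# model and tokenizer as loaded in the snippet above

# Render the chat turns with the tokenizer's bundled chat template...
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a haiku about VRAM."}],
    tokenize=False,
    add_generation_prompt=True,
)

# ...then drive the non-chat wrapper with the pre-templated string.
gmodel = models.transformers.Llama(model=model, tokenizer=tokenizer)
out = gmodel + prompt + gen("haiku", max_tokens=64)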

freckletonj avatar Jan 05 '24 05:01 freckletonj