[Feature request] Support GPTQ quantization
So I have a GPTQ llama model I downloaded (from TheBloke), and it's already 4 bit quantized. I have to pass in False for the load_in_4bit parameter of:
model, tokenizer = FastLlamaModel.from_pretrained(
because if I don't, I get an error thrown saying:
The model is already quantized with gptq. You can't quantize it again with bitsandbytes
But, if I pass in False for load_in_4bit, this code makes bnb_config be None:
bnb_config = None
if load_in_4bit:
bnb_config = BitsAndBytesConfig(
load_in_4bit = True,
bnb_4bit_use_double_quant = True,
bnb_4bit_quant_type = "nf4",
bnb_4bit_compute_dtype = dtype,
)
and that makes quantization_config be None as well:
quantization_config = bnb_config,
and that crashes here:
if hasattr(self, "quantization_config"):
output["quantization_config"] = (
self.quantization_config.to_dict()
with the error message:
'NoneType' object has no attribute 'to_dict'
So I'm not sure how to LoRA train this llama model. Any thoughts?
I tried adding:
[...] and self.quantization_config is not None:
to the end of that line there (and similar additions in two other places that came up), and it hasn't crashed, but it's now taking a very long time to load the model, so maybe it's doing some unwanted conversion?
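Combined, the guard I'm describing looks roughly like this when applied to the quoted snippet (sketch only; the surrounding HF serialization code is abbreviated):
if hasattr(self, "quantization_config") and self.quantization_config is not None:
    # Only serialize the quantization config when one is actually set.
    output["quantization_config"] = (
        self.quantization_config.to_dict()
    )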
Yeah, it finally 'loaded' but then it said some weights of the model checkpoint were not used when initializing LlamaForCausalLM, and it listed a giant list of weights, which I'm guessing was all of them.
Then the LoRA training crashed with:
Cannot copy out of meta tensor; no data!
So something definitely did not go well.
@araleza Oh no I don't think GPTQ models are supported as of yet
Currently only QLoRA via bitsandbytes is supported, hence all the error messages. If GPTQ is a super popular request, I will add it in - the dequantization steps will just be replaced, but I will have to read up on how GPTQ does it internally.
For now, is it possible to use a non GPTQ quantized model?
I don't know actually... I've only done LoRA training with oobabooga's Training tab, and it can only do LoRA training with unquantized models, or GPTQ models (which you have to load with the Transformers loader). So I don't know how to load a quantized model of any format except GPTQ onto my GPU. Any tips for which format to use instead, but still have it fit on my 24GB GPU?
@araleza Would it be possible to try loading a non-quantized model, then pass load_in_4bit = True via Unsloth? It should load into your CPU / RAM, then it quantizes and loads it onto the GPU.
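A minimal sketch of that flow, assuming a full-precision checkpoint (the model name here is just an example):
from unsloth import FastLanguageModel

# Load a full-precision checkpoint and let Unsloth quantize it to 4-bit
# (NF4 via bitsandbytes) on the way to the GPU. Model name is illustrative.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "mistralai/Mistral-7B-v0.1",
    max_seq_length = 2048,
    dtype = None,           # auto-detect fp16 / bf16
    load_in_4bit = True,    # quantize during loading
)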
I'll see for a future release if I can add GPTQ support!
I was actually just reading up on HQQ (half-quadratic quantization) https://github.com/mobiusml/hqq and maybe I'll be adding HQQ instead of GPTQ, since HQQ has no need for data calibration whilst GPTQ does.
Sounds good. I think you've got two groups of people who want to use your software:
- people who have a big model and big training data, and want the fine tuning to be faster
- people with 24GB cards who want to train larger models, but without quantizing them so badly that the training is meaningless.
Supporting HQQ would help the people in group 2, like me.
@araleza Cool I'll get on with HQQ! It seems like even Mixtral can supposedly fit on a 24GB card!
But HQQ supports 8, 4, 3 and 2 bit quantization so it'll be pretty useful!
@danielhanchen happy to pitch in with quantization (or other feature requests). let me know how best to contribute!
@jeromeku More than happy to collaborate! I was actually taking a look at GPTQ the other day - I guess technically Unsloth can add in GPTQ during training - what we need is to port the dequantization kernels from GPTQ to float16 / bfloat16, and if that works, then GPTQ will be supported.
For now, I'm using bitsandbytes's dequantization kernels.
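For reference, the bitsandbytes path looks roughly like the following, assuming a bnb Linear4bit layer (exact call details may differ between bnb versions):
import torch
import bitsandbytes.functional as F

# Dequantize an NF4-packed Linear4bit weight back to fp16 so it can be used
# in a normal matmul. `layer` is assumed to be a bitsandbytes Linear4bit.
W = F.dequantize_4bit(layer.weight.data, layer.weight.quant_state)
W = W.to(torch.float16)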
Again more than happy to collaborate if you're interested!
@danielhanchen
That should work -- this is what QLoRA does under the hood for the non-LoRA weights, right? I.e., it dequantizes the 'frozen' weights to f16 / bf16 in order to pass grads through the non-LoRA layers.
I can take a crack at this if you're more keen on working on hqq
...
@jeromeku I'll investigate GPTQ's dequant kernels as well! But if you're interested in adding GPTQ support - I'm more than happy for a few more OSS collaborators!
Essentially, the main gist of it:
1. Find how GPTQ dequantizes its quantized weights to float16 / bfloat16 (rough sketch below).
2. Extract this functionality from, say, Huggingface internals or some other provider like Exllama / llama.cpp etc.
3. Replace fast_dequantize with GPTQ equivalent kernels.
4. Fix up a few lines where Linear4bit naming conventions are seen, with GPTQ equivalent conventions.
5. If 3 works as is, then Unsloth is now GPTQ compatible!
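For step 1, here is a rough pure-PyTorch sketch of GPTQ dequantization, assuming AutoGPTQ-style packing (8 x 4-bit values per int32, per-group scales / zeros selected via g_idx). Zero-point offset conventions differ between GPTQ variants, so treat it as illustrative rather than a drop-in kernel:
import torch

def gptq_dequantize(qweight, qzeros, scales, g_idx, bits=4):
    """Unpack AutoGPTQ-style packed weights back to float16 (sketch).

    Assumed layout: qweight is int32 [in_features // 8, out_features] with
    8 x 4-bit values per int32, qzeros is int32 [n_groups, out_features // 8],
    scales is fp16 [n_groups, out_features], and g_idx maps each input row to
    its quantization group. Some variants also add a +1 offset to the zeros.
    """
    mask = (1 << bits) - 1
    shifts = torch.arange(0, 32, bits, device=qweight.device)

    # Unpack weights: [in // 8, out] -> [in, out]
    iweight = torch.bitwise_right_shift(qweight.unsqueeze(1), shifts.view(1, -1, 1)) & mask
    iweight = iweight.reshape(-1, qweight.shape[1])

    # Unpack zero points: [n_groups, out // 8] -> [n_groups, out]
    izeros = torch.bitwise_right_shift(qzeros.unsqueeze(2), shifts.view(1, 1, -1)) & mask
    izeros = izeros.reshape(qzeros.shape[0], -1)

    # Broadcast per-group scales / zeros to each input row and dequantize.
    # Note: the result is [in, out], i.e. the transpose of nn.Linear's [out, in].
    zeros = izeros[g_idx].float()
    scale = scales[g_idx].float()
    W = scale * (iweight.float() - zeros)
    return W.to(torch.float16)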
If you wanna take a crack at that - I'll be super grateful! In fact just step 1 or 2 is enough for a general GPTQ integration!
@danielhanchen Will work on it!
@jeromeku Great! If you need any help - ask away! I guess we can use this Github issue as a central discussion area. I'll see if I have some time on GPTQ - probably next week ish - I'm trying to work on some other stuff currently.
Again thanks!
@danielhanchen
Trying to understand design decisions / coding style of the library.
What is the purpose of patching {Mistral, Llama}_fast_forward when initializing Mistral (pre_patch)? It seems you are extracting sections directly from the original HF implementations of these layers (which already support flash-attn2) and in some cases using xformers for some of the ops.
Why the use of pass after every function? This is (AFAIK) a rather unconventional Python coding style?
@jeromeku pre_patch essentially just patches some portions of each function to call their relevant efficient implementation - i.e., as you mentioned, some xformers, some FA2.
Oh ye sorry on my coding style - I came from a C++ / C background, so I generally like all functions / if / for loops etc to be "enclosed" to make it "look" compartmentalized.
But you can have whatever coding style you like - for eg I like spaces around the equals in variable assignments, whilst the general style is var=2 and not var = 2. It definitely comes from my C background!!
If you're contributing code - I don't mind on style - that's the least of worries! :)) You can use any style you desire - it just has to work :)
@danielhanchen
Any tools / tests you use to check the correctness of gradient implementations?
@jeromeku Oh lol what I do is get HF to do the training, copy paste the training losses to Google Sheets, then with your updated gradient implementation, check whether the new training losses are mostly identical.
Another approach is to use torch.dist or torch.allclose on W.grad and new_W.grad to confirm the gradients. You'll have to do e.g. loss.backward(Y) to get the gradients.
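A minimal sketch of that kind of check, using toy Linear layers as stand-ins for the reference and rewritten implementations (swap in the real HF / Unsloth modules for an actual comparison):
import copy
import torch

# Toy "reference vs rewritten" layers with identical weights, so outputs and
# gradients should match up to dtype noise.
ref_layer = torch.nn.Linear(4096, 4096, bias=False, dtype=torch.float16, device="cuda")
new_layer = copy.deepcopy(ref_layer)

torch.manual_seed(0)
x_ref = torch.randn(2, 16, 4096, device="cuda", dtype=torch.float16, requires_grad=True)
x_new = x_ref.detach().clone().requires_grad_(True)
dY = torch.randn(2, 16, 4096, device="cuda", dtype=torch.float16)

out_ref = ref_layer(x_ref)
out_new = new_layer(x_new)
out_ref.backward(dY)   # backprop a fake upstream gradient dY
out_new.backward(dY)

print(torch.allclose(out_ref, out_new, atol=1e-3))
print(torch.dist(ref_layer.weight.grad.float(), new_layer.weight.grad.float()))
print(torch.dist(x_ref.grad.float(), x_new.grad.float()))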
@danielhanchen
Ok, was wondering if there was a more efficient way to do this verification. I was trying to use torch.autograd.gradcheck, but it runs into issues with large inputs / outputs and mixed precision, since it needs to realize the full VJP during the numerical / analytical gradient calc.
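A toy illustration of that constraint: gradcheck wants float64 inputs and only stays tractable at tiny sizes:
import torch

# gradcheck compares analytical vs numerical Jacobians, so it needs float64
# and small shapes to be feasible; fp16 / 4096-dim activations are out.
lin = torch.nn.Linear(8, 4, dtype=torch.float64)
x = torch.randn(2, 8, dtype=torch.float64, requires_grad=True)
print(torch.autograd.gradcheck(lin, (x,)))  # True at toy scale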
I've adapted GPTQ code to re-implement the fast_lora custom fwd / bwd and should have the rest done by early next week.
A minimal way to check the gradient is being calculated correctly -- akin to a unit test -- without having to do a training run would be a worthwhile effort both for existing and future implementations.
@jeromeku Actually I did technically make some functions to check gradients somewhere - I manually made some random inputs and some random outputs, then backpropagated with torch.autograd.backward(outputs), and checked every item's .grad to confirm it - I just need to find where I wrote it :))
@danielhanchen
I wrote a small test script to do gradient checking:
import torch
from datasets import load_dataset

# 4bit pre quantized models we support for 4x faster downloading!
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from torch.utils.data import DataLoader
from unsloth import FastLanguageModel

DTYPE = torch.float16


def get_model(
    model_id="unsloth/mistral-7b-bnb-4bit",
    reference=True,
    max_seq_length=2048,
    dtype=torch.float16,
    load_in_4bit=True,
    init_lora_weights=False,
    upcast=True,
):
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_id,
        max_seq_length=max_seq_length,
        dtype=dtype,
        load_in_4bit=load_in_4bit,
    )
    lora_config = LoraConfig(
        r=16,
        target_modules=[
            "q_proj",
            "k_proj",
            "v_proj",
            "o_proj",
            "gate_proj",
            "up_proj",
            "down_proj",
        ],
        lora_alpha=16,
        lora_dropout=0,
        bias="none",
        task_type="CAUSAL_LM",
        init_lora_weights=init_lora_weights,
    )
    if reference:
        model = prepare_model_for_kbit_training(
            model,
            use_gradient_checkpointing=True,
            gradient_checkpointing_kwargs={"use_reentrant": True},
        )
        model = get_peft_model(model, lora_config)
    else:
        config = lora_config.to_dict()
        del config["task_type"]
        model = FastLanguageModel.get_peft_model(
            model,
            use_gradient_checkpointing=True,
            random_state=3407,
            max_seq_length=max_seq_length,
            upcast=upcast,
            **config,
        )
    return model, tokenizer


ref_model, _ = get_model(dtype=DTYPE)
test_model, _ = get_model(dtype=DTYPE, reference=False)


def check_grad(model, dtype, seed=0, scale=1):
    wrapped_model = model.model.model
    embed_layer = wrapped_model.embed_tokens
    self_attn = wrapped_model.layers[0].self_attn
    mlp = wrapped_model.layers[0].mlp

    torch.manual_seed(seed)
    with torch.autocast(device_type="cuda", dtype=dtype):
        # embeddings = embed_layer(inputs)
        embeddings = torch.randn(
            1, 1, embed_layer.weight.shape[1], dtype=dtype, requires_grad=True
        ).cuda()
        print(f"Attention input dtype: {embeddings.dtype}")
        attn_out, *_ = self_attn(embeddings)
        print(f"Attn out dtype: {attn_out.dtype}")
        mlp_out = mlp(attn_out)

    torch.manual_seed(seed)
    fake_grad_output = scale * torch.randn(mlp_out.shape, dtype=torch.float32).to(
        mlp_out.device
    )
    mlp_out.backward(fake_grad_output)

    return mlp_out, mlp, attn_out, fake_grad_output


mlp_out_ref, mlp_ref, attn_out_ref, fake_grad_ref = check_grad(ref_model, dtype=DTYPE)
print(
    "Grad check after reference backwards:",
    test_model.model.model.layers[0].mlp.down_proj.lora_B.default.weight.grad,
)
mlp_out, mlp, attn_out, fake_grad = check_grad(test_model, dtype=DTYPE)

ref_type = torch.float32
print()
print(
    f"Checking fake grad (dY): {torch.allclose(fake_grad.to(ref_type), fake_grad_ref.to(ref_type))}"
)
# torch.max(torch.abs(fake_grad.to(ref_type) - fake_grad_ref.to(ref_type)))
# torch.allclose(mlp_out.to(ref_type), mlp_out_ref.to(ref_type))

print("Checking mlp grads:")
for (n1, m1), (n2, m2) in zip(mlp.named_parameters(), mlp_ref.named_parameters()):
    if "lora" in n1 and "lora" in n2:
        n1 = ".".join(n1.split(".")[:2])
        print(f"{n1}")
        print(
            f"Mean grad:\n UNSLOTH: {m1.grad.max():.10f}\n REF: {m2.grad.mean():.10f}\nMax abs diff: {torch.max(torch.abs(m1.grad - m2.grad)):.10f}\nMean abs diff: {torch.mean(torch.abs(m1.grad - m2.grad)):.10f}"
        )
        print()

print("Checking attn grads:")
for (n1, m1), (n2, m2) in zip(
    ref_model.model.model.layers[0].self_attn.named_parameters(),
    test_model.model.model.layers[0].self_attn.named_parameters(),
):
    if "lora" in n1 and "lora" in n2:
        # torch.allclose(m1.grad.to(dtype), m2.grad.to(dtype))
        n1 = ".".join(n1.split(".")[:2])
        print(f"{n1}")
        print(
            f"Mean grad:\n UNSLOTH: {m1.grad.max():.10f}\n REF: {m2.grad.max():.10f}\nMax abs diff: {torch.max(torch.abs(m1.grad - m2.grad)):.10f}\nMean abs diff: {torch.mean(torch.abs(m1.grad - m2.grad)):.10f}"
        )
        print()
Note: there are small inconsistencies between prepare_model_for_kbit_training in unsloth vs. huggingface peft when doing QLoRA fine-tuning -- peft upcasts all non-INT8 params to fp32 -- see here.
I added an upcast kwarg to unsloth's FastLanguageModel.get_peft_model that is passed to prepare_model_for_kbit_training to replicate this behavior:
def prepare_model_for_kbit_training(
    model: Any,
    use_gradient_checkpointing: bool = True,
    use_reentrant: Optional[bool] = True,
    upcast=False,
) -> Any:
    """
    Calculates where to place the gradient checkpoints given n_layers.
    We also freeze all other layers' gradients.

    Args:
        model: Any LlamaModel with layers.
        use_gradient_checkpointing (`bool`, *optional*):
            Default enabled. Provides memory savings by not saving all activations,
            but only some.
        use_reentrant (`bool`, *optional*):
            https://github.com/pytorch/pytorch/blob/main/torch/utils/checkpoint.py#L354
            Optimal gradient checkpointing algorithm which will be the default in
            future Pytorch versions.
    """
    # Freeze all parameters
    for param in model.parameters():
        param.requires_grad_(False)

    # Cast non INT8 parameters to fp32
    if upcast:
        for param in model.parameters():
            if (param.dtype == torch.float16) or (param.dtype == torch.bfloat16):
                param.data = param.data.to(torch.float32)

    if use_gradient_checkpointing:
        model.gradient_checkpointing_enable()

    # If use_reentrant = True which is the Pytorch default, we just make the input requires_grad.
    if use_reentrant:
        if hasattr(model, "enable_input_require_grads"):
            model.enable_input_require_grads()
        else:

            def make_inputs_require_grad(module, input, output):
                output.requires_grad_(True)

            model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)

    return model
Here is the output from running the above script:
Checking mlp grads:
gate_proj.lora_A
Mean grad:
UNSLOTH: 0.0441589355
REF: 0.0000020351
Max abs diff: 0.1207160950
Mean abs diff: 0.0097856047
gate_proj.lora_B
Mean grad:
UNSLOTH: 0.0051155090
REF: 0.0000001698
Max abs diff: 0.0086461902
Mean abs diff: 0.0002924677
up_proj.lora_A
Mean grad:
UNSLOTH: 0.0850219727
REF: -0.0000299520
Max abs diff: 0.1020736694
Mean abs diff: 0.0135316616
up_proj.lora_B
Mean grad:
UNSLOTH: 0.0048866272
REF: -0.0000000757
Max abs diff: 0.0068296790
Mean abs diff: 0.0002973406
down_proj.lora_A
Mean grad:
UNSLOTH: 0.0928344727
REF: -0.0000352956
Max abs diff: 0.2047328949
Mean abs diff: 0.0073212739
down_proj.lora_B
Mean grad:
UNSLOTH: 0.0037288666
REF: 0.0000003116
Max abs diff: 0.0040407181
Mean abs diff: 0.0002820148
Checking attn grads:
q_proj.lora_A
Mean grad:
UNSLOTH: -0.0000000000
REF: -0.0000000000
Max abs diff: 0.0000000000
Mean abs diff: 0.0000000000
q_proj.lora_B
Mean grad:
UNSLOTH: 0.0000000000
REF: -0.0000000000
Max abs diff: 0.0000000000
Mean abs diff: 0.0000000000
k_proj.lora_A
Mean grad:
UNSLOTH: -0.0000000000
REF: -0.0000000000
Max abs diff: 0.0000000000
Mean abs diff: 0.0000000000
k_proj.lora_B
Mean grad:
UNSLOTH: -0.0000000000
REF: 0.0000000000
Max abs diff: 0.0000000000
Mean abs diff: 0.0000000000
v_proj.lora_A
Mean grad:
UNSLOTH: 0.1055297852
REF: 0.1329345703
Max abs diff: 0.1655731201
Mean abs diff: 0.0144135132
v_proj.lora_B
Mean grad:
UNSLOTH: 0.0139694214
REF: 0.0166625977
Max abs diff: 0.0193632841
Mean abs diff: 0.0024413881
o_proj.lora_A
Mean grad:
UNSLOTH: 0.1630859375
REF: 0.1149902344
Max abs diff: 0.1842651367
Mean abs diff: 0.0191203523
o_proj.lora_B
Mean grad:
UNSLOTH: 0.0102157593
REF: 0.0053596497
Max abs diff: 0.0119572878
Mean abs diff: 0.0010805393
Thoughts?
@jeromeku Great work! Some pointers:
- torch.manual_seed sadly does not actually work on GPUs - torch.cuda.manual_seed is the one you want!!
- torch.randn can also take device = "cuda" - so I guess my first point about manual_seed is irrelevant, since you're copying from CPU to GPU.
- Yep, one issue is the upcasting to float32, which is one of the optimizations we found for VRAM reduction.
- You can see there are error differences - mainly due to Flash Attention: Pytorch does Q @ K.T and other attention ops in float16, whilst FA upcasts internally to fp32, which makes it more equivalent to full float32 training - hence the error differences (tiny illustration below).
- I think the reference model you used does not have FA enabled.
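A toy example of that fp16 vs fp32 effect, just to show the order of magnitude involved (shapes are arbitrary):
import torch

# Same Q @ K.T with fp16 inputs / outputs vs full fp32: rounding the scores
# to fp16 already introduces a discrepancy before softmax / V are involved.
torch.manual_seed(0)
Q = torch.randn(32, 128, dtype=torch.float16, device="cuda")
K = torch.randn(32, 128, dtype=torch.float16, device="cuda")
scores_fp16 = (Q @ K.T) * (128 ** -0.5)
scores_fp32 = (Q.float() @ K.float().T) * (128 ** -0.5)
print((scores_fp16.float() - scores_fp32).abs().max())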
But ye - great work again - super useful script :)))
@danielhanchen
What do you consider a permissible range of gradient discrepancies between the unsloth and the reference HF implementations?
I.e., there are differences (e.g., up_proj) that are on the same order of magnitude as the mean grads themselves -- can this be chalked up to the use of f32 vs f16 ...
@jeromeku Ye, one of the issues I found as well when verifying Unsloth vs normal HF - that's why for now I opted to just compare training losses directly.
@danielhanchen
Just wanted to give a quick update:
- I have a working implementation of gptq fast_lora.
  - I patched in a triton quantized matmul kernel into the existing fused forward / backward layers.
  - Training works and the losses are on par with the default HF gptq fine-tuner (the non-fused, torch-only GPTQ fine-tuning model you get if you provide a gptq quantized model to the standard from_pretrained loader).
  - However, the training runs are slower than the default HF model (and also the unsloth bnb version).
- Need to do some additional profiling / debugging to see where the problems are and whether a torch.compile version of the quantized matmul kernel outperforms the triton kernel (rough timing harness below).
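A rough timing harness for that comparison; triton_matmul and compiled_matmul are placeholders for the actual kernel launchers:
import torch
from torch.utils import benchmark

# Compare candidate matmul implementations on a representative shape.
def bench(fn, label, x, w):
    t = benchmark.Timer(
        stmt="fn(x, w)",
        globals={"fn": fn, "x": x, "w": w},
        label=label,
    )
    print(t.timeit(100))

x = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
w = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
bench(torch.matmul, "cublas fp16 baseline", x, w)
# bench(triton_matmul, "triton dequant-matmul", x, w)       # placeholder
# bench(compiled_matmul, "torch.compile dequant-matmul", x, w)  # placeholder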
@jeromeku Super great work! Are you testing it on a Tesla T4 or an Ampere-based GPU? I found Triton kernels on older GPUs to be noticeably slower.
Also I found through experimentation that instead of writing 1 full fused kernel for the matrix mult and dequantization, it's better to split it into 2. The dequant step should only take 1-2ms, whilst the matrix mult takes 30ms or so. The compiler can be "confused" on the dequant steps, causing it to not optimize correctly, so I found using torch.matmul for the matmul itself to be most effective.
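Roughly, the split looks like this (dequantize is a placeholder for whichever kernel is in play - bnb, GPTQ, etc.; the LoRA shapes follow the usual A: [r, in], B: [out, r] convention):
import torch

# Two-step approach: dequantize the frozen base weight first (cheap), then
# leave the heavy matmul to cuBLAS via torch.matmul.
def lora_linear(X, quant_weight, quant_state, A, B, scaling, dequantize):
    W = dequantize(quant_weight, quant_state)       # -> fp16/bf16, shape [out, in]
    out = torch.matmul(X, W.t())                    # base path
    out += torch.matmul(torch.matmul(X, A.t()), B.t()) * scaling  # LoRA path
    return out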
@danielhanchen I've been testing on an Ampere-based GPU (A6000).
- Going to do some additional profiling to determine bottlenecks vs. vanilla
HF
implementation and theunsloth
bnb
version. - Additional optimizations after above analysis.
- Will post a
draft
PR to make collab easier.
@jeromeku Oh ok cool! If I have to guess, it's that NVCC / the Triton compiler is not optimizing "properly" - also, did you use the matmul Triton autotuner? It could be that, maybe?