[Feature request] Support GPTQ quantization
So I have a GPTQ llama model I downloaded (from TheBloke), and it's already 4 bit quantized. I have to pass in False for the load_in_4bit parameter of:
model, tokenizer = FastLlamaModel.from_pretrained(
because if I don't, I get an error thrown saying:
The model is already quantized with gptq. You can't quantize it again with bitsandbytes
But, if I pass in False for load_in_4bit, this code makes bnb_config be None:
bnb_config = None
if load_in_4bit:
bnb_config = BitsAndBytesConfig(
load_in_4bit = True,
bnb_4bit_use_double_quant = True,
bnb_4bit_quant_type = "nf4",
bnb_4bit_compute_dtype = dtype,
)
and that makes quantization_config be None as well:
quantization_config = bnb_config,
and that crashes here:
if hasattr(self, "quantization_config"):
output["quantization_config"] = (
self.quantization_config.to_dict()
with the error message:
'NoneType' object has no attribute 'to_dict'
So I'm not sure how to LoRA train this llama model. Any thoughts?
I tried adding:
[...] and self.quantization_config is not None:
to the end of that line there (and similar additions in two other places that came up), and it hasn't crashed, but it's now taking a very long time to load the model, so maybe it's doing some unwanted conversion?
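Combined, the guard I'm describing looks roughly like this when applied to the quoted snippet (sketch only; the surrounding HF serialization code is abbreviated):
if hasattr(self, "quantization_config") and self.quantization_config is not None:
    # Only serialize the quantization config when one is actually set.
    output["quantization_config"] = (
        self.quantization_config.to_dict()
    )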
Yeah, it finally 'loaded' but then it said some weights of the model checkpoint were not used when initializing LlamaForCausalLM, and it listed a giant list of weights, which I'm guessing was all of them.
Then the LoRA training crashed with:
Cannot copy out of meta tensor; no data!
So something definitely did not go well.
@araleza Oh no I don't think GPTQ models are supported as of yet
Currently only QLoRA via bitsandbytes is supported, hence all the error messages. If GPTQ is a super popular request, I will add it in - the dequantization steps will just be replaced, but I will have to read up on how GPTQ does it internally.
For now, is it possible to use a non GPTQ quantized model?
I don't know actually... I've only done LoRA training with oobabooga's Training tab, and it can only do LoRA training with unquantized models, or GPTQ models (which you have to load with the Transformers loader). So I don't know how to load a quantized model of any format except GPTQ onto my GPU. Any tips for which format to use instead, but still have it fit on my 24GB GPU?
@araleza Would it be possible to try loading a non-quantized model, then pass load_in_4bit = True via Unsloth? It should load into your CPU / RAM, then it quantizes and loads it onto the GPU.
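A minimal sketch of that flow, assuming a full-precision checkpoint (the model name here is just an example):
from unsloth import FastLanguageModel

# Load a full-precision checkpoint and let Unsloth quantize it to 4-bit
# (NF4 via bitsandbytes) on the way to the GPU. Model name is illustrative.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "mistralai/Mistral-7B-v0.1",
    max_seq_length = 2048,
    dtype = None,           # auto-detect fp16 / bf16
    load_in_4bit = True,    # quantize during loading
)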
I'll see for a future release if I can add GPTQ support!
I was actually just reading up on HQQ (half-quadratic quantization) https://github.com/mobiusml/hqq and maybe I'll be adding HQQ instead of GPTQ, since HQQ has no need for data calibration whilst GPTQ does.
Sounds good. I think you've got two groups of people who want to use your software:
- people who have a big model and big training data, and want the fine tuning to be faster
- people with 24GB cards who want to train larger models, but without quantizing them so badly that the training is meaningless.
Supporting HQQ would help the people in group 2, like me.
@araleza Cool I'll get on with HQQ! It seems like even Mixtral can supposedly fit on a 24GB card!
But HQQ supports 8, 4, 3 and 2 bit quantization so it'll be pretty useful!
@danielhanchen happy to pitch in with quantization (or other feature requests). let me know how best to contribute!
@jeromeku More than happy to collaborate! I was actually taking a look at GPTQ the other day - I guess technically Unsloth can add in GPTQ during training - what we need is to port the dequantization kernels from GPTQ to float16 / bfloat16, and if that works, then GPTQ will be supported.
For now, I'm using bitsandbytes's dequantization kernels.
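For reference, the bitsandbytes path looks roughly like the following, assuming a bnb Linear4bit layer (exact call details may differ between bnb versions):
import torch
import bitsandbytes.functional as F

# Dequantize an NF4-packed Linear4bit weight back to fp16 so it can be used
# in a normal matmul. `layer` is assumed to be a bitsandbytes Linear4bit.
W = F.dequantize_4bit(layer.weight.data, layer.weight.quant_state)
W = W.to(torch.float16)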
Again more than happy to collaborate if you're interested!
@danielhanchen
That should work -- this is what QLoRA does under the hood for the non-LoRA weights, right? I.e., it dequantizes the 'frozen' weights to f16 / bf16 in order to pass grads through the non-LoRA layers.
I can take a crack at this if you're more keen on working on hqq
...
@jeromeku I'll investigate GPTQ's dequant kernels as well! But if you're interested in adding GPTQ support - I'm more than happy for a few more OSS collaborators!
Essentially, the main gist of it:
1. Find how GPTQ dequantizes its quantized weights to float16 / bfloat16 (rough sketch below).
2. Extract this functionality from, say, Huggingface internals or some other provider like Exllama / llama.cpp etc.
3. Replace fast_dequantize with GPTQ equivalent kernels.
4. Fix up a few lines where Linear4bit naming conventions are seen, with GPTQ equivalent conventions.
5. If 3 works as is, then Unsloth is now GPTQ compatible!
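For step 1, here is a rough pure-PyTorch sketch of GPTQ dequantization, assuming AutoGPTQ-style packing (8 x 4-bit values per int32, per-group scales / zeros selected via g_idx). Zero-point offset conventions differ between GPTQ variants, so treat it as illustrative rather than a drop-in kernel:
import torch

def gptq_dequantize(qweight, qzeros, scales, g_idx, bits=4):
    """Unpack AutoGPTQ-style packed weights back to float16 (sketch).

    Assumed layout: qweight is int32 [in_features // 8, out_features] with
    8 x 4-bit values per int32, qzeros is int32 [n_groups, out_features // 8],
    scales is fp16 [n_groups, out_features], and g_idx maps each input row to
    its quantization group. Some variants also add a +1 offset to the zeros.
    """
    mask = (1 << bits) - 1
    shifts = torch.arange(0, 32, bits, device=qweight.device)

    # Unpack weights: [in // 8, out] -> [in, out]
    iweight = torch.bitwise_right_shift(qweight.unsqueeze(1), shifts.view(1, -1, 1)) & mask
    iweight = iweight.reshape(-1, qweight.shape[1])

    # Unpack zero points: [n_groups, out // 8] -> [n_groups, out]
    izeros = torch.bitwise_right_shift(qzeros.unsqueeze(2), shifts.view(1, 1, -1)) & mask
    izeros = izeros.reshape(qzeros.shape[0], -1)

    # Broadcast per-group scales / zeros to each input row and dequantize.
    # Note: the result is [in, out], i.e. the transpose of nn.Linear's [out, in].
    zeros = izeros[g_idx].float()
    scale = scales[g_idx].float()
    W = scale * (iweight.float() - zeros)
    return W.to(torch.float16)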
If you wanna take a crack at that - I'll be super grateful! In fact just step 1 or 2 is enough for a general GPTQ integration!
@danielhanchen Will work on it!
@jeromeku Great! If you need any help - ask away! I guess we can use this Github issue as a central discussion area. I'll see if I have some time on GPTQ - probably next week ish - I'm trying to work on some other stuff currently.
Again thanks!
@danielhanchen
Trying to understand design decisions / coding style of the library.
What is the purpose of patching {Mistral, Llama}_fast_forward when initializing Mistral (pre_patch)? It seems you are extracting sections directly from the original HF implementations of these layers (which already support flash-attn2) and in some cases using xformers for some of the ops.
Why the use of pass after every function? This is (AFAIK) a rather unconventional Python coding style?
@jeromeku pre_patch essentially just patches some portions of each function to call their relevant efficient implementation - i.e., as you mentioned, some xformers, some FA2.
Oh ye sorry on my coding style - I came from a C++ / C background, so I generally like all functions / if / for loops etc to be "enclosed" to make it "look" compartmentalized.
But you can have whatever coding style you like - for eg I like spaces around the equals in variable assignments, whilst the general style is var=2 and not var = 2. It definitely comes from my C background!!
If you're contributing code - I don't mind on style - that's the least of worries! :)) You can use any style you desire - it just has to work :)
@danielhanchen
Any tools / tests you use to check the correctness of gradient implementations?
@jeromeku Oh lol what I do is get HF to do the training, copy paste the training losses to Google Sheets, then with your updated gradient implementation, check whether the new training losses are mostly identical.
Another approach is to use torch.dist or torch.allclose on W.grad and new_W.grad to confirm the gradients. You'll have to do e.g. loss.backward(Y) to get the gradients.
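A minimal sketch of that kind of check, using toy Linear layers as stand-ins for the reference and rewritten implementations (swap in the real HF / Unsloth modules for an actual comparison):
import copy
import torch

# Toy "reference vs rewritten" layers with identical weights, so outputs and
# gradients should match up to dtype noise.
ref_layer = torch.nn.Linear(4096, 4096, bias=False, dtype=torch.float16, device="cuda")
new_layer = copy.deepcopy(ref_layer)

torch.manual_seed(0)
x_ref = torch.randn(2, 16, 4096, device="cuda", dtype=torch.float16, requires_grad=True)
x_new = x_ref.detach().clone().requires_grad_(True)
dY = torch.randn(2, 16, 4096, device="cuda", dtype=torch.float16)

out_ref = ref_layer(x_ref)
out_new = new_layer(x_new)
out_ref.backward(dY)   # backprop a fake upstream gradient dY
out_new.backward(dY)

print(torch.allclose(out_ref, out_new, atol=1e-3))
print(torch.dist(ref_layer.weight.grad.float(), new_layer.weight.grad.float()))
print(torch.dist(x_ref.grad.float(), x_new.grad.float()))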
@danielhanchen
Ok, was wondering if there was a more efficient way to do this verification. I was trying to use torch.autograd.gradcheck, but it runs into issues with large inputs / outputs and mixed precision, since it needs to realize the full VJP during the numerical / analytical gradient calc.
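A toy illustration of that constraint: gradcheck wants float64 inputs and only stays tractable at tiny sizes:
import torch

# gradcheck compares analytical vs numerical Jacobians, so it needs float64
# and small shapes to be feasible; fp16 / 4096-dim activations are out.
lin = torch.nn.Linear(8, 4, dtype=torch.float64)
x = torch.randn(2, 8, dtype=torch.float64, requires_grad=True)
print(torch.autograd.gradcheck(lin, (x,)))  # True at toy scale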
I've adapted GPTQ code to re-implement the fast_lora custom fwd / bwd and should have the rest done by early next week.
A minimal way to check the gradient is being calculated correctly -- akin to a unit test -- without having to do a training run would be a worthwhile effort both for existing and future implementations.
@jeromeku Actually I did technically make some functions to check gradients somewhere - I manually made some random inputs and some random outputs, then backpropagated with torch.autograd.backward(outputs), and checked every item's .grad to confirm it - I just need to find where I wrote it :))
@danielhanchen
I wrote a small test script to do gradient checking:
import torch
from datasets import load_dataset

# 4bit pre quantized models we support for 4x faster downloading!
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from torch.utils.data import DataLoader
from unsloth import FastLanguageModel

DTYPE = torch.float16


def get_model(
    model_id="unsloth/mistral-7b-bnb-4bit",
    reference=True,
    max_seq_length=2048,
    dtype=torch.float16,
    load_in_4bit=True,
    init_lora_weights=False,
    upcast=True,
):
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_id,
        max_seq_length=max_seq_length,
        dtype=dtype,
        load_in_4bit=load_in_4bit,
    )
    lora_config = LoraConfig(
        r=16,
        target_modules=[
            "q_proj",
            "k_proj",
            "v_proj",
            "o_proj",
            "gate_proj",
            "up_proj",
            "down_proj",
        ],
        lora_alpha=16,
        lora_dropout=0,
        bias="none",
        task_type="CAUSAL_LM",
        init_lora_weights=init_lora_weights,
    )
    if reference:
        model = prepare_model_for_kbit_training(
            model,
            use_gradient_checkpointing=True,
            gradient_checkpointing_kwargs={"use_reentrant": True},
        )
        model = get_peft_model(model, lora_config)
    else:
        config = lora_config.to_dict()
        del config["task_type"]
        model = FastLanguageModel.get_peft_model(
            model,
            use_gradient_checkpointing=True,
            random_state=3407,
            max_seq_length=max_seq_length,
            upcast=upcast,
            **config,
        )
    return model, tokenizer


ref_model, _ = get_model(dtype=DTYPE)
test_model, _ = get_model(dtype=DTYPE, reference=False)


def check_grad(model, dtype, seed=0, scale=1):
    wrapped_model = model.model.model
    embed_layer = wrapped_model.embed_tokens
    self_attn = wrapped_model.layers[0].self_attn
    mlp = wrapped_model.layers[0].mlp

    torch.manual_seed(seed)
    with torch.autocast(device_type="cuda", dtype=dtype):
        # embeddings = embed_layer(inputs)
        embeddings = torch.randn(
            1, 1, embed_layer.weight.shape[1], dtype=dtype, requires_grad=True
        ).cuda()
        print(f"Attention input dtype: {embeddings.dtype}")
        attn_out, *_ = self_attn(embeddings)
        print(f"Attn out dtype: {attn_out.dtype}")
        mlp_out = mlp(attn_out)

    torch.manual_seed(seed)
    fake_grad_output = scale * torch.randn(mlp_out.shape, dtype=torch.float32).to(
        mlp_out.device
    )
    mlp_out.backward(fake_grad_output)

    return mlp_out, mlp, attn_out, fake_grad_output


mlp_out_ref, mlp_ref, attn_out_ref, fake_grad_ref = check_grad(ref_model, dtype=DTYPE)
print(
    "Grad check after reference backwards:",
    test_model.model.model.layers[0].mlp.down_proj.lora_B.default.weight.grad,
)
mlp_out, mlp, attn_out, fake_grad = check_grad(test_model, dtype=DTYPE)

ref_type = torch.float32
print()
print(
    f"Checking fake grad (dY): {torch.allclose(fake_grad.to(ref_type), fake_grad_ref.to(ref_type))}"
)
# torch.max(torch.abs(fake_grad.to(ref_type) - fake_grad_ref.to(ref_type)))
# torch.allclose(mlp_out.to(ref_type), mlp_out_ref.to(ref_type))

print("Checking mlp grads:")
for (n1, m1), (n2, m2) in zip(mlp.named_parameters(), mlp_ref.named_parameters()):
    if "lora" in n1 and "lora" in n2:
        n1 = ".".join(n1.split(".")[:2])
        print(f"{n1}")
        print(
            f"Mean grad:\n UNSLOTH: {m1.grad.max():.10f}\n REF: {m2.grad.mean():.10f}\nMax abs diff: {torch.max(torch.abs(m1.grad - m2.grad)):.10f}\nMean abs diff: {torch.mean(torch.abs(m1.grad - m2.grad)):.10f}"
        )
        print()

print("Checking attn grads:")
for (n1, m1), (n2, m2) in zip(
    ref_model.model.model.layers[0].self_attn.named_parameters(),
    test_model.model.model.layers[0].self_attn.named_parameters(),
):
    if "lora" in n1 and "lora" in n2:
        # torch.allclose(m1.grad.to(dtype), m2.grad.to(dtype))
        n1 = ".".join(n1.split(".")[:2])
        print(f"{n1}")
        print(
            f"Mean grad:\n UNSLOTH: {m1.grad.max():.10f}\n REF: {m2.grad.max():.10f}\nMax abs diff: {torch.max(torch.abs(m1.grad - m2.grad)):.10f}\nMean abs diff: {torch.mean(torch.abs(m1.grad - m2.grad)):.10f}"
        )
        print()
Note: there are small inconsistencies between prepare_model_for_kbit_training in unsloth vs. huggingface peft when doing QLoRA fine-tuning -- peft upcasts all non-INT8 params to fp32 -- see here.
I added an upcast kwarg to unsloth's FastLanguageModel.get_peft_model that is passed to prepare_model_for_kbit_training to replicate this behavior:
def prepare_model_for_kbit_training(
    model: Any,
    use_gradient_checkpointing: bool = True,
    use_reentrant: Optional[bool] = True,
    upcast=False,
) -> Any:
    """
    Calculates where to place the gradient checkpoints given n_layers.
    We also freeze all other layers' gradients.

    Args:
        model: Any LlamaModel with layers.
        use_gradient_checkpointing (`bool`, *optional*):
            Default enabled. Provides memory savings by not saving all activations,
            but only some.
        use_reentrant (`bool`, *optional*):
            https://github.com/pytorch/pytorch/blob/main/torch/utils/checkpoint.py#L354
            Optimal gradient checkpointing algorithm which will be the default in
            future Pytorch versions.
    """
    # Freeze all parameters
    for param in model.parameters():
        param.requires_grad_(False)

    # Cast non INT8 parameters to fp32
    if upcast:
        for param in model.parameters():
            if (param.dtype == torch.float16) or (param.dtype == torch.bfloat16):
                param.data = param.data.to(torch.float32)

    if use_gradient_checkpointing:
        model.gradient_checkpointing_enable()

    # If use_reentrant = True which is the Pytorch default, we just make the input requires_grad.
    if use_reentrant:
        if hasattr(model, "enable_input_require_grads"):
            model.enable_input_require_grads()
        else:

            def make_inputs_require_grad(module, input, output):
                output.requires_grad_(True)

            model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)

    return model
Here is the output from running the above script:
Checking mlp grads:
gate_proj.lora_A
Mean grad:
UNSLOTH: 0.0441589355
REF: 0.0000020351
Max abs diff: 0.1207160950
Mean abs diff: 0.0097856047
gate_proj.lora_B
Mean grad:
UNSLOTH: 0.0051155090
REF: 0.0000001698
Max abs diff: 0.0086461902
Mean abs diff: 0.0002924677
up_proj.lora_A
Mean grad:
UNSLOTH: 0.0850219727
REF: -0.0000299520
Max abs diff: 0.1020736694
Mean abs diff: 0.0135316616
up_proj.lora_B
Mean grad:
UNSLOTH: 0.0048866272
REF: -0.0000000757
Max abs diff: 0.0068296790
Mean abs diff: 0.0002973406
down_proj.lora_A
Mean grad:
UNSLOTH: 0.0928344727
REF: -0.0000352956
Max abs diff: 0.2047328949
Mean abs diff: 0.0073212739
down_proj.lora_B
Mean grad:
UNSLOTH: 0.0037288666
REF: 0.0000003116
Max abs diff: 0.0040407181
Mean abs diff: 0.0002820148
Checking attn grads:
q_proj.lora_A
Mean grad:
UNSLOTH: -0.0000000000
REF: -0.0000000000
Max abs diff: 0.0000000000
Mean abs diff: 0.0000000000
q_proj.lora_B
Mean grad:
UNSLOTH: 0.0000000000
REF: -0.0000000000
Max abs diff: 0.0000000000
Mean abs diff: 0.0000000000
k_proj.lora_A
Mean grad:
UNSLOTH: -0.0000000000
REF: -0.0000000000
Max abs diff: 0.0000000000
Mean abs diff: 0.0000000000
k_proj.lora_B
Mean grad:
UNSLOTH: -0.0000000000
REF: 0.0000000000
Max abs diff: 0.0000000000
Mean abs diff: 0.0000000000
v_proj.lora_A
Mean grad:
UNSLOTH: 0.1055297852
REF: 0.1329345703
Max abs diff: 0.1655731201
Mean abs diff: 0.0144135132
v_proj.lora_B
Mean grad:
UNSLOTH: 0.0139694214
REF: 0.0166625977
Max abs diff: 0.0193632841
Mean abs diff: 0.0024413881
o_proj.lora_A
Mean grad:
UNSLOTH: 0.1630859375
REF: 0.1149902344
Max abs diff: 0.1842651367
Mean abs diff: 0.0191203523
o_proj.lora_B
Mean grad:
UNSLOTH: 0.0102157593
REF: 0.0053596497
Max abs diff: 0.0119572878
Mean abs diff: 0.0010805393
Thoughts?
@jeromeku Great work! Some pointers:
- torch.manual_seed sadly does not actually work on GPUs - torch.cuda.manual_seed is the one you want!!
- torch.randn can also take device = "cuda" - so I guess my first point about manual_seed is irrelevant, since you're copying from CPU to GPU.
- Yep, one issue is the upcasting to float32, which is one of the optimizations we found for VRAM reduction.
- You can see there are error differences - mainly due to Flash Attention: Pytorch does Q @ K.T and other attention ops in float16, whilst FA upcasts internally to fp32, which makes it more equivalent to full float32 training - hence the error differences (tiny illustration below).
- I think the reference model you used does not have FA enabled.
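A toy example of that fp16 vs fp32 effect, just to show the order of magnitude involved (shapes are arbitrary):
import torch

# Same Q @ K.T with fp16 inputs / outputs vs full fp32: rounding the scores
# to fp16 already introduces a discrepancy before softmax / V are involved.
torch.manual_seed(0)
Q = torch.randn(32, 128, dtype=torch.float16, device="cuda")
K = torch.randn(32, 128, dtype=torch.float16, device="cuda")
scores_fp16 = (Q @ K.T) * (128 ** -0.5)
scores_fp32 = (Q.float() @ K.float().T) * (128 ** -0.5)
print((scores_fp16.float() - scores_fp32).abs().max())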
But ye - great work again - super useful script :)))
@danielhanchen
What do you consider a permissible range of gradient discrepancies between the unsloth and the reference HF implementations?
I.e., there are differences (e.g., up_proj) that are on the same order of magnitude as the mean grads themselves -- can this be chalked up to the use of f32 vs f16 ...
@jeromeku Ye, one of the issues I found as well when verifying Unsloth vs normal HF - that's why for now I opted to just compare training losses directly.
@danielhanchen
Just wanted to give a quick update:
- I have a working implementation of gptq fast_lora.
  - I patched in a triton quantized matmul kernel into the existing fused forward / backward layers.
  - Training works and the losses are on par with the default HF gptq fine-tuner (the non-fused, torch-only GPTQ fine-tuning model you get if you provide a gptq quantized model to the standard from_pretrained loader).
  - However, the training runs are slower than the default HF model (and also the unsloth bnb version).
- Need to do some additional profiling / debugging to see where the problems are and whether a torch.compile version of the quantized matmul kernel outperforms the triton kernel (rough timing harness below).
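A rough timing harness for that comparison; triton_matmul and compiled_matmul are placeholders for the actual kernel launchers:
import torch
from torch.utils import benchmark

# Compare candidate matmul implementations on a representative shape.
def bench(fn, label, x, w):
    t = benchmark.Timer(
        stmt="fn(x, w)",
        globals={"fn": fn, "x": x, "w": w},
        label=label,
    )
    print(t.timeit(100))

x = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
w = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
bench(torch.matmul, "cublas fp16 baseline", x, w)
# bench(triton_matmul, "triton dequant-matmul", x, w)       # placeholder
# bench(compiled_matmul, "torch.compile dequant-matmul", x, w)  # placeholder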
@jeromeku Super great work! Are you testing it on a Tesla T4 or an Ampere-based GPU? I found Triton kernels on older GPUs to be noticeably slower.
Also I found through experimentation that instead of writing 1 full fused kernel for the matrix mult and dequantization, it's better to split it into 2. The dequant step should only take 1-2ms, whilst the matrix mult takes 30ms or so. The compiler can be "confused" on the dequant steps, causing it to not optimize correctly, so I found using torch.matmul for the matmul itself to be most effective.
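Roughly, the split looks like this (dequantize is a placeholder for whichever kernel is in play - bnb, GPTQ, etc.; the LoRA shapes follow the usual A: [r, in], B: [out, r] convention):
import torch

# Two-step approach: dequantize the frozen base weight first (cheap), then
# leave the heavy matmul to cuBLAS via torch.matmul.
def lora_linear(X, quant_weight, quant_state, A, B, scaling, dequantize):
    W = dequantize(quant_weight, quant_state)       # -> fp16/bf16, shape [out, in]
    out = torch.matmul(X, W.t())                    # base path
    out += torch.matmul(torch.matmul(X, A.t()), B.t()) * scaling  # LoRA path
    return out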
@danielhanchen I've been testing on an Ampere-based GPU (A6000).
- Going to do some additional profiling to determine bottlenecks vs. vanilla
HF
implementation and theunsloth
bnb
version. - Additional optimizations after above analysis.
- Will post a
draft
PR to make collab easier.
@jeromeku Oh ok cool! If I have to guess, it's that NVCC / the Triton compiler is not optimizing "properly" - also, did you use the matmul Triton autotuner? It could be that, maybe?