Saving and loading quantized models doesn't work?

Open tanishqkumar opened this issue 10 months ago • 17 comments

I'm interested in profiling how well various architectures perform after quantization to various WxAx configurations, and I'm using lm-eval to do so. lm-eval needs a path to a saved model, but it seems that if one calls AutoModelForCausalLM.from_pretrained(path) after calling model.save_pretrained(path) on a model that was quantized with quanto, the quantized layers do not persist. This is problematic for lm-eval, which reads saved models from a given path: when it loads a model that was quantized and then saved, from_pretrained returns a regular unquantized model. Is there any way around this, or a plan to save the quantization information in the save_pretrained and from_pretrained methods?

tanishqkumar avatar Mar 26 '24 20:03 tanishqkumar

Did you quantize your model using transformers or quanto? The transformers integration saves the quantization config during serialization. If you quantized your model using quanto, the quantized weights are serialized, but for now only an already-quantized model can reload them. As a workaround, you can quantize the new model first with dummy parameters (quantize(model)) before loading the serialized one, as sketched below.
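
A minimal sketch of that workaround (the model id, file path, and qint8 setting below are placeholders for illustration, not taken from this issue):

from transformers import AutoModelForCausalLM
from quanto import freeze, qint8, quantize, safe_load, safe_save

# 1. Quantize, freeze and serialize the original model.
model = AutoModelForCausalLM.from_pretrained("some/model-id")   # placeholder id
quantize(model, weights=qint8)
freeze(model)
safe_save(model.state_dict(), "quantized.safetensors")          # placeholder path

# 2. Later: recreate the model, quantize it with dummy parameters,
#    then load the serialized quantized state dict over it.
model_q = AutoModelForCausalLM.from_pretrained("some/model-id")
quantize(model_q)
model_q.load_state_dict(safe_load("quantized.safetensors"))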

dacorvo avatar Mar 27 '24 15:03 dacorvo

@tanishqkumar did you freeze() after you quantized? I could only save the quantized layers after I froze

lsb avatar Mar 28 '24 16:03 lsb

To complete @dacorvo's answer: we plan to add support for this via save_pretrained and from_pretrained in the transformers integration. For now, the only way to save the model is by using quanto directly.

SunMarc avatar Apr 02 '24 13:04 SunMarc

What's the right way to save/load models after quantizing? Is there an example we can refer to?

pratyushpal avatar Apr 04 '24 13:04 pratyushpal

So first, do not forget to freeze your model to statically convert your weights. Then, you can look at: https://github.com/huggingface/quanto/blob/main/examples/vision/image-classification/mnist/quantize_mnist_model.py Or in the tests: https://github.com/huggingface/quanto/blob/b9ee78335a6f0f90363da5909b5b749a1beaa4ce/test/model/test_quantize_mlp.py#L107

dacorvo avatar Apr 04 '24 13:04 dacorvo

Thank you for your quick response! I'm working with text generation and following 'quantize_causal_lm_model.py' from the examples, quantizing only the weights. There isn't an example of how to save the model there. I'm guessing the right way to save/load is using safe_save and safe_load.

I have something like:

model = AutoModelForCausalLM.from_pretrained(model_path)
quantize(model, weights=weights)
freeze(model)
safe_save(model.state_dict(), state_dict_path)

# loading the state dict:
model_q = AutoModelForCausalLM.from_pretrained(model_path) 
model_q.load_state_dict(safe_load(state_dict_path)) # torch.load gives an error here 

What's the right way of loading the state dict in this situation?

pratyushpal avatar Apr 04 '24 14:04 pratyushpal

You need to quantize model_q, because otherwise the model does not know how to deal with quantized weights.

model = AutoModelForCausalLM.from_pretrained(model_path)
quantize(model, weights=weights)
freeze(model)
safe_save(model.state_dict(), state_dict_path)

# loading the state dict:
model_q = AutoModelForCausalLM.from_pretrained(model_path) 
quantize(model_q) # parameters are unimportant because they will be overridden by what's in the state_dict
model_q.load_state_dict(safe_load(state_dict_path))

dacorvo avatar Apr 04 '24 14:04 dacorvo

Hello, like pratyushpal, I also found the quantize_causal_lm_model.py example and then attempted save_pretrained().

I see the "help wanted" tag; what kind of help is needed? Maybe I can chip in.

calmitchell617 avatar Apr 11 '24 10:04 calmitchell617

Sorry, I planned to use the label to identify issues that are actually support requests, but "help wanted" is also a call for contributions, so that was not such a great idea.

dacorvo avatar Apr 11 '24 10:04 dacorvo

Ok, no problem.

For anyone else who comes here looking for an example, here is one that is working for me. @dacorvo, does it look OK to you?

Also, a follow-up question: with this methodology it takes quite a long time to load the quantized model for inference. Would that be fixed by the upcoming save_pretrained() integration?

To quantize and save a model:

from transformers import AutoModelForCausalLM
from quanto import freeze, qint8, quantize, safe_save
from pathlib import Path
from time import time

model_id = 'codellama/CodeLlama-7b-Instruct-hf'
out_path = 'out/quantized'
overall_start = time()

# remove any existing file or (empty) directory at out_path
p = Path(out_path)
if p.is_dir():
    p.rmdir()
elif p.is_file():
    p.unlink()

start = time()
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    low_cpu_mem_usage=True,
)
print(f'Finished loading model, time taken: {time() - start:.2f} seconds')

print('Quantizing model')
start = time()
quantize(model, weights=qint8)
print(f'Finished quantizing model, time taken: {time() - start:.2f} seconds')

print('Freezing model')
start = time()
freeze(model)
print(f'Finished freezing model, time taken: {time() - start:.2f} seconds')

print('Saving model')
start = time()
safe_save(model.state_dict(), out_path)
print(f'Finished saving model, time taken: {time() - start:.2f} seconds')

print(f'Total time taken: {time() - overall_start:.2f} seconds')

Output on my computer:

Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 13.03it/s]
Finished loading model, time taken: 1.22 seconds
Quantizing model
Finished quantizing model, time taken: 32.18 seconds
Freezing model
Finished freezing model, time taken: 6.93 seconds
Saving model
Finished saving model, time taken: 8.56 seconds
Total time taken: 49.80 seconds

To load and run inference on a quantized model:

from transformers import AutoTokenizer, AutoModelForCausalLM
from time import time
from quanto import quantize, safe_load

model_id = 'codellama/CodeLlama-7b-Instruct-hf'
model_location = "out/quantized"
overall_start = time()

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token_id = tokenizer.eos_token_id

start = time()
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", low_cpu_mem_usage=True)
print(f'Finished loading model, time taken: {time() - start:.2f} seconds')

print('Quantizing model')
start = time()
quantize(model)
print(f'Finished quantizing model, time taken: {time() - start:.2f} seconds')

print('Loading state dict')
start = time()
model.load_state_dict(safe_load(model_location))
print(f'Finished loading state dict, time taken: {time() - start:.2f} seconds')

print('Moving model to cuda and setting to eval mode')
start = time()
model.to("cuda")
model.eval()
print(f'Finished moving model to cuda and setting to eval mode, time taken: {time() - start:.2f} seconds')

print(f'Total time taken for model loading and quantization: {time() - overall_start:.2f} seconds')

messages = [
    {"role": "system", "content": "You are a chatbot."},
    {"role": "user", "content": "What does it take to build a great LLM?"},
]
tokenized = tokenizer.apply_chat_template(
    messages,
    return_dict=True,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    padding=True,
)
# tokenized = tokenizer(input_text, return_tensors="pt", padding=True)
input_ids = tokenized.input_ids.to("cuda")
attention_mask = tokenized.attention_mask.to("cuda")

outputs = model.generate(input_ids, attention_mask=attention_mask, max_new_tokens=256)

print(tokenizer.decode(outputs[0]))

Output on my computer:

Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 11.83it/s]
Finished loading model, time taken: 1.20 seconds
Quantizing model
Finished quantizing model, time taken: 31.91 seconds
Loading state dict
Finished loading state dict, time taken: 6.23 seconds
Moving model to cuda and setting to eval mode
Finished moving model to cuda and setting to eval mode, time taken: 2.12 seconds
Total time taken for model loading and quantization: 41.69 seconds
Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
<s> [INST] <<SYS>>
You are a chatbot.
<</SYS>>

What does it take to build a great LLM? [/INST]  Building a great LLM (Master of Laws) program requires a combination of academic rigor, practical experience, and a commitment to excellence in teaching and learning. Here are some key factors to consider:

1. Academic rigor: The LLM program should be rigorous and challenging, with a focus on advanced legal research and analysis. The curriculum should include a range of courses that cover relevant legal topics, including contracts, torts, intellectual property, and international law.
2. Practical experience: LLM students should have the opportunity to gain practical experience in their chosen area of law. This can be achieved through internships, clinics, or other hands-on learning experiences.
3. Teaching and learning: The LLM program should be designed to promote effective teaching and learning. This includes the use of innovative teaching methods, such as flipped classrooms and online learning, and the incorporation of technology to enhance student engagement and interaction.
4. Collaboration and networking: The LLM program should foster collaboration and networking among students, faculty, and alumni. This can be achieved through joint research projects, seminars, and other events that bring students

calmitchell617 avatar Apr 11 '24 10:04 calmitchell617

@calmitchell617 that's correct, thank you very much for this contribution. You may be able to reduce the model loading time by using the meta device. I admit it is a bit convoluted at the moment, but you can try what is done in this test: https://github.com/huggingface/quanto/blob/b9ee78335a6f0f90363da5909b5b749a1beaa4ce/test/model/test_quantize_mlp.py#L139
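
For reference, here is a rough, untested sketch of that meta-device idea, loosely based on the linked test (the model id and output path are reused from the example above; load_state_dict(..., assign=True) needs a recent PyTorch, and the exact steps may differ from what the test does):

import torch
from transformers import AutoConfig, AutoModelForCausalLM
from quanto import quantize, safe_load

model_id = 'codellama/CodeLlama-7b-Instruct-hf'

# Build the model skeleton on the meta device: no real weights are allocated
# or read from disk, so this step should be nearly instantaneous.
config = AutoConfig.from_pretrained(model_id)
with torch.device("meta"):
    model = AutoModelForCausalLM.from_config(config)

# Insert the quantized modules; their parameters are only placeholders here.
quantize(model)

# Materialize the real quantized weights straight from the serialized state dict.
# assign=True replaces the meta tensors instead of copying into them.
model.load_state_dict(safe_load("out/quantized"), assign=True)
model.to("cuda")
model.eval()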

dacorvo avatar Apr 11 '24 11:04 dacorvo

I think a helper taking a model and a quantized state_dict as parameters and returning the quantized model might be a good idea.
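
Something along these lines, where the name load_quantized and its exact behaviour are only a guess at what such a helper could look like:

from quanto import quantize

def load_quantized(model, state_dict):
    # Hypothetical helper: insert quanto's quantized modules into `model`
    # (the initial parameters don't matter), then populate them from a
    # previously serialized quantized state dict.
    quantize(model)
    model.load_state_dict(state_dict)
    return model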

dacorvo avatar Apr 11 '24 11:04 dacorvo

I'm a relative beginner, but would be happy to try building that function.

calmitchell617 avatar Apr 11 '24 11:04 calmitchell617

OK, let me write an issue to explain a bit more what I expect.

dacorvo avatar Apr 11 '24 11:04 dacorvo

Here you go: https://github.com/huggingface/quanto/issues/162.

dacorvo avatar Apr 11 '24 11:04 dacorvo

Great, I'll give it a shot!

calmitchell617 avatar Apr 11 '24 14:04 calmitchell617

Maybe it'd help if, in the README section that says "When freezing a model, its float weights are replaced by quantized integer weights.", you either added "To preserve changes, you must freeze the model before running save_pretrained." or added a line to the code sample:

freeze(model)
model.save_pretrained(model_path)

Currently, the only line in the repo that uses save_pretrained is in the external folder, which doesn't use quanto or freeze.

mapmeld avatar Jun 01 '24 21:06 mapmeld

The recommended way to save a quanto model is through a state_dict that can later be reloaded using optimum.quanto.requantize.

dacorvo avatar Jul 02 '24 06:07 dacorvo

A paragraph could be added to the README, for instance showing how to use safetensors to serialize the state_dict, along the lines of the sketch below.
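
Roughly, such a paragraph might show something like the following (a sketch based on the optimum-quanto API around mid-2024; the exact requantize signature and the quantization_map helper should be checked against the current README):

import json
import torch
from safetensors.torch import load_file, save_file
from transformers import AutoConfig, AutoModelForCausalLM
from optimum.quanto import freeze, qint8, quantization_map, quantize, requantize

model_id = 'codellama/CodeLlama-7b-Instruct-hf'   # reusing the model from the example above

# Quantize and freeze as before.
model = AutoModelForCausalLM.from_pretrained(model_id)
quantize(model, weights=qint8)
freeze(model)

# Serialize the quantized weights plus a per-module quantization map.
save_file(model.state_dict(), 'model.safetensors')
with open('quantization_map.json', 'w') as f:
    json.dump(quantization_map(model), f)

# Reload: rebuild an empty model on the meta device, then requantize it
# from the two saved artifacts.
state_dict = load_file('model.safetensors')
with open('quantization_map.json') as f:
    qmap = json.load(f)
with torch.device('meta'):
    new_model = AutoModelForCausalLM.from_config(AutoConfig.from_pretrained(model_id))
requantize(new_model, state_dict, qmap, device=torch.device('cuda'))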

dacorvo avatar Jul 02 '24 06:07 dacorvo