Write a helper to reload a quantized state_dict
Quantized weights, scales and metadata can be saved into a state_dict that can later be reloaded and applied to a quantized model.
The process is a bit convoluted, as it requires the target model to be quantized first without any parameters (to make it quantization "aware").
The goal of this issue is to implement a helper in quantize.py
with the following signature:
def requantize(model: torch.nn.Module, state_dict: Dict[str, Union[torch.Tensor, str]]):
The helper will simply quantize the model and reload the state_dict, assigning the Tensors so that the correct dtypes are also applied.
The most important part of the issue is to write dedicated unit tests to check it works in every configuration.
Tests could for instance be added in a new test/model/test_requantize.py test file.
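A minimal sketch of what the helper might boil down to (assuming that load_state_dict with assign=True is enough to restore the quantized tensors, which is exactly what the tests should confirm):

import torch
from typing import Dict, Union
from quanto import quantize

def requantize(model: torch.nn.Module, state_dict: Dict[str, Union[torch.Tensor, str]]):
    # Make the target model quantization "aware" first (quantization happens in place)
    quantize(model)
    # Reload the serialized tensors, assigning them so the quantized dtypes are preserved
    model.load_state_dict(state_dict, assign=True)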
Hi @dacorvo, my initial approach was:
- Quantize the model using the quantize function.
- Iterate over the state_dict: if a value is a tensor, it is a weight, so I will use quantize_weight to quantize it; otherwise it is metadata. [How do I quantize the metadata?]
state_dict[name] = qz(weight) - here qz is quantize_weight()
state_dict[name] = qx(metadata) - here qx would be what? How do I quantize the metadata?
What are the scales in the state_dict? Are you referring to the bias?
What do you think of my approach?
The goal of the issue is NOT to create a quantized state_dict: this is already handled. The goal here is simply to wrap the operations done in sequence when reloading quantized weights. See for instance: https://github.com/huggingface/quanto/blob/b9ee78335a6f0f90363da5909b5b749a1beaa4ce/test/model/test_quantize_mlp.py#L139
If I am not wrong, this is what we need to do in the requantize helper:
model = quantize(model)
model.load_state_dict(state_dict, assign=True)
Except that it is:
quantize(model)
because quantization happens in place
Thanks, I will think about the unit tests!
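Something along these lines, maybe (a rough sketch only; the MLP module below is a stand-in for the one used in the existing tests, and the reload sequence is the one discussed above):

import torch
from quanto import quantize, freeze, qint4

class MLP(torch.nn.Module):
    def __init__(self, features: int):
        super().__init__()
        self.fc1 = torch.nn.Linear(features, features)
        self.fc2 = torch.nn.Linear(features, features)

    def forward(self, x):
        return self.fc2(torch.nn.functional.relu(self.fc1(x)))

def test_requantize_mlp():
    # Quantize, freeze and serialize a reference model
    model = MLP(32)
    quantize(model, weights=qint4)
    freeze(model)
    state_dict = model.state_dict()
    inputs = torch.rand(1, 32)
    expected = model(inputs)
    # Reload into a fresh instance: quantize first (in place), then assign the saved tensors
    other = MLP(32)
    quantize(other, weights=qint4)
    other.load_state_dict(state_dict, assign=True)
    torch.testing.assert_close(other(inputs), expected)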
Looks like Mano is already working on this issue. I will keep an eye on this issue, in case his solution isn't accepted for whatever reason.
Cal, if I am not wrong, the test function can be further extended.
The process is a bit convoluted, as it requires the target model to be quantized first without any parameters (to make it quantization "aware").
@calmitchell617, as David said, we need to quantize the model first without any parameters, but in my test case the model is quantized with its parameters.
Ok, I will make a contribution soon.
I think it is important to consider (and test) the use case of requantizing a large Hugging Face Transformers model. It was trivial to requantize the MLP class in the existing tests, but it was more difficult to do so with a model loaded via from_pretrained(). The Transformers model loaded and worked fine, but it took some tinkering to get it to load in a memory-efficient way.
Here are two minimal scripts I wrote to quantize, then requantize a Transformers model, while attempting to minimize loading time and memory usage:
Code to create a quantized state dict from a HF Transformers model
from transformers import AutoModelForCausalLM
from quanto import quantize, freeze, qint4, safe_save

# Load the full-precision model
model = AutoModelForCausalLM.from_pretrained(
    'codellama/CodeLlama-7b-Instruct-hf',
    torch_dtype='auto',
)

# Quantize the weights to int4, freeze them, and save the quantized state_dict
quantize(model, weights=qint4)
freeze(model)
safe_save(model.state_dict(), 'llama-7b.sd')
Code to requantize the model (for inference)
Note the usage of the meta, cpu, and cuda devices, along with the to_empty() method.
from transformers import AutoModelForCausalLM
from torch import device as torch_device
from quanto import quantize, safe_load
from torch.cuda import memory_allocated

meta = torch_device('meta')
cpu = torch_device('cpu')
gpu = torch_device('cuda:0')

# Instantiate the model structure on the meta device, so no weights are allocated yet
with meta:
    model = AutoModelForCausalLM.from_pretrained(
        'codellama/CodeLlama-7b-Instruct-hf',
        torch_dtype='auto',
    )

# Make the model quantization "aware" (in place), then materialize empty tensors on CPU
quantize(model)
model.to_empty(device=cpu)

# Reload the quantized weights and metadata, then move the quantized model to the GPU
state_dict = safe_load('llama-7b.sd')
model.load_state_dict(state_dict)
model.to(gpu)

print(f'cuda memory used in GB: {memory_allocated(gpu) / 1e9}')
These scripts load the Llama 7B model into ~4.17GB of VRAM, without any appreciable CPU RAM being used. This low memory usage is important, because that is frequently the reason people will be looking to use Quanto in the first place.
@dacorvo, would you please look at the second script above? If you think my methodology looks OK, I will turn it into a function and adapt it to your testing scheme.
@calmitchell617 yes that looks correct. Since the initial instantiation of the model happens outside of quanto (here using transformers), I am just wondering how you can enforce the whole sequence.
That's a valid concern. I will think about that when writing the function and test accordingly.
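For reference, the whole sequence from the second script above could be wrapped roughly like this (a sketch with a hypothetical helper name, not an existing API; it assumes the caller has already instantiated the model on the meta device):

import torch
from quanto import quantize, safe_load

def requantize_from_file(model: torch.nn.Module, path: str, device: torch.device) -> torch.nn.Module:
    # `model` is expected to have been instantiated on the meta device
    # (e.g. inside `with torch.device('meta'):`), so no weights are allocated yet
    quantize(model)
    # Materialize empty tensors on CPU, then reload the quantized weights and metadata
    model.to_empty(device=torch.device('cpu'))
    model.load_state_dict(safe_load(path))
    # Move the fully quantized model to the target device
    return model.to(device)

Whether such a helper can actually verify that the model really was created on the meta device is the open question raised above.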
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed because it has been stalled for 5 days with no activity.