Write a helper to reload a quantized state_dict
Quantized weights, scales and metadata can be saved into a state_dict that can later be reloaded and applied to a quantized model.
The process is a bit convoluted, as it requires the target model to be quantized first without any parameters (to make it quantization "aware").
The goal of this issue is to implement a helper in quantize.py
with the following signature:
def requantize(model: torch.nn.Module, state_dict: Dict[str, Union[torch.Tensor, str]]):
The helper will simply quantize the model and reload the state_dict, assigning the Tensors so that the correct dtypes are also applied.
The most important part of the issue is to write dedicated unit tests to check it works in every configuration.
Tests could for instance be added in a new test/model/test_requantize.py test file.
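A minimal sketch of what the helper might boil down to (assuming that load_state_dict with assign=True is enough to restore the quantized tensors, which is exactly what the tests should confirm):

import torch
from typing import Dict, Union
from quanto import quantize

def requantize(model: torch.nn.Module, state_dict: Dict[str, Union[torch.Tensor, str]]):
    # Make the target model quantization "aware" first (quantization happens in place)
    quantize(model)
    # Reload the serialized tensors, assigning them so the quantized dtypes are preserved
    model.load_state_dict(state_dict, assign=True)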
Hi @dacorvo, my initial approach was:
- Quantize the model using the quantize function.
- Iterate over the state_dict: if a value is a tensor, it is a weight, so I will use quantize_weight to quantize it; otherwise it is metadata. [How do I quantize the metadata?]
state_dict[name] = qz(weight) - here qz is quantize_weight()
state_dict[name] = qx(metadata) - here qx would be what? How do I quantize the metadata?
What are the scales in the state_dict? Are you referring to the bias?
What do you think of my approach?
The goal of the issue is NOT to create a quantized state_dict: this is already handled. The goal here is simply to wrap the operations done in sequence when reloading quantized weights. See for instance: https://github.com/huggingface/quanto/blob/b9ee78335a6f0f90363da5909b5b749a1beaa4ce/test/model/test_quantize_mlp.py#L139
If I am not wrong, this is what we need to do in the requantize helper:
model = quantize(model)
model.load_state_dict(state_dict, assign=True)
Except that it is:
quantize(model)
because quantization happens in place
Thanks, I will think about the unit tests!
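Something along these lines, maybe (a rough sketch only; the MLP module below is a stand-in for the one used in the existing tests, and the reload sequence is the one discussed above):

import torch
from quanto import quantize, freeze, qint4

class MLP(torch.nn.Module):
    def __init__(self, features: int):
        super().__init__()
        self.fc1 = torch.nn.Linear(features, features)
        self.fc2 = torch.nn.Linear(features, features)

    def forward(self, x):
        return self.fc2(torch.nn.functional.relu(self.fc1(x)))

def test_requantize_mlp():
    # Quantize, freeze and serialize a reference model
    model = MLP(32)
    quantize(model, weights=qint4)
    freeze(model)
    state_dict = model.state_dict()
    inputs = torch.rand(1, 32)
    expected = model(inputs)
    # Reload into a fresh instance: quantize first (in place), then assign the saved tensors
    other = MLP(32)
    quantize(other, weights=qint4)
    other.load_state_dict(state_dict, assign=True)
    torch.testing.assert_close(other(inputs), expected)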
Looks like Mano is already working on this issue. I will keep an eye on this issue, in case his solution isn't accepted for whatever reason.
Cal, if I am not wrong, the test function can be further extended.
The process is a bit convoluted, as it requires the target model to be quantized first without any parameters (to make it quantization "aware").
@calmitchell617, as David said, we need to quantize the model first without any parameters, but in my test case the model is quantized with its parameters.
Ok, I will make a contribution soon.
I think it is important to consider (and test) the use case of requantizing a large Hugging Face Transformers model. It was trivial to requantize the MLP class in the existing tests, but it was more difficult to do so with a model loaded via from_pretrained(). The Transformers model loaded and worked fine, but it took some tinkering to get it to load in a memory-efficient way.
Here are two minimal scripts I wrote to quantize, then requantize a Transformers model, while attempting to minimize loading time and memory usage:
Code to create a quantized state dict from a HF Transformers model
from transformers import AutoModelForCausalLM
from quanto import quantize, freeze, qint4, safe_save

# Load the full-precision model
model = AutoModelForCausalLM.from_pretrained(
    'codellama/CodeLlama-7b-Instruct-hf',
    torch_dtype='auto',
)

# Quantize the weights to int4, freeze them, and save the quantized state_dict
quantize(model, weights=qint4)
freeze(model)
safe_save(model.state_dict(), 'llama-7b.sd')
Code to requantize the model (for inference)
Note the usage of the meta, cpu, and cuda devices, along with the to_empty() method.
from transformers import AutoModelForCausalLM
from torch import device as torch_device
from quanto import quantize, safe_load
from torch.cuda import memory_allocated

meta = torch_device('meta')
cpu = torch_device('cpu')
gpu = torch_device('cuda:0')

# Instantiate the model structure on the meta device, so no weights are allocated yet
with meta:
    model = AutoModelForCausalLM.from_pretrained(
        'codellama/CodeLlama-7b-Instruct-hf',
        torch_dtype='auto',
    )

# Make the model quantization "aware" (in place), then materialize empty tensors on CPU
quantize(model)
model.to_empty(device=cpu)

# Reload the quantized weights and metadata, then move the quantized model to the GPU
state_dict = safe_load('llama-7b.sd')
model.load_state_dict(state_dict)
model.to(gpu)

print(f'cuda memory used in GB: {memory_allocated(gpu) / 1e9}')
These scripts load the Llama 7B model into ~4.17GB of VRAM, without any appreciable CPU RAM being used. This low memory usage is important, because that is frequently the reason people will be looking to use Quanto in the first place.
@dacorvo, would you please look at the second script above? If you think my methodology looks OK, I will turn it into a function and adapt it to your testing scheme.
@calmitchell617 yes that looks correct. Since the initial instantiation of the model happens outside of quanto (here using transformers), I am just wondering how you can enforce the whole sequence.
That's a valid concern. I will think about that when writing the function and test accordingly.
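For reference, the whole sequence from the second script above could be wrapped roughly like this (a sketch with a hypothetical helper name, not an existing API; it assumes the caller has already instantiated the model on the meta device):

import torch
from quanto import quantize, safe_load

def requantize_from_file(model: torch.nn.Module, path: str, device: torch.device) -> torch.nn.Module:
    # `model` is expected to have been instantiated on the meta device
    # (e.g. inside `with torch.device('meta'):`), so no weights are allocated yet
    quantize(model)
    # Materialize empty tensors on CPU, then reload the quantized weights and metadata
    model.to_empty(device=torch.device('cpu'))
    model.load_state_dict(safe_load(path))
    # Move the fully quantized model to the target device
    return model.to(device)

Whether such a helper can actually verify that the model really was created on the meta device is the open question raised above.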
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed because it has been stalled for 5 days with no activity.