HQQ serialization
Follow-up to https://github.com/huggingface/transformers/pull/32379
1/3
https://github.com/huggingface/transformers/pull/33141/commits/5cb7d81547908dea660f525be5f77d9065b6edeb
Removed the `check_old_param` hack.
The problem however is that `HQQLinear.state_dict` is huge, which makes loading extremely slow. So I added `run_expected_keys_check`, which skips those checks for HQQLinear params. I am not sure it's a clean way of doing it: if you just init a dummy HQQLinear you wouldn't get all the state_dict params anyway :thinking:, so if you disable that check it will complain that the parameters are not in the expected keys. Let me know if there's a better way of doing this.
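To make the workaround concrete, here is a rough sketch of the idea only (the helpers below are simplified stand-ins, not the actual transformers loading internals):

```python
from hqq.core.quantize import HQQLinear  # HQQLinear lives here in the hqq lib

def run_expected_keys_check(module) -> bool:
    """Sketch: only run the expected-keys check for non-HQQLinear modules."""
    return not isinstance(module, HQQLinear)

def is_param_expected(param_name: str, module, expected_keys: set) -> bool:
    # HQQLinear's state_dict carries many extra entries (W_q, scale, zero, meta, ...)
    # that a freshly initialized dummy HQQLinear would not expose, so we accept
    # its params without checking them against expected_keys.
    if not run_expected_keys_check(module):
        return True
    return param_name in expected_keys
```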
2/3: Multi-GPU loading
Loading on multi-GPU looks like it's working fine. There's an issue with the BitBlas backend that I just reported here. Forcing the input to use the same device was done on the hqq lib side.
3/3: state_dict on the same safetensor chunk
I ran tests with different models and it's working fine (gist; a minimal sketch of the round-trip is shown after the list):
```python
model_id = 'meta-llama/Meta-Llama-3-8B-Instruct'  # OK
model_id = 'meta-llama/Meta-Llama-3-70B'          # OK
model_id = "facebook/opt-125m"                    # OK
model_id = "meta-llama/Llama-2-13b-chat-hf"       # OK
model_id = "microsoft/Phi-3-mini-128k-instruct"   # OK
model_id = "google/gemma-2-9b-it"                 # OK
model_id = "google/gemma-2-2b"                    # OK
```
So I think for the moment we can leave it as is until someone reports an issue; I can't reproduce the problem anyway.
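For reference, a minimal round-trip along the lines of the gist looks like this (model id and quant settings are just examples, not the exact gist contents):

```python
import torch
from transformers import AutoModelForCausalLM, HqqConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # any of the models listed above

# Quantize on load
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=HqqConfig(nbits=4, group_size=64),
)

# Serialize the quantized model, then reload it from the saved checkpoint
model.save_pretrained("llama3-8b-hqq-4bit")
reloaded = AutoModelForCausalLM.from_pretrained(
    "llama3-8b-hqq-4bit",
    torch_dtype=torch.float16,
    device_map="cuda",
)
```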
Next steps:
- Revisit the comments above (@mobicham )
- Change/disable settings in `HqqConfig`, because saving/loading no longer supports quant scales/zeros or meta-data offloading. These also need to be deprecated on the hqq lib side, with a new pip release 0.2.0 (@mobicham)
@SunMarc
- Reverted back to `if isinstance(module, (torch.nn.Linear, HQQLinear)):` but we still need that `run_expected_keys_check`, otherwise it breaks
- Updated the default `HqqConfig` params since `quant_scale`, `quant_zero`, and `offload_meta` are now deprecated. Also done on the hqq-lib side. I also updated the tests, the docs, and made a new hqq lib pip release `0.2.0`
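For illustration, a config with the updated defaults now only needs the remaining knobs (the values here are arbitrary examples):

```python
from transformers import HqqConfig

# quant_scale, quant_zero and offload_meta are deprecated and no longer passed
quant_config = HqqConfig(nbits=4, group_size=64, axis=1)
```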
Regarding this: https://github.com/huggingface/transformers/pull/33141#discussion_r1734388659
The issue is that, to remove that additional check, we need to have all the HQQLinear state-dict keys for each layer in the list of expected keys. There are 19 keys per HQQLinear module. For a small model like Llama3-8B, that means 32*7*19 = 4256 keys to check against for each parameter, which is extremely slow.
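Back-of-the-envelope, using the numbers above (32 decoder layers, 7 linear projections per layer, 19 state-dict keys per HQQLinear):

```python
num_layers = 32           # decoder layers in Llama3-8B
linears_per_layer = 7     # q/k/v/o + gate/up/down projections
keys_per_hqqlinear = 19   # entries in an HQQLinear state_dict

candidate_keys = num_layers * linears_per_layer * keys_per_hqqlinear
print(candidate_keys)  # 4256 extra expected keys to scan for every loaded parameter
```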
There are TODOs to be done before merging:
- Check if adding a bias on architectures that don't support the bias by default breaks the hqq model loading.
- Trying to get rid of `run_expected_keys_check` by updating the `expected_keys`. This will require some modification on the hqq lib side as well to return the list of all the valid keys of the state dict. Then bump up the min hqq lib version in transformers.
There are TODOs to be done before merging:
- Check if adding a bias on architectures that don't support the bias by default breaks the hqq model loading.
✅ Checked, all good!
- Trying to get rid of `run_expected_keys_check` by updating the `expected_keys`. This will require some modification on the hqq lib side as well to return the list of all the valid keys of the state dict. Then bump up the min hqq lib version in transformers.
❌ Removed the `run_expected_keys_check` hack by extending `expected_keys` in the case where an HQQLinear state dict is loaded but a Linear is present instead.
https://github.com/huggingface/transformers/pull/33141/commits/7e019b3619a2ba6972e409ce39b009210c467252
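Roughly, the idea of that fix looks like the following (a simplified sketch with placeholder names and a placeholder key list, not the exact diff in the commit above): when a checkpoint key points at a plain `nn.Linear` that will be converted to HQQLinear, the expected keys for that module prefix are extended with the full HQQLinear state-dict key set.

```python
import torch.nn as nn

# Placeholder: hqq >= 0.2.0 is meant to expose the real list of valid
# HQQLinear state-dict keys; these names are illustrative only.
HQQ_STATE_DICT_KEYS = ["W_q", "scale", "zero", "bias", "meta"]

def extend_expected_keys(expected_keys: list, model: nn.Module) -> list:
    """Sketch: for every '<prefix>.weight' belonging to an nn.Linear that is about
    to be quantized, also expect the HQQLinear state-dict keys under that prefix."""
    extended = list(expected_keys)
    for key in expected_keys:
        prefix, _, leaf = key.rpartition(".")
        if leaf != "weight" or not prefix:
            continue
        try:
            module = model.get_submodule(prefix)
        except AttributeError:
            continue
        if isinstance(module, nn.Linear):
            extended.extend(f"{prefix}.{k}" for k in HQQ_STATE_DICT_KEYS)
    return extended
```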
There's an issue with the bias on some architectures; I am investigating.
Update: the second issue is resolved now ✅
You can test with this gist: https://gist.github.com/mobicham/701dd564c52590203ee09631425ad797
Nice! Could you just update the description of the PR a bit?
@ArthurZucker just a friendly reminder to review this PR when you have a moment. Let me know if you need any clarifications or if there’s anything I can help with. Thank you very much :pray:
Waiting for this!
Just out of curiosity, what is missing before this can be merged?
> Just out of curiosity, what is missing before this can be merged?

Waiting for @mobicham to check the latest review and give me the heads-up to merge! This should be done soon! Also it looks like there are some conflicts to fix.
Thanks for iterating @mobicham! Merging!
@mobicham minor documentation issue, but the transformers documentation page for quantization has a giant features matrix which still says serialization of HQQ models is not supported
https://huggingface.co/docs/transformers/main/quantization/overview
Would you like to open a PR to fix this @rohit-gupta ?
@rohit-gupta thanks for flagging!
Now `model.save_pretrained(save_path)` gives this:
```
Traceback (most recent call last):
  File "C:\Users\Admin\Desktop\Python\0.LLMs\hqq\hqq1b.py", line 35, in <module>
    model.save_pretrained(save_path)
  File "C:\Users\Admin\Desktop\Python\0.LLMs\hqq\venv\Lib\site-packages\transformers\modeling_utils.py", line 2932, in save_pretrained
    state_dict_split = split_torch_state_dict_into_shards(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Admin\Desktop\Python\0.LLMs\hqq\venv\Lib\site-packages\huggingface_hub\serialization\_torch.py", line 330, in split_torch_state_dict_into_shards
    return split_state_dict_into_shards_factory(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Admin\Desktop\Python\0.LLMs\hqq\venv\Lib\site-packages\huggingface_hub\serialization\_base.py", line 108, in split_state_dict_into_shards_factory
    storage_id = get_storage_id(tensor)
                 ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Admin\Desktop\Python\0.LLMs\hqq\venv\Lib\site-packages\huggingface_hub\serialization\_torch.py", line 382, in get_torch_storage_id
    if tensor.device.type == "meta":
       ^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'device'
```
@blap is this related to the latest transformers changes? Otherwise, which hqq version causes this?
> @blap is this related to the latest transformers changes? Otherwise, which hqq version causes this?

I think so. I didn't have this problem when hqq support was first released in transformers. hqq version: 0.2.3, transformers version: 4.47.0.dev0
> @blap is this related to the latest transformers changes? Otherwise, which hqq version causes this?
>
> I think so. I didn't have this problem when hqq support was first released in transformers. hqq version: 0.2.3, transformers version: 4.47.0.dev0

@SunMarc do you know what was changed by any chance?
Transformers version 4.48.0.dev0 still has this problem...
Can anyone from the HF team please track down this problem? What changed? Nothing much changed on the hqq lib side.
@SunMarc ?
Can you share your script @blap? I'll have a look asap!
> Can you share your script @blap? I'll have a look asap!
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

model_id = "mllmTeam/PhoneLM-1.5B"
repo = "PhoneLM-1.5B"
nbits = 4
group_size = None
axis = 0
save_path = repo + "-nbits" + str(nbits) + "-GS" + str(group_size) + "-Axis" + str(axis) + "-HQQ2"
cache_dir = repo + "-cache"
device = "cpu"
compute_dtype = torch.float16

# Quantize
quant_config = HqqConfig(nbits=nbits, group_size=group_size, axis=axis, quant_scale=False, quant_zero=False)

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=compute_dtype,
    cache_dir=cache_dir,
    device_map=device,
    quantization_config=quant_config,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir)

# Save
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
```
Error:
```
Traceback (most recent call last):
  File "C:\Users\Admin\Desktop\Python\0.LLMs\hqq\hqq1b.py", line 32, in <module>
    model.save_pretrained(save_path)
  File "C:\Users\Admin\Desktop\Python\0.LLMs\hqq\venv\Lib\site-packages\transformers\modeling_utils.py", line 2971, in save_pretrained
    state_dict_split = split_torch_state_dict_into_shards(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Admin\Desktop\Python\0.LLMs\hqq\venv\Lib\site-packages\huggingface_hub\serialization\_torch.py", line 369, in split_torch_state_dict_into_shards
    return split_state_dict_into_shards_factory(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Admin\Desktop\Python\0.LLMs\hqq\venv\Lib\site-packages\huggingface_hub\serialization\_base.py", line 108, in split_state_dict_into_shards_factory
    storage_id = get_storage_id(tensor)
                 ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Admin\Desktop\Python\0.LLMs\hqq\venv\Lib\site-packages\huggingface_hub\serialization\_torch.py", line 746, in get_torch_storage_id
    if tensor.device.type == "meta":
       ^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'device'
```
So... Any ideas how to save?
@blap why don't you use the latest release? It worked fine the last time I tried (last week).
> @blap why don't you use the latest release? It worked fine the last time I tried (last week).

Which version do you use?

Version 4.45.2 gives me this:
```
Traceback (most recent call last):
  File "C:\Users\Admin\Desktop\Python\0.LLMs\hqq\hqq1b.py", line 37, in <module>
    model.save_pretrained(save_path)
  File "C:\Users\Admin\Desktop\Python\0.LLMs\hqq\venv\Lib\site-packages\transformers\modeling_utils.py", line 2565, in save_pretrained
    raise ValueError(
ValueError: The model is quantized with QuantizationMethod.HQQ and is not serializable - check out the warnings from the logger on the traceback to understand the reason why the quantized model is not serializable.
```
@blap 4.47.0 works for sure