HQQ serialization
Follow-up to https://github.com/huggingface/transformers/pull/32379
1/3
https://github.com/huggingface/transformers/pull/33141/commits/5cb7d81547908dea660f525be5f77d9065b6edeb
Removed the `check_old_param` hack.
The problem however is that `HQQLinear.state_dict` is huge, which makes loading extremely slow. So I added `run_expected_keys_check`, which skips those checks for HQQLinear params. I am not sure it's a clean way of doing it: if you just init a dummy HQQLinear you wouldn't get all the state_dict params anyway :thinking:, so if you disable that check it will complain that the parameters are not in the expected keys. Let me know if there's a better way of doing this.
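To make the workaround concrete, here is a rough sketch of the idea only (the helpers below are simplified stand-ins, not the actual transformers loading internals):

```python
from hqq.core.quantize import HQQLinear  # HQQLinear lives here in the hqq lib

def run_expected_keys_check(module) -> bool:
    """Sketch: only run the expected-keys check for non-HQQLinear modules."""
    return not isinstance(module, HQQLinear)

def is_param_expected(param_name: str, module, expected_keys: set) -> bool:
    # HQQLinear's state_dict carries many extra entries (W_q, scale, zero, meta, ...)
    # that a freshly initialized dummy HQQLinear would not expose, so we accept
    # its params without checking them against expected_keys.
    if not run_expected_keys_check(module):
        return True
    return param_name in expected_keys
```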
2/3: Multi-GPU loading
Loading on multi-GPU looks like it's working fine. There's an issue with the BitBlas backend that I just reported here. Forcing the input to use the same device was done on the hqq lib side.
3/3: state_dict on the same safetensor chunk
I ran tests with different models and it's working fine (gist; a minimal sketch of the round-trip is shown after the list):
```python
model_id = 'meta-llama/Meta-Llama-3-8B-Instruct'  # OK
model_id = 'meta-llama/Meta-Llama-3-70B'          # OK
model_id = "facebook/opt-125m"                    # OK
model_id = "meta-llama/Llama-2-13b-chat-hf"       # OK
model_id = "microsoft/Phi-3-mini-128k-instruct"   # OK
model_id = "google/gemma-2-9b-it"                 # OK
model_id = "google/gemma-2-2b"                    # OK
```
So I think for the moment we can leave it as is until someone reports an issue; I can't reproduce the problem anyway.
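For reference, a minimal round-trip along the lines of the gist looks like this (model id and quant settings are just examples, not the exact gist contents):

```python
import torch
from transformers import AutoModelForCausalLM, HqqConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # any of the models listed above

# Quantize on load
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=HqqConfig(nbits=4, group_size=64),
)

# Serialize the quantized model, then reload it from the saved checkpoint
model.save_pretrained("llama3-8b-hqq-4bit")
reloaded = AutoModelForCausalLM.from_pretrained(
    "llama3-8b-hqq-4bit",
    torch_dtype=torch.float16,
    device_map="cuda",
)
```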
Next steps:
- Revisit the comments above (@mobicham )
- Change/disable settings in `HqqConfig`, because saving/loading no longer supports quant scales/zeros or meta-data offloading. These also need to be deprecated on the hqq lib side, with a new pip release 0.2.0 (@mobicham)
@SunMarc
- Reverted back to `if isinstance(module, (torch.nn.Linear, HQQLinear)):` but we still need that `run_expected_keys_check`, otherwise it breaks
- Updated the default `HqqConfig` params since `quant_scale`, `quant_zero`, and `offload_meta` are now deprecated. Also done on the hqq-lib side. I also updated the tests, the docs, and made a new hqq lib pip release `0.2.0`
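For illustration, a config with the updated defaults now only needs the remaining knobs (the values here are arbitrary examples):

```python
from transformers import HqqConfig

# quant_scale, quant_zero and offload_meta are deprecated and no longer passed
quant_config = HqqConfig(nbits=4, group_size=64, axis=1)
```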
Regarding this: https://github.com/huggingface/transformers/pull/33141#discussion_r1734388659
The issue is that, to remove that additional check, we need to have all the HQQLinear state-dict keys for each layer in the list of expected keys. There are 19 keys per HQQLinear module. For a small model like Llama3-8B, that means 32*7*19 = 4256 keys to check against for each parameter, which is extremely slow.
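Back-of-the-envelope, using the numbers above (32 decoder layers, 7 linear projections per layer, 19 state-dict keys per HQQLinear):

```python
num_layers = 32           # decoder layers in Llama3-8B
linears_per_layer = 7     # q/k/v/o + gate/up/down projections
keys_per_hqqlinear = 19   # entries in an HQQLinear state_dict

candidate_keys = num_layers * linears_per_layer * keys_per_hqqlinear
print(candidate_keys)  # 4256 extra expected keys to scan for every loaded parameter
```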
There are TODOs to be done before merging:
- Check if adding a bias on architectures that don't support the bias by default breaks the hqq model loading.
- Trying to get rid of `run_expected_keys_check` by updating the `expected_keys`. This will require some modification on the hqq lib side as well to return the list of all the valid keys of the state dict. Then bump up the min hqq lib version in transformers.
There are TODOs to be done before merging:
- Check if adding a bias on architectures that don't support the bias by default breaks the hqq model loading.
✅ Checked, all good!
- Trying to get rid of `run_expected_keys_check` by updating the `expected_keys`. This will require some modification on the hqq lib side as well to return the list of all the valid keys of the state dict. Then bump up the min hqq lib version in transformers.
❌ Removed the `run_expected_keys_check` hack by extending `expected_keys` in the case where an HQQLinear state dict is loaded but a Linear is present instead.
https://github.com/huggingface/transformers/pull/33141/commits/7e019b3619a2ba6972e409ce39b009210c467252
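Roughly, the idea of that fix looks like the following (a simplified sketch with placeholder names and a placeholder key list, not the exact diff in the commit above): when a checkpoint key points at a plain `nn.Linear` that will be converted to HQQLinear, the expected keys for that module prefix are extended with the full HQQLinear state-dict key set.

```python
import torch.nn as nn

# Placeholder: hqq >= 0.2.0 is meant to expose the real list of valid
# HQQLinear state-dict keys; these names are illustrative only.
HQQ_STATE_DICT_KEYS = ["W_q", "scale", "zero", "bias", "meta"]

def extend_expected_keys(expected_keys: list, model: nn.Module) -> list:
    """Sketch: for every '<prefix>.weight' belonging to an nn.Linear that is about
    to be quantized, also expect the HQQLinear state-dict keys under that prefix."""
    extended = list(expected_keys)
    for key in expected_keys:
        prefix, _, leaf = key.rpartition(".")
        if leaf != "weight" or not prefix:
            continue
        try:
            module = model.get_submodule(prefix)
        except AttributeError:
            continue
        if isinstance(module, nn.Linear):
            extended.extend(f"{prefix}.{k}" for k in HQQ_STATE_DICT_KEYS)
    return extended
```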
There's an issue with the bias on some architectures; I am investigating.
Update: the second issue is resolved now ✅
You can test with this gist: https://gist.github.com/mobicham/701dd564c52590203ee09631425ad797
Nice! Could you just update the description of the PR a bit?
@ArthurZucker just a friendly reminder to review this PR when you have a moment. Let me know if you need any clarifications or if there’s anything I can help with. Thank you very much :pray:
Waiting for this!
Just out of curiosity, what is missing before this can be merged?
> Just out of curiosity, what is missing before this can be merged?

Waiting for @mobicham to check the latest review and give me the heads-up to merge! This should be done soon! Also it looks like there are some conflicts to fix.
Thanks for iterating @mobicham! Merging!
@mobicham minor documentation issue, but the transformers documentation page for quantization has a giant features matrix which still says serialization of HQQ models is not supported
https://huggingface.co/docs/transformers/main/quantization/overview
Would you like to open a PR to fix this @rohit-gupta ?
@rohit-gupta thanks for flagging!
Now `model.save_pretrained(save_path)` gives this:
```
Traceback (most recent call last):
  File "C:\Users\Admin\Desktop\Python\0.LLMs\hqq\hqq1b.py", line 35, in <module>
    model.save_pretrained(save_path)
  File "C:\Users\Admin\Desktop\Python\0.LLMs\hqq\venv\Lib\site-packages\transformers\modeling_utils.py", line 2932, in save_pretrained
    state_dict_split = split_torch_state_dict_into_shards(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Admin\Desktop\Python\0.LLMs\hqq\venv\Lib\site-packages\huggingface_hub\serialization\_torch.py", line 330, in split_torch_state_dict_into_shards
    return split_state_dict_into_shards_factory(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Admin\Desktop\Python\0.LLMs\hqq\venv\Lib\site-packages\huggingface_hub\serialization\_base.py", line 108, in split_state_dict_into_shards_factory
    storage_id = get_storage_id(tensor)
                 ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Admin\Desktop\Python\0.LLMs\hqq\venv\Lib\site-packages\huggingface_hub\serialization\_torch.py", line 382, in get_torch_storage_id
    if tensor.device.type == "meta":
       ^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'device'
```
@blap is this related to the latest transformers changes? Otherwise, which hqq version causes this?
> @blap is this related to the latest transformers changes? Otherwise, which hqq version causes this?

I think so. I didn't have this problem when hqq support was first released in transformers. hqq version: 0.2.3, transformers version: 4.47.0.dev0
> @blap is this related to the latest transformers changes? Otherwise, which hqq version causes this?
>
> I think so. I didn't have this problem when hqq support was first released in transformers. hqq version: 0.2.3, transformers version: 4.47.0.dev0

@SunMarc do you know what was changed by any chance?
Transformers version 4.48.0.dev0 still has this problem...
Can anyone from the HF team please track down this problem? What changed? Nothing much changed on the hqq lib side.
@SunMarc ?
Can you share your script @blap? I'll have a look asap!
> Can you share your script @blap? I'll have a look asap!
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

model_id = "mllmTeam/PhoneLM-1.5B"
repo = "PhoneLM-1.5B"
nbits = 4
group_size = None
axis = 0
save_path = repo + "-nbits" + str(nbits) + "-GS" + str(group_size) + "-Axis" + str(axis) + "-HQQ2"
cache_dir = repo + "-cache"
device = "cpu"
compute_dtype = torch.float16

# Quantize
quant_config = HqqConfig(nbits=nbits, group_size=group_size, axis=axis, quant_scale=False, quant_zero=False)

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=compute_dtype,
    cache_dir=cache_dir,
    device_map=device,
    quantization_config=quant_config,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir)

# Save
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
```
Error:
```
Traceback (most recent call last):
  File "C:\Users\Admin\Desktop\Python\0.LLMs\hqq\hqq1b.py", line 32, in <module>
    model.save_pretrained(save_path)
  File "C:\Users\Admin\Desktop\Python\0.LLMs\hqq\venv\Lib\site-packages\transformers\modeling_utils.py", line 2971, in save_pretrained
    state_dict_split = split_torch_state_dict_into_shards(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Admin\Desktop\Python\0.LLMs\hqq\venv\Lib\site-packages\huggingface_hub\serialization\_torch.py", line 369, in split_torch_state_dict_into_shards
    return split_state_dict_into_shards_factory(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Admin\Desktop\Python\0.LLMs\hqq\venv\Lib\site-packages\huggingface_hub\serialization\_base.py", line 108, in split_state_dict_into_shards_factory
    storage_id = get_storage_id(tensor)
                 ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Admin\Desktop\Python\0.LLMs\hqq\venv\Lib\site-packages\huggingface_hub\serialization\_torch.py", line 746, in get_torch_storage_id
    if tensor.device.type == "meta":
       ^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'device'
```
So... Any ideas how to save?
@blap why don't you use the latest release? It worked fine the last time I tried (last week).
> @blap why don't you use the latest release? It worked fine the last time I tried (last week).

Which version do you use?

Version 4.45.2 gives me this:
```
Traceback (most recent call last):
  File "C:\Users\Admin\Desktop\Python\0.LLMs\hqq\hqq1b.py", line 37, in <module>
    model.save_pretrained(save_path)
  File "C:\Users\Admin\Desktop\Python\0.LLMs\hqq\venv\Lib\site-packages\transformers\modeling_utils.py", line 2565, in save_pretrained
    raise ValueError(
ValueError: The model is quantized with QuantizationMethod.HQQ and is not serializable - check out the warnings from the logger on the traceback to understand the reason why the quantized model is not serializable.
```
@blap 4.47.0 works for sure