Quantization with transformers of RWKV/v6-Finch-1B6-HF
I quantized RWKV/v6-Finch-1B6-HF with transformers, but I got this error when loading it:
Traceback (most recent call last):
  File "C:\Users\Admin\Desktop\Python\0.LLMs\hqq\hqq2b_RWKV_load.py", line 28, in <module>
    model = AutoModelForCausalLM.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Admin\Desktop\Python\0.LLMs\hqq\venv\Lib\site-packages\transformers\models\auto\auto_factory.py", line 559, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Admin\Desktop\Python\0.LLMs\hqq\venv\Lib\site-packages\transformers\modeling_utils.py", line 4255, in from_pretrained
    ) = cls._load_pretrained_model(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Admin\Desktop\Python\0.LLMs\hqq\venv\Lib\site-packages\transformers\modeling_utils.py", line 4828, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Admin\Desktop\Python\0.LLMs\hqq\venv\Lib\site-packages\transformers\modeling_utils.py", line 873, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "C:\Users\Admin\Desktop\Python\0.LLMs\hqq\venv\Lib\site-packages\accelerate\utils\modeling.py", line 286, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([1, 1, 2048]) in "time_decay" (which has shape torch.Size([32, 64])), this looks incorrect.
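A quick way to see where the shapes disagree is to list what the saved checkpoint actually contains (a diagnostic sketch; it assumes transformers wrote the quantized weights to model.safetensors under the save path used below, so adjust the path if your layout differs):
from safetensors import safe_open

# Hypothetical path: save_path from the quantization script + the default safetensors filename
ckpt_path = "v6-Finch-1B6-HF-HQQ/model.safetensors"
with safe_open(ckpt_path, framework="pt") as f:
    for name in f.keys():
        # Print the stored shape of the RWKV time-mixing parameters, e.g. "time_decay"
        if "time_decay" in name:
            print(name, tuple(f.get_tensor(name).shape))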
Quantization:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig
model_id = "RWKV/v6-Finch-1B6-HF"
repo = "v6-Finch-1B6-HF"
nbits = 4
group_size = 64
axis = 1
save_path = repo+"-HQQ"
cache_dir = repo+"-cache"
device = "cpu" # "cpu" cuda:0
compute_dtype = torch.float16
#Quantize
quant_config = HqqConfig(nbits=nbits, group_size=group_size, axis=axis)
#Load the model
print("model: "+str(model_id))
print("Quantize to: "+str(save_path))
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=compute_dtype,
    cache_dir=cache_dir,
    device_map=device,
    quantization_config=quant_config,
    low_cpu_mem_usage=True,
    trust_remote_code=True
)
#Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir, trust_remote_code=True)
# Save
print("saving...")
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
Load:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
def generate_prompt(instruction, input=""):
    instruction = instruction.strip().replace('\r\n','\n').replace('\n\n','\n')
    input = input.strip().replace('\r\n','\n').replace('\n\n','\n')
    if input:
        return f"""Instruction: {instruction}
Input: {input}
Response:"""
    else:
        return f"""User: hi
Assistant: Hi. I am your assistant and I will provide expert full response in full details. Please feel free to ask any question and I will always answer it.
User: {instruction}
Assistant:"""
model = AutoModelForCausalLM.from_pretrained("RWKV/v6-Finch-1B6-HF", trust_remote_code=True, torch_dtype=torch.float16).to(0)
tokenizer = AutoTokenizer.from_pretrained("RWKV/v6-Finch-1B6-HF", trust_remote_code=True)
text = "Write an essay about large language models."
prompt = generate_prompt(text)
inputs = tokenizer(prompt, return_tensors="pt").to(0)
attention_mask = inputs["attention_mask"]
output = model.generate(inputs["input_ids"], attention_mask=attention_mask, max_new_tokens=128, do_sample=True, temperature=1.0, top_p=0.3, top_k=0)
print(tokenizer.decode(output[0].tolist(), skip_special_tokens=True))
It seems this is more of a transformers issue: it's not an official transformers model (trust_remote_code=True), so it's difficult to make sure everything would work fine.
The model is actually very small and it takes a few seconds to quantize and load. Any reasons why you want to save the quantized version instead of just quantizing on-the-fly?
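For reference, quantizing on-the-fly is just the load step of your script without the save, roughly like this (a minimal sketch based on the code above; generation through RWKV's remote-code class isn't guaranteed to work):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

model_id = "RWKV/v6-Finch-1B6-HF"
quant_config = HqqConfig(nbits=4, group_size=64, axis=1)

# Quantize at load time; nothing is written to disk
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda:0",
    quantization_config=quant_config,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("Write an essay about large language models.", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))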
I would like to upload it just to help spread hqq. It's also a test for bigger models.
Cool! Yeah, unfortunately since RWKV doesn't have official support in transformers, there are no guarantees it's gonna work. There's probably a workaround with the hqq lib, but it's not gonna be safetensors.
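The hqq-lib route would look roughly like this (a sketch that assumes hqq's AutoHQQHFModel / BaseQuantizeConfig API and its own non-safetensors serialization; untested with RWKV's trust_remote_code model class):
import torch
from transformers import AutoModelForCausalLM
from hqq.models.hf.base import AutoHQQHFModel
from hqq.core.quantize import BaseQuantizeConfig

model_id = "RWKV/v6-Finch-1B6-HF"
save_dir = "v6-Finch-1B6-HF-HQQ-lib"  # hypothetical output directory

# Load the full-precision model through transformers, then quantize with the hqq lib
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, trust_remote_code=True)
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
AutoHQQHFModel.quantize_model(model, quant_config=quant_config, compute_dtype=torch.float16, device="cuda")

# Save and reload in hqq's own format (not safetensors)
AutoHQQHFModel.save_quantized(model, save_dir)
model = AutoHQQHFModel.from_quantized(save_dir, device="cuda")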
Hey, have you found a solution to this?