Tensor has shape torch.Size([448, 1024]) ... this looks incorrect.
Thank you for building this - very interested in trying it. When I try to export the model to a .bin, I get the following error. Is this something simple / a user error on my part?
(MacOS Ventura 13.5.1 w/ Conda Environment)
❯ python export.py l70b.act64.bin TheBloke/llama-2-70b-Guanaco-QLoRA-GPTQ gptq-4bit-64g-actorder_True
CUDA extension not installed.
Traceback (most recent call last):
File "/Users/timothypark/dev/llama2.rs/export.py", line 150, in <module>
load_and_export(model_name, revision, output_path)
File "/Users/timothypark/dev/llama2.rs/export.py", line 128, in load_and_export
model = AutoGPTQForCausalLM.from_quantized(model_name,
File "/opt/homebrew/anaconda3/envs/pytorch/lib/python3.10/site-packages/auto_gptq/modeling/auto.py", line 105, in from_quantized
return quant_func(
File "/opt/homebrew/anaconda3/envs/pytorch/lib/python3.10/site-packages/auto_gptq/modeling/_base.py", line 847, in from_quantized
accelerate.utils.modeling.load_checkpoint_in_model(
File "/opt/homebrew/anaconda3/envs/pytorch/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 1409, in load_checkpoint_in_model
load_offloaded_weights(model, state_dict_index, state_dict_folder)
File "/opt/homebrew/anaconda3/envs/pytorch/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 727, in load_offloaded_weights
set_module_tensor_to_device(model, param_name, "cpu", value=weight, fp16_statistics=fp16_statistics)
File "/opt/homebrew/anaconda3/envs/pytorch/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 281, in set_module_tensor_to_device
raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([448, 1024]) in "qzeros" (which has shape torch.Size([224, 1024])), this look incorrect.
I think this might be related to a recent change in TheBloke's repos (I've been hitting the same thing). Temporarily, you may need to use the main revision instead (which has settings like gptq-4bit-128g-actorder_False):
python export.py l70b.noact128.bin TheBloke/llama-2-70b-Guanaco-QLoRA-GPTQ main
though I'll see if I can contribute a patch soon.
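For context, the 448-vs-224 mismatch is exactly what a group-size change would produce. Assuming AutoGPTQ's layout, where qzeros has shape [in_features // group_size, out_features * bits // 32], the numbers line up with the 70B down_proj (28672 -> 8192) quantized at group size 64 while the config expects 128 (the specific layer is my inference, not confirmed by the traceback):

```python
# Illustrative arithmetic only; assumes AutoGPTQ's qzeros layout of
# [in_features // group_size, out_features * bits // 32].
in_features, out_features, bits = 28672, 8192, 4  # 70B down_proj dims (assumed)

cols = out_features * bits // 32
rows_g64 = in_features // 64    # checkpoint quantized with group_size=64
rows_g128 = in_features // 128  # model config expecting group_size=128

assert (rows_g64, cols) == (448, 1024)   # shape of the tensor being loaded
assert (rows_g128, cols) == (224, 1024)  # shape the module was built with
```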
Ah, you should be able to get the same specific model using the previous commit SHA:
python export.py l70b.act64.bin TheBloke/llama-2-70b-Guanaco-QLoRA-GPTQ 3b2759aac1962b01959765a7f2918b09feda2680
Does that work?
Thanks for the fast response!
It makes more progress with that for sure - but I still get this error:
torch.Size([8192])
Regular
torch.Size([8192])
Regular
torch.Size([8192])
Regular
torch.Size([8192])
Regular
torch.Size([8192])
Regular
torch.Size([8192])
Regular
torch.Size([8192])
Regular
torch.Size([8192])
Regular
torch.Size([8192])
Traceback (most recent call last):
File "/Users/timothypark/dev/llama2.rs/export.py", line 150, in <module>
load_and_export(model_name, revision, output_path)
File "/Users/timothypark/dev/llama2.rs/export.py", line 139, in load_and_export
export(model, output_path)
File "/Users/timothypark/dev/llama2.rs/export.py", line 102, in export
for i in range(p['n_layers']): serialize(model.layers[i].self_attn.q_proj)
File "/Users/timothypark/dev/llama2.rs/export.py", line 42, in serialize
w = k.weight
File "/opt/homebrew/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'QuantLinear' object has no attribute 'weight'
Somehow your model is using a QuantLinear rather than a GeneralQuantLinear. What does the output at the beginning (i.e. the printout of the model) look like?
I see something like this:
Exporting...
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(32000, 5120, padding_idx=0)
(layers): ModuleList(
(0-39): 40 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(rotary_emb): LlamaRotaryEmbedding()
(k_proj): GeneralQuantLinear(in_features=5120, out_features=5120, bias=True)
(o_proj): GeneralQuantLinear(in_features=5120, out_features=5120, bias=True)
(q_proj): GeneralQuantLinear(in_features=5120, out_features=5120, bias=True)
(v_proj): GeneralQuantLinear(in_features=5120, out_features=5120, bias=True)
)
(mlp): LlamaMLP(
(act_fn): SiLUActivation()
(down_proj): GeneralQuantLinear(in_features=13824, out_features=5120, bias=True)
(gate_proj): GeneralQuantLinear(in_features=5120, out_features=13824, bias=True)
(up_proj): GeneralQuantLinear(in_features=5120, out_features=13824, bias=True)
)
(input_layernorm): LlamaRMSNorm()
(post_attention_layernorm): LlamaRMSNorm()
)
)
(norm): LlamaRMSNorm()
)
(lm_head): Linear(in_features=5120, out_features=32000, bias=False)
)
It looks like this:
❯ python export.py l70b.act64.bin TheBloke/llama-2-70b-Guanaco-QLoRA-GPTQ 3b2759aac1962b01959765a7f2918b09feda2680
Downloading model.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 36.7G/36.7G [10:26<00:00, 58.5MB/s]
CUDA extension not installed.
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(32000, 8192, padding_idx=0)
(layers): ModuleList(
(0-79): 80 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(rotary_emb): LlamaRotaryEmbedding()
(k_proj): QuantLinear()
(o_proj): QuantLinear()
(q_proj): QuantLinear()
(v_proj): QuantLinear()
)
(mlp): LlamaMLP(
(act_fn): SiLUActivation()
(down_proj): QuantLinear()
(gate_proj): QuantLinear()
(up_proj): QuantLinear()
)
(input_layernorm): LlamaRMSNorm()
(post_attention_layernorm): LlamaRMSNorm()
)
)
(norm): LlamaRMSNorm()
)
(lm_head): Linear(in_features=8192, out_features=32000, bias=False)
)
writing tok_embeddings...
Regular
torch.Size([32000, 8192])
Regular
torch.Size([8192])
...
Give this version of the script a shot: https://gist.github.com/rachtsingh/17387f86a5d34cdcf495537610ef0b62
I'll try to get it merged in the next patch.
Just to check for others - do you have ExLlama installed? I think that error means your QuantLinears are auto_gptq.nn_modules.qlinear.qlinear_exllama.QuantLinear instead of the GeneralQuantLinear I see, so the classname check in export.py triggers. You can also just patch that one line of export.py if you'd like.
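For anyone who wants to patch export.py rather than swap models, a duck-typed lookup sidesteps the classname check entirely. This is a minimal sketch with stand-in classes (the real ones live in auto_gptq.nn_modules.qlinear; get_weight is a hypothetical helper, not part of export.py):

```python
class GeneralQuantLinear:
    """Stand-in for auto_gptq's wrapper class, which exposes .weight."""
    def __init__(self):
        self.weight = "wrapped-weight"

class QuantLinear:
    """Stand-in for e.g. qlinear_exllama.QuantLinear: packed .qweight only."""
    def __init__(self):
        self.qweight = "packed-int32-weight"

def get_weight(layer):
    # Prefer .weight, fall back to the packed .qweight, so either
    # quantized-linear flavor can be serialized.
    for attr in ("weight", "qweight"):
        if hasattr(layer, attr):
            return getattr(layer, attr)
    raise AttributeError(f"{type(layer).__name__} has no weight tensor")

assert get_weight(GeneralQuantLinear()) == "wrapped-weight"
assert get_weight(QuantLinear()) == "packed-int32-weight"
```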
Yes - that fixed it - thank you!
I don't have ExLlama installed, as far as I know...
That said, when I run the .bin generated with the new script, I get this panic:
$ target/release/llama2_rs -c l70b.act64.bin -t 0.0 -s 11 -p "The only thing"
thread 'main' panicked at src/main.rs:106:9:
assertion `left == right` failed
left: 38760382468
right: 41166864388
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
I've compiled with the following config, which I believe is correct per the README.md:
"--cfg", 'model_size="70B"',
"--cfg", 'quant="Q_4"',
"--cfg", 'group_size="64"']
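One guess at what that assertion means (an assumption; I haven't checked main.rs:106): the binary validates the checkpoint length against the size the compile-time config predicts, and a group-size disagreement between the exported .bin and the compiled group_size setting shifts that size, since smaller groups carry more per-group scales and zero-points. Rough illustrative arithmetic, not the actual file format:

```python
def quantized_bytes(n_weights, bits=4, group_size=64):
    """Very rough size model for a group-quantized weight blob (assumed layout)."""
    packed = n_weights * bits // 8                 # packed 4-bit weights
    scales = (n_weights // group_size) * 2         # one fp16 scale per group
    zeros = (n_weights // group_size) * bits // 8  # packed zero-point per group
    return packed + scales + zeros

# Smaller groups => more scale/zero overhead => a larger file, so a .bin
# exported at one group size will not match a binary compiled for another.
assert quantized_bytes(70_000_000_000, group_size=64) > \
       quantized_bytes(70_000_000_000, group_size=128)
```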
@timfpark I'm getting the same error after following instructions on this thread - did you end up getting it working?