
Tensor has shape torch.Size([448, 1024]) ... this looks incorrect.

Open timfpark opened this issue 2 years ago • 9 comments

Thank you for building this - very interested in trying it. In my hands, when I try to export the model to a .bin I get the following error - is this something simple / user error?

(MacOS Ventura 13.5.1 w/ Conda Environment)

❯ python export.py l70b.act64.bin TheBloke/llama-2-70b-Guanaco-QLoRA-GPTQ gptq-4bit-64g-actorder_True
CUDA extension not installed.
Traceback (most recent call last):
  File "/Users/timothypark/dev/llama2.rs/export.py", line 150, in <module>
    load_and_export(model_name, revision, output_path)
  File "/Users/timothypark/dev/llama2.rs/export.py", line 128, in load_and_export
    model = AutoGPTQForCausalLM.from_quantized(model_name,
  File "/opt/homebrew/anaconda3/envs/pytorch/lib/python3.10/site-packages/auto_gptq/modeling/auto.py", line 105, in from_quantized
    return quant_func(
  File "/opt/homebrew/anaconda3/envs/pytorch/lib/python3.10/site-packages/auto_gptq/modeling/_base.py", line 847, in from_quantized
    accelerate.utils.modeling.load_checkpoint_in_model(
  File "/opt/homebrew/anaconda3/envs/pytorch/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 1409, in load_checkpoint_in_model
    load_offloaded_weights(model, state_dict_index, state_dict_folder)
  File "/opt/homebrew/anaconda3/envs/pytorch/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 727, in load_offloaded_weights
    set_module_tensor_to_device(model, param_name, "cpu", value=weight, fp16_statistics=fp16_statistics)
  File "/opt/homebrew/anaconda3/envs/pytorch/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 281, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([448, 1024]) in "qzeros" (which has shape torch.Size([224, 1024])), this look incorrect.

timfpark avatar Aug 24 '23 16:08 timfpark

I think this might be related to a recent change in TheBloke's repos (I've been dealing with the same thing). As a temporary workaround you might need to use the main branch, which has settings like gptq-4bit-128g-actorder_False:

python export.py l70b.noact128.bin TheBloke/llama-2-70b-Guanaco-QLoRA-GPTQ main

though I'll see if I can contribute a patch soon.
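For what it's worth, the factor of two between the qzeros shapes in the error is exactly what a group-size mismatch would produce. A back-of-the-envelope sketch (this assumes the usual GPTQ packing, where qzeros has in_features // group_size rows, and is not code from this repo; 28672 is Llama-2-70B's MLP intermediate size, so the failing tensor is plausibly a down_proj):

```python
# Back-of-the-envelope check (assumed GPTQ packing, not code from this repo):
# qzeros is stored with in_features // group_size rows, so quantizing the
# same layer at a different group size changes the row count.
def qzeros_rows(in_features: int, group_size: int) -> int:
    return in_features // group_size

# Llama-2-70B's down_proj has in_features = 28672 (the MLP intermediate size):
print(qzeros_rows(28672, 64))   # 448, the shape of the tensor being loaded
print(qzeros_rows(28672, 128))  # 224, the shape the model expected
```

The 1024-column dimension is consistent too: 8192 output features packed at 4 bits into 32-bit ints gives 8192 * 4 / 32 = 1024.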

rachtsingh avatar Aug 24 '23 17:08 rachtsingh

Ah, you should be able to get the same specific model using the previous commit SHA:

python export.py l70b.act64.bin TheBloke/llama-2-70b-Guanaco-QLoRA-GPTQ 3b2759aac1962b01959765a7f2918b09feda2680

Does that work?

rachtsingh avatar Aug 24 '23 17:08 rachtsingh

Thanks for the fast response!

It makes more progress with that for sure - but I still get this error:

torch.Size([8192])
Regular
torch.Size([8192])
Regular
torch.Size([8192])
Regular
torch.Size([8192])
Regular
torch.Size([8192])
Regular
torch.Size([8192])
Regular
torch.Size([8192])
Regular
torch.Size([8192])
Regular
torch.Size([8192])
Traceback (most recent call last):
  File "/Users/timothypark/dev/llama2.rs/export.py", line 150, in <module>
    load_and_export(model_name, revision, output_path)
  File "/Users/timothypark/dev/llama2.rs/export.py", line 139, in load_and_export
    export(model, output_path)
  File "/Users/timothypark/dev/llama2.rs/export.py", line 102, in export
    for i in range(p['n_layers']): serialize(model.layers[i].self_attn.q_proj)
  File "/Users/timothypark/dev/llama2.rs/export.py", line 42, in serialize
    w = k.weight
  File "/opt/homebrew/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'QuantLinear' object has no attribute 'weight'

timfpark avatar Aug 24 '23 18:08 timfpark

Somehow your model is using a QuantLinear rather than a GeneralQuantLinear. What does the output at the beginning (i.e. the printout of the model) look like?

I see something like this:

Exporting...
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 5120, padding_idx=0)
    (layers): ModuleList(
      (0-39): 40 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (rotary_emb): LlamaRotaryEmbedding()
          (k_proj): GeneralQuantLinear(in_features=5120, out_features=5120, bias=True)
          (o_proj): GeneralQuantLinear(in_features=5120, out_features=5120, bias=True)
          (q_proj): GeneralQuantLinear(in_features=5120, out_features=5120, bias=True)
          (v_proj): GeneralQuantLinear(in_features=5120, out_features=5120, bias=True)
        )
        (mlp): LlamaMLP(
          (act_fn): SiLUActivation()
          (down_proj): GeneralQuantLinear(in_features=13824, out_features=5120, bias=True)
          (gate_proj): GeneralQuantLinear(in_features=5120, out_features=13824, bias=True)
          (up_proj): GeneralQuantLinear(in_features=5120, out_features=13824, bias=True)
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=5120, out_features=32000, bias=False)
)

rachtsingh avatar Aug 24 '23 19:08 rachtsingh

It looks like this:

❯ python export.py l70b.act64.bin TheBloke/llama-2-70b-Guanaco-QLoRA-GPTQ 3b2759aac1962b01959765a7f2918b09feda2680
Downloading model.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 36.7G/36.7G [10:26<00:00, 58.5MB/s]
CUDA extension not installed.
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 8192, padding_idx=0)
    (layers): ModuleList(
      (0-79): 80 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (rotary_emb): LlamaRotaryEmbedding()
          (k_proj): QuantLinear()
          (o_proj): QuantLinear()
          (q_proj): QuantLinear()
          (v_proj): QuantLinear()
        )
        (mlp): LlamaMLP(
          (act_fn): SiLUActivation()
          (down_proj): QuantLinear()
          (gate_proj): QuantLinear()
          (up_proj): QuantLinear()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=8192, out_features=32000, bias=False)
)
writing tok_embeddings...
Regular
torch.Size([32000, 8192])
Regular
torch.Size([8192])
...

timfpark avatar Aug 24 '23 19:08 timfpark

Give this version of the script a shot: https://gist.github.com/rachtsingh/17387f86a5d34cdcf495537610ef0b62

I'll try to get it merged in the next patch.

Just to check, for others' benefit - do you have ExLlama installed? I think that traceback line means your QuantLinears are auto_gptq.nn_modules.qlinear.qlinear_exllama.QuantLinear instead of the GeneralQuantLinear I see, so the classname check triggers. You can also just patch that one line of export.py if you'd like.

rachtsingh avatar Aug 24 '23 22:08 rachtsingh

Yes - that fixed it - thank you!

Do not have ExLlama installed as far as I know...

timfpark avatar Aug 25 '23 00:08 timfpark

That said, when I go to run the bin generated with the new script, I am getting this exception:

$ target/release/llama2_rs -c l70b.act64.bin -t 0.0 -s 11 -p "The only thing"
thread 'main' panicked at src/main.rs:106:9:
assertion `left == right` failed
  left: 38760382468
 right: 41166864388
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I've compiled with the following config, which I think matches the README.md:

"--cfg", 'model_size="70B"',
"--cfg", 'quant="Q_4"',
"--cfg", 'group_size="64"']
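One hedged guess about that panic (I haven't confirmed against main.rs): the assertion looks like a file-size-versus-expected-size check, and the two numbers differ by roughly the per-weight storage ratio between group sizes 64 and 128, which would fit a mismatch between the group size the bin was exported at and the group_size the binary was compiled for. A quick sanity check, assuming a 4-bit layout where each group carries a 32-bit scale and a packed 4-bit zero point:

```python
# Hypothetical diagnosis, not repo code: compare the ratio of the two sizes
# in the panic with the per-weight storage cost of 4-bit GPTQ at the two
# group sizes, assuming a 32-bit scale and a 4-bit zero point per group.
def bits_per_weight(group_size: int, bits: int = 4) -> float:
    return bits + (32 + bits) / group_size

observed = 41166864388 / 38760382468        # ratio of the numbers in the panic
predicted = bits_per_weight(64) / bits_per_weight(128)
print(round(observed, 3), round(predicted, 3))  # both about 1.06
```

Suggestive rather than conclusive, but it's worth re-exporting and double-checking that the revision you exported really is a 64g checkpoint before rebuilding.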

timfpark avatar Aug 25 '23 21:08 timfpark

@timfpark I'm getting the same error after following instructions on this thread - did you end up getting it working?

jacquayj avatar Nov 02 '23 18:11 jacquayj