30B model doesn't load

Open RCasatta opened this issue 2 years ago • 6 comments

Following the same steps works for the 7B and 13B models; with the 30B model I get

thread 'main' panicked at 'Could not load model: Tensor tok_embeddings.weight has the wrong size in model file', llama-rs/src/main.rs:39:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

RCasatta avatar Mar 15 '23 16:03 RCasatta

Yes, I am experiencing the same

rfurlan avatar Mar 15 '23 23:03 rfurlan

Does the 30B model work for you in llama.cpp?

philpax avatar Mar 16 '23 01:03 philpax

Yes, it works as expected in llama.cpp

rfurlan avatar Mar 16 '23 01:03 rfurlan

This could be a discrepancy in size due to integer promotion rules / a potential overflow, since the sizes for 30B are gonna be larger. More liberal use of usize (#18) would probably help here.

Need to see if I can repro this.
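A minimal sketch of the suspicion (not the actual llama-rs loader code; the tensor count is the one reported above, the "100 tensors" factor is made up purely for illustration): size arithmetic done in i32 can overflow at 30B scale, while the same computation in usize is fine on 64-bit targets.

```rust
fn main() {
    let n_vocab: i32 = 32_000;
    let n_embd: i32 = 6_656; // 30B

    // i32 is fine for the element count of a single tensor...
    let nelements = n_vocab * n_embd; // 212_992_000, still fits in i32

    // ...but accumulating total byte sizes for a whole 30B model overflows i32,
    // since the f16 weights alone are on the order of 60 GB.
    let total_i32 = nelements.checked_mul(2).and_then(|b| b.checked_mul(100));
    println!("i32 total (100 such tensors): {:?}", total_i32); // None: overflowed

    // The same arithmetic in usize is safe on a 64-bit target.
    let total_usize = nelements as usize * 2 * 100;
    println!("usize total: {}", total_usize);
}
```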

setzer22 avatar Mar 16 '23 07:03 setzer22

I have the same error; it happens at https://github.com/setzer22/llama-rs/blob/3ce15b1200a3419d31c2dbe44b4ebd569370409a/llama-rs/src/llama.rs#L446 with tensor.nelements() = 212992000, n_parts = 3, and nelements = 53248000.

llama.cpp computes the same values but with n_parts = 4, so it's not an i32 issue: https://github.com/setzer22/llama-rs/blob/3ce15b1200a3419d31c2dbe44b4ebd569370409a/llama-rs/src/llama.rs#L102 should say 4 instead of 3, matching llama.cpp.

Making this change makes the model work for me. (Aside: the load time for the 30B model is brutal; if loading can be parallelized, it's absolutely worth it.)
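For reference, a rough Rust sketch of the table and check involved (the function names `n_parts_for` and `check_split` are hypothetical; they just mirror llama.cpp's LLAMA_N_PARTS map and the split-size validation described above):

```rust
use std::collections::HashMap;

fn n_parts_for(n_embd: i32) -> i32 {
    // Mirrors llama.cpp's LLAMA_N_PARTS table: 6656 (30B) must map to 4, not 3.
    let table: HashMap<i32, i32> =
        HashMap::from([(4096, 1), (5120, 2), (6656, 4), (8192, 8)]);
    table[&n_embd]
}

fn check_split(total_nelements: usize, n_parts: usize, part_nelements: usize) -> bool {
    // Each part of a split tensor must hold exactly total / n_parts elements.
    total_nelements / n_parts == part_nelements
}

fn main() {
    let n_parts = n_parts_for(6656) as usize;
    // Values reported above for tok_embeddings.weight:
    assert!(check_split(212_992_000, n_parts, 53_248_000)); // passes with 4, fails with 3
    println!("ok");
}
```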

mwbryant avatar Mar 16 '23 16:03 mwbryant

oh lol, good spot, you're correct that it's 4 in llama.cpp: https://github.com/ggerganov/llama.cpp/blob/721311070e31464ac12bef9a4444093eb3eaebf7/main.cpp#L34

@setzer22 can you do a quick change to main to fix that?

philpax avatar Mar 16 '23 16:03 philpax

Pushed! Sorry about that :sweat_smile: It was just a typo on my end.

setzer22 avatar Mar 16 '23 19:03 setzer22

I was just able to load 30B with the changes on main, but I'll wait for others to confirm before closing the issue.

setzer22 avatar Mar 16 '23 21:03 setzer22

@setzer22 Working on my machine with main; the Alpaca fine-tuned model floating around also works with the project :smile:.

mwbryant avatar Mar 16 '23 22:03 mwbryant

I confirm I can now load the 30B model with main.

But it barely fits, even with 64 GB of RAM.

RCasatta avatar Mar 16 '23 23:03 RCasatta

@RCasatta you mean the f16 version? Yes, I wasn't able to load that one on my machine (32GB). But I'm able to load the quantized one just fine.
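Rough numbers, assuming ~32.5B parameters for the 30B model and ggml's q4_0 packing (as I understand it, 32 4-bit weights plus an f32 scale per 20-byte block, so ~0.625 bytes per weight), which is why the f16 weights need a 64 GB machine while the quantized file loads in 32 GB:

```rust
fn main() {
    let params: f64 = 32.5e9; // approximate parameter count of LLaMA 30B
    let gib = 1024f64 * 1024.0 * 1024.0;
    let f16_gib = params * 2.0 / gib;          // 2 bytes per weight
    let q4_gib = params * (20.0 / 32.0) / gib; // q4_0: ~20 bytes per 32 weights
    println!("f16:  ~{:.0} GiB", f16_gib);  // ~61 GiB, before any runtime overhead
    println!("q4_0: ~{:.0} GiB", q4_gib);   // ~19 GiB
}
```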

Anyway, closing since the issue is solved, but feel free to keep discussing 😄

setzer22 avatar Mar 16 '23 23:03 setzer22

> @RCasatta you mean the f16 version?

Yes, I meant the f16 version; I didn't know you could quantize 😅

RCasatta avatar Mar 17 '23 06:03 RCasatta