Question: what format do the models need to be in for merging?

Open vmajor opened this issue 2 years ago • 14 comments

I must have missed something, but for the life of me I cannot find any information about the format of the models that mergekit expects.

Are they pytorch or safetensors? Can it use GGUF models? I ask because I see syntax that resembles huggingface-cli referencing TheBloke.

vmajor avatar Nov 28 '23 09:11 vmajor

The models can be in either pytorch or safetensors format. Quantized models like GGUF or GPTQ will not work - it needs to be a full fp16 or fp32 model.

Hope this helps!

cg123 avatar Nov 29 '23 02:11 cg123
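
For reference, here is a minimal sketch of how one might sanity-check that a local checkpoint directory holds full-precision pytorch/safetensors weights rather than a GGUF or GPTQ export, in line with the answer above. The directory path and the `looks_mergeable` helper are illustrative, not part of mergekit.

```python
# Sketch: check that a local HF-style checkpoint directory contains full-precision
# weights (safetensors or pytorch .bin shards) and is not a GGUF or GPTQ export.
# The helper name and the example path are illustrative, not part of mergekit.
import json
from pathlib import Path

def looks_mergeable(model_dir: str) -> bool:
    d = Path(model_dir)
    has_weights = any(d.glob("*.safetensors")) or any(d.glob("pytorch_model*.bin"))
    has_gguf = any(d.glob("*.gguf"))
    config_path = d / "config.json"
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    # Transformers-integrated GPTQ checkpoints typically record a quantization_config.
    is_quantized = "quantization_config" in config
    return has_weights and not has_gguf and not is_quantized

print(looks_mergeable("./neural-chat-7b-v3-1"))  # expected: True for a full fp16 model
```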

Yes, thank you... I just forged bravely ahead and did it. Merging Orca 2 fails, though: mergekit asks me to allow it to go ahead with --allow-crimes (lol), but the legacy script rejects this flag as unknown, and I have no idea how to move into a life of crime when using the new version inside Python.

vmajor avatar Nov 29 '23 05:11 vmajor

Huh! What models are you trying to merge? That generally doesn't come up.

I can add --allow-crimes to mergekit-legacy in case that's what you actually want to do, though. : )

cg123 avatar Nov 29 '23 23:11 cg123

Lol, that shows up when I try to merge neural-chat-7b-v3-1 with Orca-2-7b.

Both are Llama 2, so I am curious why this error would show up at all. I was expecting errors when merging a PyTorch model with a safetensors model, but no, that works.

I'd like to try it, yes.

vmajor avatar Nov 30 '23 00:11 vmajor

Ah, that's what's going on - Orca-2-7b is Llama 2, but neural-chat-7b-v3-1 is actually Mistral-based.

I've added the option in commit b7134dc0563a25a48d2499d1c4e75be198cd47d3. Unfortunately this particular merge almost certainly still won't work. Mistral uses GQA and has generally different dimensions from Llama-2 7b.

cg123 avatar Nov 30 '23 00:11 cg123
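
For anyone wanting the "life of crime" from Python rather than the CLI, a hedged sketch along these lines may help. MergeConfiguration, MergeOptions, and run_merge are the Python entry points in current mergekit, but exact names and fields may differ between versions; the model names and output path below are placeholders, and, as noted above, a Llama/Mistral pairing will likely still fail even with the flag set.

```python
# Hedged sketch of passing allow_crimes when driving mergekit from Python.
# MergeConfiguration / MergeOptions / run_merge are the current Python entry points;
# exact names and fields may differ across mergekit versions. Model names and the
# output path are placeholders.
import yaml
from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge

CONFIG_YAML = """
merge_method: linear
models:
  - model: org/model-a   # placeholder
    parameters:
      weight: 0.5
  - model: org/model-b   # placeholder
    parameters:
      weight: 0.5
dtype: float16
"""

config = MergeConfiguration.model_validate(yaml.safe_load(CONFIG_YAML))
run_merge(
    config,
    "./merged-model",
    options=MergeOptions(allow_crimes=True, copy_tokenizer=True),
)
```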

Wow, how did I miss that. Well, you should know that I successfully merged neural-chat-7b-v3-1 (PyTorch) with Starling-LM-7B-alpha (safetensors), and that worked out well - now I see that both are Mistral-based, so the file format of each model does not matter.

Fascinating stuff...

vmajor avatar Nov 30 '23 01:11 vmajor

@vmajor Amazing. Could you please let me know what kind of merging method you used? Is it passthrough?

shamanez avatar Dec 11 '23 03:12 shamanez

I used mergekit-legacy without any special arguments beyond weight and density. The size of the merged model is the sum of the sizes of the input models, so I believe it defaults to passthrough.

I also merged Orca 13B with itself and got a really interesting boost in reasoning ability: https://huggingface.co/vmajor/Orca2-13B-selfmerge-26B

vmajor avatar Dec 11 '23 05:12 vmajor
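
The linked self-merge was produced with mergekit-legacy, but for reference, a stacked passthrough self-merge expressed in mergekit's YAML config format might look roughly like the sketch below. The layer ranges are illustrative and not taken from the actual model card.

```python
# Hedged sketch: a passthrough self-merge that stacks Orca-2-13b on top of itself.
# Layer ranges are illustrative (Llama-2 13B has 40 decoder layers); the actual
# 26B model above was built with mergekit-legacy, not this exact config.
SELF_MERGE_YAML = """
merge_method: passthrough
slices:
  - sources:
      - model: microsoft/Orca-2-13b
        layer_range: [0, 40]
  - sources:
      - model: microsoft/Orca-2-13b
        layer_range: [0, 40]
dtype: float16
"""

with open("orca2-selfmerge.yml", "w", encoding="utf-8") as fp:
    fp.write(SELF_MERGE_YAML)
# Then, for example: mergekit-yaml orca2-selfmerge.yml ./Orca2-13B-selfmerge-26B
```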

Wow, that's impressive. So basically you stack two models on top of each other, right?

shamanez avatar Dec 11 '23 06:12 shamanez

Yes. I wanted to see whether model performance increases because the model is "acquiring new knowledge", or whether it is (also) something more basic - simply increasing the number of layers. The conventional wisdom is that merging models adds new information.

In my Orca experiment, knowledge did not improve, but reasoning more than doubled, if the GSM8K benchmark is to be trusted.

vmajor avatar Dec 11 '23 06:12 vmajor

Inspiring, mate! I am also exploring knowledge improvements. Btw, did you fine-tune the entire architecture after merging?

shamanez avatar Dec 11 '23 06:12 shamanez

No. I did not perform any additional work on the merged model.

vmajor avatar Dec 11 '23 07:12 vmajor

Fascinating.

shamanez avatar Dec 11 '23 09:12 shamanez

Hi, it seems that merging works well as long as the two models share the same architecture, no matter whether it is Llama or Mistral? @vmajor

LCorleone avatar Dec 13 '23 03:12 LCorleone