
WIP: Add model `merge` example

Open ngxson opened this issue 5 months ago • 51 comments

I don't know if it's a good idea or not.

Still WIP, not tested; it would be nice if someone could test it out.

usage: ./merge ./path/model_1 CONFIG1 ./path/model_2 CONFIG2 ./path/output

  CONFIG must be in format: p0-p1,p2-p3,p4,... Example: 0-5,7,8-12
  Optionally, you can specify the scaling for a range of layers, for example: 0-5*0.5,6-7*1. By default, the scale is 0.5. Layer numbers start counting from 0.
  The embedding layer of the first model will be used
  NOTE: currently, only F16 model type is supported
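
To make the notation concrete, here is a rough Python sketch of how such a CONFIG string could be expanded into per-layer scales (illustration only; the actual parser in this PR may differ):

def parse_config(config: str, default_scale: float = 0.5) -> dict:
    # Expand e.g. "0-5*0.5,7,8-12" into a {layer: scale} mapping.
    layer_scales = {}
    for part in config.split(","):
        layer_range, _, scale = part.partition("*")
        scale = float(scale) if scale else default_scale
        start, _, end = layer_range.partition("-")
        end = end or start  # a single layer like "7"
        for layer in range(int(start), int(end) + 1):
            layer_scales[layer] = scale
    return layer_scales

print(parse_config("0-5*0.5,6-7*1"))  # {0: 0.5, ..., 5: 0.5, 6: 1.0, 7: 1.0}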

ngxson avatar Feb 26 '24 22:02 ngxson

https://github.com/ggerganov/llama.cpp/issues/4718#issuecomment-1873855226

For this PR, I think that in addition to merging two models, it should also add a feature to evaluate a single layer multiple times. Just reconfigure the same gguf.

sorasoras avatar Feb 27 '24 11:02 sorasoras

@sorasoras Yeah, I think I'll try that next. For the moment, I haven't been able to test this PR yet. Also, I plan to start by simply processing layer-by-layer; that way I don't modify any offsets (and thus no changes to metadata).

The function that you mentioned requires changing metadata, which I haven't had time to look into yet. But it's definitely something I'll try in the future.

ngxson avatar Feb 27 '24 20:02 ngxson

@sorasoras Yeah, I think I'll try that next. For the moment, I haven't been able to test this PR yet. Also, I plan to start by simply processing layer-by-layer; that way I don't modify any offsets (and thus no changes to metadata).

The function that you mentioned requires changing metadata, which I haven't had time to look into yet. But it's definitely something I'll try in the future.

That's fair, but I was thinking changing metadata is easier to implement and test on existing models. It's harder to know what works or not when frankenmerging different models. Anyway, thanks for the hard work.

sorasoras avatar Feb 29 '24 08:02 sorasoras

I would be interested in layer interleaving. Is this only for merging layers' weights linearly, or can it do passthrough?

Also, this line is not entirely clear: CONFIG must be in format: p0-p1,p2-p3,p4,... Example: 0-5,7,8-12. It looks sequential, and only one config is given, so it's not clear what the second model's config should look like. If one model has 0-5,7,8-12, what should the config of the other model be? The gaps?

Most frankenmerges for passthrough are done like so:

dtype: float16
merge_method: passthrough
slices:
- sources:
  - layer_range: [0, 20]
    model: 152334H/miqu-1-70b-sf
- sources:
  - layer_range: [10, 30]
    model: 152334H/miqu-1-70b-sf
- sources:
  - layer_range: [20, 40]
    model: 152334H/miqu-1-70b-sf
...

Can this kind of repeat of blocks be done with this code?

dnhkng avatar Feb 29 '24 09:02 dnhkng

@dnhkng Yeah, in fact I have a typo in 0-5,7,8-12; it should be 0-6,7,8-12

This PR only aims to merge the weights linearly, meaning it does not add or remove any layers in the merged model.

One thing I don't understand in the lazy merge kit format though, can you please clarify it? Does the interleaving mean some layers are repeated (for example, [0-20] + [10-30] results in [0-10] + [10-20] + [10-20] + [10-30])?

Thank you in advance.

ngxson avatar Feb 29 '24 10:02 ngxson

Yeah, in fact I have a typo in 0-5,7,8-12; it should be 0-6,7,8-12

It's true that the logic for my CONFIG argument is not correct. In fact, it should always be used with the "scale". For example, if I want to take 0-7 from model A and 8-12 from model B:

CONFIG1 = 0-7*1,8-12*0
CONFIG2 = 0-7*0,8-12*1

But I'm planning to redesign the whole thing, though, to prepare support for the "repeated layers" option.

ngxson avatar Feb 29 '24 10:02 ngxson

dtype: float16
merge_method: passthrough
slices:
- sources:
  - layer_range: [0, 10]
    model: 152334H/miqu-1-70b-sf
- sources:
  - layer_range: [5, 15]
    model: 152334H/miqu-1-70b-sf
- sources:
  - layer_range: [10, 20]
    model: 152334H/miqu-1-70b-sf
...

This would result in: 0,1,2,3,4,5,6,7,8,9,5,6,7,8,9,10,11,12,13,14,10,11,12,13,14,15,16,17,18,19...

This is why Frankenmerge models are larger than base models.
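
As an illustration (hypothetical snippet, not part of mergekit), that layer order can be reproduced with a couple of lines of Python, treating each layer_range as half-open:

# Each tuple is one layer_range from the passthrough config above.
slices = [(0, 10), (5, 15), (10, 20)]
layers = [layer for start, end in slices for layer in range(start, end)]
print(layers)  # 0..9, then 5..14, then 10..19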

Personally, I would be interested in a hybrid approach, with the ability to merge and layer! I.e. we want a particular output from 2 models (for one model, we could just use it again as the second model), which we'll call 'a' and 'b' for brevity. We want to use a mixture of interleaving and layer merging to get this final output. In this case, the first 3 layers are from model a, the fourth is a mix of models a+b, and the next few layers repeat layers from model b: [a0, a1, a2, a3*0.5+b3*0.5, b4, b5, b6, b5, b6, b7]

Trying to stay with your parameter notation, the closest I could get for the 2 configs would be: model_a 0-2*1,3*0.5,0-5*0 model_b 0-2*0,3*0.5,4-6*1,5-7*1

As both configs must be the same length, for model_a we used 0-5*0 as filler at the end. Does that make sense?

dnhkng avatar Feb 29 '24 10:02 dnhkng

Thanks for the explanation.

This is why Frankenmerge models are larger than base models.

According to discussion #4718, the gguf format may benefit from pointing two weights in the metadata to the same tensor; this way we can have two or more layers using the same weights. I haven't tried this though, but it's surely essential if we want to have repeated layers.

Personally, I would be interested in a hybrid approach, with the ability to merge and layer!

Trying to stay with your parameter notation, the closest I could get for the 2 configs would be: model_a 0-2*1,3*0.5,0-5*0 model_b 0-2*0,3*0.5,4-6*1,5-7*1

Having both merge + repeated layers is great. But for that, I think the whole notation that I invented (0-2*1,3*0.5,0-5*0) is just far too limited. I propose a more readable syntax (written to a file) like:

a0*1 + b0*0
a0*1 + b0*0
a1*0 + b1*1

The file above results in output model having:

  • Layer 0: Model A layer 0
  • Layer 1: Model A layer 0
  • Layer 2: Model B layer 1

It's not as robust as the lazy merge kit syntax (YAML), but it gives us more room to improve in the future.

Additionally, someone could easily write a Python script to convert the lazy merge kit YAML to my syntax (for example, something like the sketch below).
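
For the passthrough case, a hypothetical converter could look roughly like this (assumes PyYAML, one source per slice, and two models named a and b; purely a sketch):

import yaml  # pip install pyyaml

def mergekit_passthrough_to_lines(yaml_text: str) -> list:
    # Turn mergekit passthrough slices into one "a<i>*1 + b<i>*0" line per output layer.
    config = yaml.safe_load(yaml_text)
    lines = []
    for slc in config["slices"]:
        start, end = slc["sources"][0]["layer_range"]
        for layer in range(start, end):
            # Passthrough: take the layer from model "a" at full weight, nothing from "b".
            lines.append("a{0}*1 + b{0}*0".format(layer))
    return lines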

What do you think about this approach?

ngxson avatar Feb 29 '24 11:02 ngxson

Sure, I think we should do it. I was about to start testing Mergekit now, but I can quickly switch gears and write a Python converter script.

According to the discussion https://github.com/ggerganov/llama.cpp/issues/4718, the gguf format may benefit from pointing two weights in the metadata to the same tensor; this way we can have two or more layers using the same weights. I haven't tried this though, but it's surely essential if we want to have repeated layers.

Yes, that would be a better method. I have a large model I know quite well that I've merged manually in ExllamaV2. It took a bit to sort out KV caching though, and there are issues when the model spans multiple GPUs. At first, I would just duplicate.

If you can generate the merging code, I can compare the results of your method to the measured result of my merge.

Update: I could write the Python converter, but now that I look in more detail, I think the layer-by-layer method here is much more powerful. Mergekit only allows either slice interleaving OR linear/spherical interpolation of all layers. The config model you describe is more verbose, but much more powerful. I would prefer that TBH.

TBH, there are two options: 1) easy parsing with just 3 values:

model-a layer, model-b layer, weight of model-a
0,0,1
0,0,1
1,1,0
2,2,0.5

Or YAML, and give all the details:

sources:
  - model-a: 152334H/miqu-1-70b-sf
  - model-b: 152334H/other-model-b-70b-sf
  - model-c: 152334H/other-model-c-70b-sf      # we can then add as many models as we want
layers:
  - 1:
    model-a:
       layer_source:1
       weight:0.5
    model-b:
       layer_source:1
       weight:0.5
    method:linear               # and offer various interpolation methods
  - 2:
    model-a:
       layer_source:2
       weight:0.0
    model-b:
       layer_source:2
       weight:1.0
    method:linear
  - 3:
    model-a:
       layer_source:3
       weight:0.3
    model-b:
       layer_source:5
       weight:0.3
    model-c:
       layer_source:5
       weight:0.4
    method:slerp
  - 4:
    model-a:
       layer_source:4
       weight:1.0
    method:none               # and do straight passthrough of a single layer if needed

dnhkng avatar Feb 29 '24 12:02 dnhkng

Thanks for the input, I'll need to rework this PR in the next days.

Regarding the format, I still think having the ability to specify the weights of a and b separately can be interesting. I don't know what would happen if we take weightA*0.5 + weightB*0.6, for example (so the total weight becomes 1.1). It's also useful when you merge 3 models: the first pass can have weightA*0.33 + weightB*0.33, then the second pass adds weightC*0.33.

The CSV format should simplify the C++ parser code though; I'll consider that.

The YAML format is readable, but unfortunately we can never include a YAML parser in llama.cpp.

However, having it as the input of your Python script (with the Python script converting that YAML into CSV or something else llama.cpp can understand) would be very useful.

ngxson avatar Feb 29 '24 12:02 ngxson

Yes, the YAML could be converted to CSV easily, if we leave out various interpolation types.

For completeness, I would explicitly put in all weights and normalise them to a sum of 1.0, i.e. for two models:

model-a layer, model-b layer, weight of model-a, weight of model-b
0,0,1.0,0.0
0,0,1.0,0.0
1,1,0.0,1.0
2,2,0.5,0.5

and for three models:

model-a layer, model-b layer, model-c layer, weight of model-a, weight of model-b, weight of model-c
0,0,0,1.0,0.0,0.0
0,0,0,1.0,0.0,0.0
1,1,1,0.0,1.0,0.0
2,2,2,0.5,0.5,0.0
3,3,3,0.3,0.3,0.3

The last layer here gets normalised to 1/3, 1/3, 1/3.
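
In other words, the normalisation is just dividing each weight in a row by the row's total, e.g.:

weights = [0.3, 0.3, 0.3]
total = sum(weights)
normalized = [w / total for w in weights]  # [0.333..., 0.333..., 0.333...]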

dnhkng avatar Feb 29 '24 13:02 dnhkng

@dnhkng I updated my PR to have the ability to:

  • Merge multiple models at once (not just 2 models)
  • Use the CSV format that we discussed

To simplify my CSV parsing code, I chose the column order "model - scale - model - scale" (instead of "model - model - scale - scale").

0,1.0,0,0.0    meaning: output layer 0 = A[0]*1.0 + B[0] * 0.0
0,1.0,0,0.0    meaning: output layer 1 = A[0]*1.0 + B[0] * 0.0
1,0.0,2,0.0    meaning: output layer 2 = A[1]*0.0 + B[2] * 0.0
2,0.5,1,0.5    meaning: output layer 3 = A[2]*0.5 + B[1] * 0.5

If you add the third model, the columns become "model - scale - model - scale - model - scale"

I tried it myself and confirmed that the output model can be loaded and runs inference without any problem. What I could not verify is whether the merging result (semantic result) is good or not (in other words, whether it did A*scale + B*scale correctly). Can you verify this? Thank you!
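
For anyone double-checking the semantics, here is a minimal NumPy sketch of what one CSV row is expected to do per tensor (an illustration of the formula, not the actual C++ code):

import numpy as np

def merge_tensors(a, scale_a, b, scale_b):
    # Row "2,0.5,1,0.5": output layer = A[2]*0.5 + B[1]*0.5, applied tensor-wise.
    return a * scale_a + b * scale_b

a = np.random.rand(4, 4).astype(np.float32)
b = np.random.rand(4, 4).astype(np.float32)
out = merge_tensors(a, 0.5, b, 0.5)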

ngxson avatar Mar 01 '24 13:03 ngxson

FYI, I was also thinking of adding the ability to merge quantized models, but at this stage it's quite tricky: I must dequantize, do the calculations in float, then re-quantize again. Currently I'm staying with a single-threaded model for simplification, but the whole "dequant-requant" thing should be done with multi-threading; too tricky for now.

ngxson avatar Mar 01 '24 14:03 ngxson

Could you add a branch for pass-through (no linear interpolation) of quantized models?

I have a use case for that right now!

i.e. a single model quantized model, with repeating layers.

The issue is that, from my tests, model self-merging only starts to help from 34B models and up. At FP16, that's a huge amount of RAM required!

I have a model that is a positive outlier on a difficult LLM benchmark, so it should be relatively clear whether the merge worked. It's a 70B model, so I'll need to run the tests on an 80Gb GPU. Interpolating layers would be an added benefit in the future though!

I will pull your code and try on FP16 Llama7B now, but I know all outputs will be worse than the base model. However, I know regions of "really bad", and "slightly bad", so I can see if it is at least making sense.

dnhkng avatar Mar 01 '24 14:03 dnhkng

I'll try quantized models later. At least, loading a q4_K model and then outputting it as f16 is not too complicated. Only the requant part is too tricky for me.

Also, just out of curiosity: if you merge the model and then use ./quantize to re-quant it again, does that work for you? This way it takes a lot of disk space, but you'll eventually get a model small enough to fit into RAM.

One thing I'll try to work on is the ability to re-use the same tensor for repeated layers. For now, if the output model has duplicated layers, the associated tensor data will be duplicated (not ideal).

ngxson avatar Mar 01 '24 14:03 ngxson

Reusing layers makes sense, but the caching is tricky.

There's a discussion on my pull request for ExllamaV2 here: https://github.com/turboderp/exllamav2/pull/275

dnhkng avatar Mar 01 '24 14:03 dnhkng

I'll try quantized models later. At least, loading a q4_K model and then outputting it as f16 is not too complicated. Only the requant part is too tricky for me.

Also, just out of curiosity: if you merge the model and then use ./quantize to re-quant it again, does that work for you? This way it takes a lot of disk space, but you'll eventually get a model small enough to fit into RAM.

One thing I'll try to work on is the ability to re-use the same tensor for repeated layers. For now, if the output model has duplicated layers, the associated tensor data will be duplicated (not ideal).

I can try Q4 -> FP16 and re-quantization. I'll keep watching this pull request and test it when it's ready. Intermediate disk space is fine, I have a few TB of SSD free ;)

dnhkng avatar Mar 01 '24 14:03 dnhkng

Reusing layers makes sense, but the caching is tricky.

Personally, I think a shared cache among layers is not technically possible though. While the weights are the same, the KV is calculated from the embeddings produced by the layers before it (correct me if I'm wrong).

For example, when you have 2 consecutive layers with the same weights, W[0] == W[1], then KV[1] = W[1]*(W[0]*KV[0]).

P/S: I was actually bad at math in high school / university. Nowadays, with all this machine learning stuff, I still imagine a "tensor" as a "Rubik's cube" in my head.

ngxson avatar Mar 01 '24 14:03 ngxson

Reusing layers makes sense, but the caching is tricky.

Personally, I think a shared cache among layers is not technically possible though. While the weights are the same, the KV is calculated from the embeddings produced by the layers before it (correct me if I'm wrong).

For example, when you have 2 consecutive layers with the same weights, W[0] == W[1], then KV[1] = W[1]*(W[0]*KV[0]).

Yes, you can't share the cache; it would get overwritten when the higher layer is processed... But it still works! The results are worse, though that's not unexpected. The fact that it even slightly works is crazy.

I have done quite a lot of testing on various permutations of layers, and most are worse, but there are a few interesting combinations. GGUF would be the best way to share them, as going via FP16 torch tensors, then merging, then converting to GGUF and finally quantizing seems like a lot of wasted effort! Better to experiment in ExllamaV2 dynamically, then build and distribute in GGUF.

dnhkng avatar Mar 01 '24 14:03 dnhkng

Yes, you can't share the cache; it would get overwritten when the higher layer is processed... But it still works! The results are worse, though that's not unexpected. The fact that it even slightly works is crazy.

Interesting! In fact, I even tried removing the last layer of a 7B model and it still works (about 80-90% of the time). I think it's because the neural network behaves like a living creature: it can have some "cells" removed or malfunctioning but is still able to "recover" itself.

ngxson avatar Mar 01 '24 14:03 ngxson

@dnhkng We now accept quantized models as input, but only output a non-quantized FP16 model (you can re-quant it using the ./quantize tool). Can you give it a try? Thanks!

ngxson avatar Mar 01 '24 16:03 ngxson

As a sanity test, I tried this config (config.csv) on the full-precision model:

0, 0, 0.5, 0.5
1, 1, 0.5, 0.5
2, 2, 0.5, 0.5
3, 3, 0.5, 0.5
4, 4, 0.5, 0.5
5, 5, 0.5, 0.5
6, 6, 0.5, 0.5
7, 7, 0.5, 0.5
8, 8, 0.5, 0.5
9, 9, 0.5, 0.5
10, 10, 0.5, 0.5
11, 11, 0.5, 0.5
12, 12, 0.5, 0.5
13, 13, 0.5, 0.5
14, 14, 0.5, 0.5
15, 15, 0.5, 0.5
16, 16, 0.5, 0.5
17, 17, 0.5, 0.5
18, 18, 0.5, 0.5
19, 19, 0.5, 0.5
20, 20, 0.5, 0.5
21, 21, 0.5, 0.5

With this call: ./merge -c config.csv -o OUTPUT_FILE.gguf -m tinyllama-claude_16bit_GGUF-unsloth.F16.gguf -m tinyllama-claude_16bit_GGUF-unsloth.F16.gguf

The generated model produces garbage. This should recreate the original model, right?

I also tried:


0, 0, 0.0, 1.0
1, 1, 0.0, 1.0
2, 2, 0.0, 1.0
3, 3, 0.0, 1.0
4, 4, 0.0, 1.0
5, 5, 0.0, 1.0
6, 6, 0.0, 1.0
7, 7, 0.0, 1.0
8, 8, 0.0, 1.0
9, 9, 0.0, 1.0
10, 10, 0.0, 1.0
11, 11, 0.0, 1.0
12, 12, 0.0, 1.0
13, 13, 0.0, 1.0
14, 14, 0.0, 1.0
15, 15, 0.0, 1.0
16, 16, 0.0, 1.0
17, 17, 0.0, 1.0
18, 18, 0.0, 1.0
19, 19, 0.0, 1.0
20, 20, 0.0, 1.0
21, 21, 0.0, 1.0

That also didn't work on FP16. Maybe I'm passing the parameters incorrectly?

I'll test quants now.

dnhkng avatar Mar 01 '24 16:03 dnhkng

  • The column order is model - scale - model - scale, not all models then all scales; can you re-check it?
  • Maybe also remove the spaces in the CSV, for example change 21, 21, 0.0, 1.0 to 21,0.0,21,1.0

ngxson avatar Mar 01 '24 16:03 ngxson

Still not working. Could you let me know which model you are testing on, so that I can narrow down the issue?

dnhkng avatar Mar 01 '24 16:03 dnhkng

I added a debug message to test if the parser is correct:

Parsing configurations:
- Layer 0 = + model[0].layer[0]*1 + model[1].layer[0]*0
- Layer 1 = + model[0].layer[1]*0 + model[1].layer[2]*1

I'm using dolphin-mistral q4_K_M. However, I could only merge about 5 layers because my RAM is not big enough to load the FP16 output if I merge all layers.

One option to test the output: can we somehow dump the tensors and compare them visually? On https://github.com/ggerganov/llama.cpp/pull/5810 they dump the first + last 3 elements of each tensor. I'm going to do just that, but later, maybe tonight (France timezone, you're in Germany right?). If you're interested, can you do a small gguf-py script to do that? It will be handy in the future, I think.
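
Something like this gguf-py snippet could be a starting point (an untested sketch; it prints the raw first and last 3 elements of each tensor, which for quantized tensors are block bytes rather than floats):

from gguf.gguf_reader import GGUFReader

def dump_tensor_edges(path, n=3):
    reader = GGUFReader(path)
    for tensor in reader.tensors:
        # Raw data view of each tensor, for quick eyeball comparison between files.
        print(tensor.name, tensor.data[:n], tensor.data[-n:])

dump_tensor_edges("OUTPUT_FILE.gguf")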

Maybe it's because I modified my code to support quantized models, so somewhere there are problems with memory alignment.

ngxson avatar Mar 01 '24 16:03 ngxson

Debug looks ok:

./merge -c config.csv -o OUTPUT_FILE.gguf -m tinyllama-claude_16bit_GGUF-unsloth.F16.gguf -m tinyllama-claude_16bit_GGUF-unsloth.F16.gguf 
Parsing configurations:
- Layer 0 = + model[0].layer[0]*1 + model[1].layer[0]*0
- Layer 1 = + model[0].layer[1]*1 + model[1].layer[1]*0
- Layer 2 = + model[0].layer[2]*1 + model[1].layer[2]*0
- Layer 3 = + model[0].layer[3]*1 + model[1].layer[3]*0
- Layer 4 = + model[0].layer[4]*1 + model[1].layer[4]*0
- Layer 5 = + model[0].layer[5]*1 + model[1].layer[5]*0
- Layer 6 = + model[0].layer[6]*1 + model[1].layer[6]*0
- Layer 7 = + model[0].layer[7]*1 + model[1].layer[7]*0
- Layer 8 = + model[0].layer[8]*1 + model[1].layer[8]*0
- Layer 9 = + model[0].layer[9]*1 + model[1].layer[9]*0
- Layer 10 = + model[0].layer[10]*1 + model[1].layer[10]*0
- Layer 11 = + model[0].layer[11]*1 + model[1].layer[11]*0
- Layer 12 = + model[0].layer[12]*1 + model[1].layer[12]*0
- Layer 13 = + model[0].layer[13]*1 + model[1].layer[13]*0
- Layer 14 = + model[0].layer[14]*1 + model[1].layer[14]*0
- Layer 15 = + model[0].layer[15]*1 + model[1].layer[15]*0
- Layer 16 = + model[0].layer[16]*1 + model[1].layer[16]*0
- Layer 17 = + model[0].layer[17]*1 + model[1].layer[17]*0
- Layer 18 = + model[0].layer[18]*1 + model[1].layer[18]*0
- Layer 19 = + model[0].layer[19]*1 + model[1].layer[19]*0
- Layer 20 = + model[0].layer[20]*1 + model[1].layer[20]*0
- Layer 21 = + model[0].layer[21]*1 + model[1].layer[21]*0
llama_model_loader: loaded meta data with 22 key-value pairs and 201 tensors from tinyllama-claude_16bit_GGUF-unsloth.F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Cossale
llama_model_loader: - kv   2:                       llama.context_length u32              = 2048
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv   4:                          llama.block_count u32              = 22
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 5632
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 4
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 1
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - type  f32:   45 tensors
llama_model_loader: - type  f16:  156 tensors
llama_model_loader: loaded meta data with 22 key-value pairs and 201 tensors from tinyllama-claude_16bit_GGUF-unsloth.F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Cossale
llama_model_loader: - kv   2:                       llama.context_length u32              = 2048
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv   4:                          llama.block_count u32              = 22
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 5632
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 4
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 1
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - type  f32:   45 tensors
llama_model_loader: - type  f16:  156 tensors
====> Set new value of llama.block_count = 22
blk.0.attn_k.weight
blk.0.attn_norm.weight
blk.0.attn_output.weight
blk.0.attn_q.weight
blk.0.attn_v.weight
...

But the merged model still only generates garbage.

I'm testing with https://huggingface.co/Cossale/tinyllama-claude_16bit_GGUF/tree/main. This model is FP16 and, well, tiny! Maybe it fits in your RAM?

dnhkng avatar Mar 01 '24 16:03 dnhkng

OK, the issue is the tensors are empty!

Original model (https://huggingface.co/Cossale/tinyllama-claude_16bit_GGUF/tree/main):
...
blk.21.attn_norm.weight        | Shape: 2048            | Size: 2048         | Quantization: F32
[0.42382812 0.43945312 0.48046875 0.43945312 0.43164062 0.4453125
 0.41796875 0.4296875  0.4140625  0.47070312]
blk.21.ffn_norm.weight         | Shape: 2048            | Size: 2048         | Quantization: F32
[0.55859375 0.55078125 0.54296875 0.578125   0.55078125 0.56640625
 0.58203125 0.55078125 0.5625     0.55078125]
output_norm.weight             | Shape: 2048            | Size: 2048         | Quantization: F32
[1.921875  1.8203125 1.9453125 1.984375  1.9140625 1.90625   1.9140625
 1.6640625 1.9296875 1.9765625]
output.weight                  | Shape: 2048x32000      | Size: 65536000     | Quantization: F16
[ 0.01141  -0.02356  -0.02466  -0.000475 -0.01239  -0.003082  0.01117
 -0.00095   0.00903   0.007782]

Merged model:

blk.21.attn_output.weight      | Shape: 2048x2048       | Size: 4194304      | Quantization: F16
[-0. -0. -0.  0.  0.  0. -0.  0.  0.  0.]
blk.21.attn_q.weight           | Shape: 2048x2048       | Size: 4194304      | Quantization: F16
[-0. -0. -0.  0. -0. -0.  0. -0.  0.  0.]
blk.21.attn_v.weight           | Shape: 2048x256        | Size: 524288       | Quantization: F16
[-0. -0.  0. -0.  0. -0.  0.  0.  0.  0.]
blk.21.ffn_down.weight         | Shape: 5632x2048       | Size: 11534336     | Quantization: F16
[ 0.  0. -0.  0.  0.  0. -0.  0. -0.  0.]
blk.21.ffn_gate.weight         | Shape: 2048x5632       | Size: 11534336     | Quantization: F16
[ 0. -0.  0.  0. -0.  0. -0.  0. -0. -0.]
blk.21.ffn_norm.weight         | Shape: 2048            | Size: 2048         | Quantization: F32
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
blk.21.ffn_up.weight           | Shape: 2048x5632       | Size: 11534336     | Quantization: F16
[-0.  0. -0.  0.  0.  0.  0. -0.  0. -0.]

Code:

# Dump the name and first 10 raw elements of every tensor in the GGUF file
from gguf.gguf_reader import GGUFReader

gguf_file_path = 'tinyllama-claude_16bit_GGUF-unsloth.F16.gguf'
reader = GGUFReader(gguf_file_path)
for tensor in reader.tensors:
    print(tensor.name)
    print(tensor.data[:10])

dnhkng avatar Mar 01 '24 17:03 dnhkng

Nice, thanks for the info! It's true that I have a misalignment somewhere; I'll have a look tonight.

ngxson avatar Mar 01 '24 17:03 ngxson

@dnhkng I rewrote the part that actually does the calculation. As a side effect, you can now input + output quantized models (yay, that's what you asked for).

It still doesn't work though. My test case: model A and model B are the same, with this merge config:

2,0.1,2,0.9
3,0.1,3,0.9
4,0.1,4,0.9
5,0.1,5,0.9
6,0.1,6,0.9
...

Basically I want to use the merge to "copy" the model by merging it with itself.

It still doesn't work though. I added a debug log to print the first 3 elements of each tensor. Everything looks correct; I still don't understand why it doesn't work...

===> INPUT  [29] -0.005869 -0.000692 -0.007163
===> OUTPUT [29] -0.005869 -0.000692 -0.007163
[ 271/ 291]               blk.29.ffn_gate.weight - [ 4096, 14336,     1,     1], input type =   q4_K
===> INPUT  [29] 3.799654 3.939128 3.812827
===> OUTPUT [29] 3.799654 3.939128 3.812827
[ 272/ 291]               blk.29.ffn_norm.weight - [ 4096,     1,     1,     1], input type =    f32
===> INPUT  [29] 0.000940 0.001669 -0.005622
===> OUTPUT [29] 0.000940 0.001669 -0.005622
[ 273/ 291]                 blk.29.ffn_up.weight - [ 4096, 14336,     1,     1], input type =   q4_K

ngxson avatar Mar 01 '24 21:03 ngxson

I finally got it working.

You can now use a quantized model as input and it will be re-quantized (imatrix is not supported; only q4 and up are supported).

ngxson avatar Mar 01 '24 23:03 ngxson