imatrix: add option to display importance score statistics for a given imatrix file
A new --show-statistics option generates a report highlighting which tensors/layers contribute the most to a model's output. The report is sorted from highest to lowest influence. The process computes the average importance score per tensor/layer, calculates each one's percentage contribution to the total, and exits immediately after completion.
This PR can be used along with quantize: Handle user-defined quantization levels for additional tensors to do layer-wise quantization similar, though not identical, to the process described in Layer-Wise Quantization: A Pragmatic and Effective Method for Quantizing LLMs Beyond Integer Bit-Levels
Output example:
llama-imatrix --in-file imatrix-DeepSeek-R1-Distill-Llama-8B-small.dat --show-statistics
Computing statistics for imatrix-DeepSeek-R1-Distill-Llama-8B-small.dat (225 tensors)
Layer Tensor μ(Importance Scores) Contribution
================================================================================
- output 5523.92 13.9226 %
27 attn_v 356.58 0.8987 %
27 attn_k 356.58 0.8987 %
27 attn_q 356.58 0.8987 %
24 attn_k 347.19 0.8751 %
24 attn_q 347.19 0.8751 %
24 attn_v 347.19 0.8751 %
25 attn_q 346.77 0.8740 %
25 attn_k 346.77 0.8740 %
25 attn_v 346.77 0.8740 %
29 attn_v 344.46 0.8682 %
...
0 ffn_down 0.09 0.0002 %
Nice idea, seems like something we discussed last time? @bartowski1182
Btw, is it possible to show importance scores from an existing imatrix file @EAddario ?
Thank you @ngxson. Yes, it will process any imatrix file produced by llama-imatrix, but it is restricted to a single file (it does not deal with multiple --in-file)
Isn't this just related to the hidden state norms getting larger as you move through the different layers? If so, then it won't really account for the accumulation of errors caused by an early layer on the final output?
Not sure if I'm understanding the comment correctly @jukofyork, but the logic I'm using to identify the most influential tensors/layers is to simply average the importance scores (IS) for each, add those averages together, and then compute their individual contributions from the total.
The logic llama-imatrix uses to calculate the IS is to square the value of the activations fed to the corresponding weights during inference, keep a running total of how many times that particular value has been updated, and then save the average when inference has finished.
This only applies to 2d or larger tensors, so it will ignore norms (1d), but since errors influence which weights get updated (and how frequently), the IS does account for errors, albeit indirectly.
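For illustration, a minimal Python sketch of that accumulation, conceptual only (the names `sums`, `counts` and `collect` are placeholders, not the actual IMatrixCollector code):

```python
import numpy as np

# Conceptual sketch: for every matrix multiplication, square the input
# activations and keep a running sum plus an update count per input channel.
sums = {}    # tensor name -> per-channel sum of squared activations
counts = {}  # tensor name -> number of updates seen so far

def collect(tensor_name: str, activations: np.ndarray):
    """activations: (n_tokens, n_channels) inputs fed to this tensor."""
    sq = (activations ** 2).sum(axis=0)           # per-channel sum of squares
    sums[tensor_name] = sums.get(tensor_name, 0.0) + sq
    counts[tensor_name] = counts.get(tensor_name, 0) + activations.shape[0]

def importance_scores(tensor_name: str) -> np.ndarray:
    # the stored score is the running average of the squared activations
    return sums[tensor_name] / counts[tensor_name]
```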
Make sense?
@EAddario
I think the mean squared activations (which would be their variance assuming a mean of 0) cannot really be compared across tensors without some kind of normalization, because the values of the model weights can also affect the relative importance of the activations. (llama-imatrix calculates the sum of squared activations and their count, it doesn't directly take into account the model weights; it's only when quantizing that they are taken into account (and even then it depends on the type))
The goal here is to find which layers need more precision, right?
I'm not sure if the mean squared activations really are what you're looking for.
There might be other measures like skewness and kurtosis which may be useful. But I'm not sure if taking only the activations into account is the right way to get the insights you seek.
What I'd like to try eventually would be to use a simultaneous quantization algorithm to try multiple bit-widths at once in a reasonable amount of time so that the errors can be compared per tensor to help with the choice of quantization type.
This would be possible for x[i] ≈ q[i] * s types using a cumulative search similar to #12557, but I don't know how to do that with x[i] ≈ q[i] * s - m types yet.
I still think it can be useful to have some way to visualize what is in imatrix files and/or the distribution of the activations. But not all the necessary information is kept in imatrix files, only the per-channel sum of squared activations, which is a bit limiting for this purpose. Adding more measures (like the mean, skewness and kurtosis, either per-tensor or per-channel) in the file would be easier after #9400.
In the paper you link (https://arxiv.org/pdf/2406.17415), the closest thing to what you propose would be the LIM (layer input modification) score, which is calculated as follows (in Section 3.1), where $L_i$ is the i-th layer, and $L_i^I$ are the input activations and $L_i^O$ the corresponding output activations:
$$ LIM(L_i) = -\frac{L_i^I \cdot L_i^O}{\left|L_i^I\right| \left|L_i^O\right|} $$
llama-imatrix technically has access to both the input and output activations of a layer, but only uses its input.
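For reference, a small Python sketch of the LIM score exactly as defined above (the `layer_in` / `layer_out` arrays are hypothetical; llama-imatrix does not currently compute this):

```python
import numpy as np

def lim_score(layer_in: np.ndarray, layer_out: np.ndarray) -> float:
    """Negative cosine similarity between a layer's input and output
    activations, as in Section 3.1 of the Layer-Wise Quantization paper."""
    num = float(np.dot(layer_in, layer_out))
    den = float(np.linalg.norm(layer_in) * np.linalg.norm(layer_out))
    return -num / den
```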
Very clear now, thanks @compilade. You're correct, I'm using the mean squared activations to identify which tensors/layers produce large-magnitude activations and ~~whilst~~ agree it isn't as accurate as, say, correlation / covariance / LIM ~~I think it's still a reasonable proxy, especially considering how the importance scores are actually used during quantization (quant_weights in ggml-quants.c)~~
I had a quick look at your PRs. I definitely like the idea of storing imatrix data in GGUF format and can appreciate how it would improve the generation of these types of stats. #12557 is quite intriguing but, truth be told, I haven't had a chance to digest it fully (there's a lot going on!). I would love to see it merged, especially if it improves ternary quants.
Had a chance to think this more thoroughly and now get the implications of @jukofyork and @compilade's comments. Agree my current approach is not really identifying influence but rather score "growth". Back to the drawing table 😆
I can help you with this, but it will need a fair bit of compute to calculate. I've not got time to explain fully but basically:
- Decide on what you are optimising: L2-error in the final hidden-state, perplexity (ie: "wellcalibratedness" of the top choice), KL-divergence (ie: "wellcalibratedness" of the full probability distribution), earth-movers-distance, hinge-loss, or whatever.
- Use some form of (2-sided) Finite-Differences to estimate the gradient of the loss you are optimising with respect to moving up/down 1 bit of quant for a given parameter group (eg: layer-based or tensor-based groupings).
You will likely have to transform the loss measure somehow:
- Perplexity is actually just a transformed version of negative log-loss, as is McFadden's Pseudo-R-squared and a whole host of different domain-specific measures of "wellcalibratedness". The fact people often plot the `log-PPL` suggests this is not a good metric to use for this...
- The real thing you are measuring is "bits" (in the Information Theory sense; not the normal colloquial term) and negative log-loss has a nice interpretation for this (the late David MacKay's book Information Theory, Inference, and Learning Algorithms is an amazing read to see the links if you are more interested in this!).
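For concreteness (standard definitions, not part of the original comment): with mean negative log-likelihood $\bar{\ell}$ measured over $T$ tokens,

$$
\bar{\ell} = -\frac{1}{T}\sum_{t=1}^{T} \ln p(x_t \mid x_{<t}), \qquad
\mathrm{PPL} = e^{\bar{\ell}}, \qquad
\text{bits/token} = \frac{\bar{\ell}}{\ln 2}
$$

so plotting log-PPL is just plotting the mean negative log-loss, i.e. nats (or, divided by $\ln 2$, bits) per token.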
Assuming Finite-Differences is too costly to perform, then you can use a stochastic approximation (FDSA) or its extension SPSA to estimate the gradients using whatever compute you can muster up.
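A rough sketch of the SPSA estimator applied to a vector of per-group bit widths (everything here, including `loss_fn` and the ±1-bit perturbation size, is hypothetical and only illustrates the estimator, not an existing llama.cpp workflow):

```python
import numpy as np

def spsa_gradient(loss_fn, bits: np.ndarray, n_samples: int = 4) -> np.ndarray:
    """Estimate d(loss)/d(bits) for per-group bit widths using simultaneous
    perturbation: perturb every group by +/-1 bit at once and average the
    two-sided finite-difference estimates."""
    grad = np.zeros_like(bits, dtype=np.float64)
    for _ in range(n_samples):
        delta = np.random.choice([-1.0, 1.0], size=bits.shape)  # Rademacher
        loss_plus = loss_fn(bits + delta)
        loss_minus = loss_fn(bits - delta)
        grad += (loss_plus - loss_minus) / (2.0 * delta)
    return grad / n_samples
```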
I've edited the post above quite a lot, so it should hopefully make more sense (in case you're reading from the email notification).
Thank you, now I know what I'm doing over the weekend 😁
On a serious note, much appreciated @jukofyork. Plenty of food for thought. I'll give it proper consideration
No problem and just remember the most important thing to figure out is exactly what you are optimising first! There are actually a lot of compelling options for this; each with their own reasons for and against... All have different costs to compute too:
- Metrics using the full probability distribution like KL-divergence or earth-movers distance are the most expensive.
- Then metrics that need a probability and have to pass through softmax are next.
- Then metrics that require multiplication with `lm_head` (which in modern models can be >> `hidden_dim`!) are next.
- Metrics involving the final hidden state are the cheapest.
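To make the cost ordering concrete, a small sketch of the two extremes (hypothetical logits / hidden-state arrays):

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(ref_logits: np.ndarray, quant_logits: np.ndarray) -> float:
    """Most expensive end: needs the full vocab-sized distributions."""
    p, q = softmax(ref_logits), softmax(quant_logits)
    return float(np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean())

def hidden_state_l2(ref_h: np.ndarray, quant_h: np.ndarray) -> float:
    """Cheapest end: no lm_head multiply, no softmax."""
    return float(np.linalg.norm(ref_h - quant_h, axis=-1).mean())
```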
Following from @jukofyork and @compilade's remarks and suggestions, I've made some changes in my approach.
To set the context and explain exactly what problem I'm trying to solve, I have two objectives in mind:
- find a way to identify and rank which tensors/layers are most influential during inference, and
- implement changes in a 100% backwards compatible way. That is, they must work with any imatrix file already generated.
The direct implication of constraint "2" is no changes to IMatrixCollector::collect_imatrix, meaning that "1" has to rely solely on the importance scores (IS) stored in imatrix files, without access to the underlying weights.
As noted by @compilade, IS "...cannot really be compared across tensors without some kind of normalization, because the values of the model weights can also affect the relative importance of the activations...". However, IS are a direct measurement of how active a particular weight was during inference, based on a given input prompt (more on this later), and can therefore be used as an (arguably suboptimal) proxy for "influence". Instead of relying on the average, a better metric is the sum of IS per tensor/layer: the higher the number, the "busier" the tensor/layer and the more it contributes to upstream computations.
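As a rough illustration of that ranking, a sketch only (`imatrix_scores` is a hypothetical mapping of tensor name to per-channel importance scores):

```python
import numpy as np

def rank_by_sum_of_scores(imatrix_scores: dict[str, np.ndarray]) -> list[tuple[str, float]]:
    """Rank tensors by the sum of their per-channel importance scores
    (the figure referred to as the sum of IS per tensor), highest first."""
    totals = {name: float(scores.sum()) for name, scores in imatrix_scores.items()}
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Usage example (hypothetical): rank only the attn_k tensors across layers.
# attn_k_only = {n: s for n, s in all_scores.items() if "attn_k" in n}
# print(rank_by_sum_of_scores(attn_k_only)[:5])
```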
Although there are better metrics (e.g. gradient of loss, covariance, LIM, etc.), those would require changes to the imatrix collection process, which is beyond the scope of what I'm trying to do, at least for now. Having said that, it's worth keeping an eye on the work @ubergarm is doing in WIP Compute per layer LIM Scores during imatrix
Tests performed during quantization of DeepSeek-R1-Distill-Qwen-7B seem to confirm that Σ(Bias), which is what I'm calling the sum of IS per tensor, is a good influence indicator, as can be seen in the table below, where (↑) represents quantizing half of the most influential tensors (as per Σ(Bias)) at a higher bit level, and (↓) represents quantizing half of the least influential tensors at a higher bit level:
| Model | μPPL (↑) | 𝜌PPL (↑) | μKLD (↑) | RMS Δp (↑) | μPPL (↓) | 𝜌PPL (↓) | μKLD (↓) | RMS Δp (↓) |
|---|---|---|---|---|---|---|---|---|
| IQ3_M | 28.740047 ±0.291290 | 97.19% | 0.229742 ±0.000770 | 11.793 ±0.050 | 28.721610 ±0.288684 | 96.94% | 0.249550 ±0.000841 | 12.332 ±0.053 |
| IQ3_S | 30.290800 ±0.307742 | 96.32% | 0.310982 ±0.001014 | 13.415 ±0.057 | 31.315997 ±0.316217 | 95.95% | 0.341996 ±0.001082 | 14.292 ±0.058 |
| IQ4_NL | 23.570503 ±0.226124 | 98.59% | 0.102854 ±0.000465 | 8.080 ±0.046 | 23.862907 ±0.226366 | 98.51% | 0.117131 ±0.000395 | 8.560 ±0.040 |
| Q3_K_L | 24.160705 ±0.229989 | 97.75% | 0.173336 ±0.000603 | 10.337 ±0.048 | 24.853047 ±0.240164 | 97.56% | 0.195060 ±0.000681 | 10.801 ±0.050 |
| Q3_K_M | 24.967196 ±0.239198 | 97.50% | 0.194299 ±0.000681 | 10.877 ±0.050 | 25.212714 ±0.244888 | 97.31% | 0.214337 ±0.000747 | 11.278 ±0.052 |
| Q3_K_S | 25.661098 ±0.246635 | 96.84% | 0.243850 ±0.000852 | 12.143 ±0.054 | 25.916397 ±0.250857 | 96.60% | 0.270237 ±0.000928 | 12.737 ±0.057 |
| Q4_K_M | 23.125382 ±0.221860 | 99.24% | 0.053997 ±0.000215 | 5.795 ±0.032 | 23.283282 ±0.223537 | 99.13% | 0.065186 ±0.000241 | 6.273 ±0.034 |
| Q4_K_S | 23.156199 ±0.222000 | 99.18% | 0.058337 ±0.000233 | 6.026 ±0.034 | 23.263445 ±0.223330 | 99.08% | 0.069488 ±0.000261 | 6.429 ±0.035 |
| Q5_K_M | 22.726887 ±0.217691 | 99.75% | 0.013562 ±0.000062 | 2.924 ±0.020 | 22.903038 ±0.220259 | 99.72% | 0.015792 ±0.000063 | 3.114 ±0.019 |
| Q5_K_S | 22.766826 ±0.218244 | 99.74% | 0.014589 ±0.000070 | 3.024 ±0.020 | 22.892603 ±0.220059 | 99.71% | 0.017023 ±0.000073 | 3.231 ±0.020 |
| Q6_K | 22.859294 ±0.219461 | 99.87% | 0.004317 ±0.000022 | 1.682 ±0.016 | 22.847118 ±0.219384 | 99.86% | 0.004950 ±0.000021 | 1.767 ±0.012 |
| Q8_0 | 22.840693 ±0.219408 | 99.90% | 0.001614 ±0.000011 | 1.050 ±0.010 | 22.832647 ±0.219310 | 99.90% | 0.001830 ±0.000024 | 1.110 ±0.016 |
For reference, compared to the naive Q4_K_M model, the layer-wise quantized version is 10.7% smaller (4.68GB vs 4.18GB) with only a 0.35% penalty on μPPL:
| Model | μPPL | 𝜌PPL | μKLD | RMS Δp |
|---|---|---|---|---|
| Q4_K_M | 22.936432 ±0.220488 | 99.59% | 0.026917 ±0.000105 | 4.100 ±0.024 |
Whilst I was considering @jukofyork's feedback, I came to think about how much the benefit of using an imatrix depends on the quality of the prompt used during its generation, and how difficult it is to determine how well a given prompt "exercises" all of the model's capabilities, so I added additional statistics to help in that regard.
As things stand at the moment, --show-statistics produces the following statistics:
- Σ(Bias): the sum of all squared activations across the tensor (i.e. the importance scores)
- Min & Max: minimum and maximum activation values
- μ & σ: the activations' mean and standard deviation
- % Active: proportion of elements whose average activation exceeds a very small threshold (1e-6). Helpful to determine how alive/dormant the tensor is during inference
- N: number of activations in the tensor
- Entropy: entropy of the activation distribution, in bits (standard Shannon entropy measurement) $S = -\sum_{i=1}^N p_i \log_2 p_i$
- E (norm): normalized entropy, $E_{norm}=\frac{-\sum_{i=1}^N p_i \log_2 p_i}{\log_2 N}$. These two metrics can be used to determine how well a prompt "exercises" the model's capabilities
- ZD Score: z-score distribution as described in 3.1 Layer Importance Scores in the Layer-Wise Quantization paper
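A compact sketch of how these per-tensor figures could be derived from a vector of per-channel scores (illustrative only; the threshold and field names mirror the descriptions above, not the actual implementation):

```python
import numpy as np

def tensor_stats(scores: np.ndarray, eps: float = 1e-6) -> dict:
    """scores: per-channel average squared activations for one tensor."""
    total = float(scores.sum())                       # Σ(Bias)
    p = scores / total if total > 0 else np.full_like(scores, 1.0 / scores.size)
    p = np.clip(p, 1e-12, None)                       # avoid log(0)
    entropy = float(-(p * np.log2(p)).sum())          # Shannon entropy, bits
    std = float(scores.std()) or 1.0                  # guard against zero variance
    z = (scores - scores.mean()) / std                # per-channel z-scores
    return {
        "sum_bias": total,
        "min": float(scores.min()),
        "max": float(scores.max()),
        "mean": float(scores.mean()),
        "std": float(scores.std()),
        "pct_active": float((scores > eps).mean() * 100.0),
        "n": int(scores.size),
        "entropy": entropy,
        "entropy_norm": float(entropy / np.log2(scores.size) * 100.0),
        "zd_score": float((z > 1.0).mean() * 100.0),  # % of channels with z > 1
    }
```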
Thanks for the update and defining the statistics gleaned from an existing imatrix.dat file. I pulled your branch and gave it a try on LLaMA-2-13B to compare against the same model used in that Layer-wise Quantization Paper (likely different quantization).
compute imatrix and then show statistics
Compute imatrix
$ git branch | grep '*'
* (HEAD detached at EAddario/imatrix)
$ git rev-parse --short HEAD
200d88c8
$ ./build/bin/llama-imatrix --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
version: 5136 (200d88c8)
built with cc (GCC) 14.2.1 20250128 for x86_64-pc-linux-gnu
$ ./build/bin/llama-imatrix \
--verbosity 1 \
-m /mnt/astrodata/llm/models/TheBloke/Llama-2-13B-chat-GGUF/llama-2-13b-chat.Q8_0.gguf \
-f wiki.test.raw \
-o imatrix-wiki-test-llama-2-13b-chat-Q8_0-gguf.dat \
--ctx-size 512 \
--threads 16
...
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q8_0: 282 tensors
...
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 397.256 ms
compute_imatrix: computing over 655 chunks with batch_size 512
compute_imatrix: 1.44 seconds per pass - ETA 15.73 minutes
[1]4.8087,[2]5.4272,[3]6.3040,[4]7.0129,[5]7.1984,[6]7.0947,[7]7.2490,[8]7.3314,[9]7.5682,
...
Final estimate: PPL = 6.5257 +/- 0.04210
save_imatrix: stored collected data after 655 chunks in imatrix-wiki-test-llama-2-13b-chat-Q8_0-gguf.dat
llama_perf_context_print: load time = 22623.39 ms
llama_perf_context_print: prompt eval time = 861807.99 ms / 335360 tokens ( 2.57 ms per token, 389.14 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 891205.70 ms / 335361 tokens
Show Statistics
$ ./build/bin/llama-imatrix \
--in-file imatrix-wiki-test-llama-2-13b-chat-Q8_0-gguf.dat \
--show-statistics
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
Computing statistics for imatrix-wiki-test-llama-2-13b-chat-Q8_0-gguf.dat (280 tensors)
Layer Tensor Σ(Bias) Min Max μ σ % Active N Entropy E (norm) ZD Score
==========================================================================================================================================================================
30 attn_q 1321.16 0.0000 22.1645 0.2580 0.5248 99.98% 5120 11.8988 96.57% 5.4688
30 attn_v 1321.16 0.0000 22.1645 0.2580 0.5248 99.98% 5120 11.8988 96.57% 5.4688
30 attn_k 1321.16 0.0000 22.1645 0.2580 0.5248 99.98% 5120 11.8988 96.57% 5.4688
39 ffn_down 1290.84 0.0042 29.1379 0.0934 0.4147 100.00% 13824 12.1372 88.24% 25.9693
32 attn_v 1285.53 0.0000 17.6335 0.2511 0.4668 99.98% 5120 11.9402 96.90% 5.4688
32 attn_k 1285.53 0.0000 17.6335 0.2511 0.4668 99.98% 5120 11.9402 96.90% 5.4688
32 attn_q 1285.53 0.0000 17.6335 0.2511 0.4668 99.98% 5120 11.9402 96.90% 5.4688
34 attn_q 1256.21 0.0000 14.0536 0.2454 0.4260 99.98% 5120 11.9679 97.13% 5.6641
34 attn_v 1256.21 0.0000 14.0536 0.2454 0.4260 99.98% 5120 11.9679 97.13% 5.6641
34 attn_k 1256.21 0.0000 14.0536 0.2454 0.4260 99.98% 5120 11.9679 97.13% 5.6641
29 attn_k 1204.44 0.0000 23.4754 0.2352 0.5280 99.98% 5120 11.8456 96.13% 5.4688
29 attn_v 1204.44 0.0000 23.4754 0.2352 0.5280 99.98% 5120 11.8456 96.13% 5.4688
29 attn_q 1204.44 0.0000 23.4754 0.2352 0.5280 99.98% 5120 11.8456 96.13% 5.4688
33 attn_q 1183.21 0.0000 14.3861 0.2311 0.3921 99.98% 5120 11.9785 97.21% 5.4688
33 attn_v 1183.21 0.0000 14.3861 0.2311 0.3921 99.98% 5120 11.9785 97.21% 5.4688
33 attn_k 1183.21 0.0000 14.3861 0.2311 0.3921 99.98% 5120 11.9785 97.21% 5.4688
31 attn_k 1182.86 0.0000 20.5292 0.2310 0.4778 99.98% 5120 11.8971 96.55% 5.4688
31 attn_v 1182.86 0.0000 20.5292 0.2310 0.4778 99.98% 5120 11.8971 96.55% 5.4688
31 attn_q 1182.86 0.0000 20.5292 0.2310 0.4778 99.98% 5120 11.8971 96.55% 5.4688
35 attn_k 1173.15 0.0000 12.3308 0.2291 0.3496 99.98% 5120 12.0212 97.56% 5.6641
35 attn_v 1173.15 0.0000 12.3308 0.2291 0.3496 99.98% 5120 12.0212 97.56% 5.6641
35 attn_q 1173.15 0.0000 12.3308 0.2291 0.3496 99.98% 5120 12.0212 97.56% 5.6641
28 attn_v 1161.62 0.0000 24.2086 0.2269 0.5975 99.98% 5120 11.7171 95.09% 5.6641
28 attn_q 1161.62 0.0000 24.2086 0.2269 0.5975 99.98% 5120 11.7171 95.09% 5.6641
28 attn_k 1161.62 0.0000 24.2086 0.2269 0.5975 99.98% 5120 11.7171 95.09% 5.6641
27 attn_q 1152.05 0.0000 21.7389 0.2250 0.5541 99.98% 5120 11.7706 95.53% 5.4688
27 attn_k 1152.05 0.0000 21.7389 0.2250 0.5541 99.98% 5120 11.7706 95.53% 5.4688
27 attn_v 1152.05 0.0000 21.7389 0.2250 0.5541 99.98% 5120 11.7706 95.53% 5.4688
36 attn_q 1125.94 0.0000 12.8438 0.2199 0.3751 99.98% 5120 11.9677 97.13% 5.8594
36 attn_k 1125.94 0.0000 12.8438 0.2199 0.3751 99.98% 5120 11.9677 97.13% 5.8594
36 attn_v 1125.94 0.0000 12.8438 0.2199 0.3751 99.98% 5120 11.9677 97.13% 5.8594
38 attn_k 1072.28 0.0151 12.4462 0.2094 0.3015 100.00% 5120 12.0386 97.70% 6.4453
38 attn_v 1072.28 0.0151 12.4462 0.2094 0.3015 100.00% 5120 12.0386 97.70% 6.4453
38 attn_q 1072.28 0.0151 12.4462 0.2094 0.3015 100.00% 5120 12.0386 97.70% 6.4453
37 attn_v 1071.17 0.0126 14.2128 0.2092 0.3167 100.00% 5120 12.0204 97.55% 6.2500
37 attn_k 1071.17 0.0126 14.2128 0.2092 0.3167 100.00% 5120 12.0204 97.55% 6.2500
37 attn_q 1071.17 0.0126 14.2128 0.2092 0.3167 100.00% 5120 12.0204 97.55% 6.2500
25 attn_v 1037.08 0.0000 23.9319 0.2026 0.6313 99.98% 5120 11.5734 93.93% 5.4688
25 attn_q 1037.08 0.0000 23.9319 0.2026 0.6313 99.98% 5120 11.5734 93.93% 5.4688
25 attn_k 1037.08 0.0000 23.9319 0.2026 0.6313 99.98% 5120 11.5734 93.93% 5.4688
26 attn_k 1031.55 0.0031 25.6229 0.2015 0.6353 100.00% 5120 11.5771 93.96% 5.6641
26 attn_v 1031.55 0.0031 25.6229 0.2015 0.6353 100.00% 5120 11.5771 93.96% 5.6641
26 attn_q 1031.55 0.0031 25.6229 0.2015 0.6353 100.00% 5120 11.5771 93.96% 5.6641
24 attn_k 955.35 0.0000 20.3266 0.1866 0.5947 99.98% 5120 11.5271 93.55% 5.8594
24 attn_q 955.35 0.0000 20.3266 0.1866 0.5947 99.98% 5120 11.5271 93.55% 5.8594
24 attn_v 955.35 0.0000 20.3266 0.1866 0.5947 99.98% 5120 11.5271 93.55% 5.8594
23 attn_k 950.08 0.0000 22.1702 0.1856 0.6765 99.98% 5120 11.3836 92.39% 5.4688
23 attn_v 950.08 0.0000 22.1702 0.1856 0.6765 99.98% 5120 11.3836 92.39% 5.4688
23 attn_q 950.08 0.0000 22.1702 0.1856 0.6765 99.98% 5120 11.3836 92.39% 5.4688
39 attn_q 926.54 0.0431 16.0860 0.1810 0.2805 100.00% 5120 12.0610 97.88% 5.8594
39 attn_k 926.54 0.0431 16.0860 0.1810 0.2805 100.00% 5120 12.0610 97.88% 5.8594
39 attn_v 926.54 0.0431 16.0860 0.1810 0.2805 100.00% 5120 12.0610 97.88% 5.8594
22 attn_v 916.79 0.0000 18.9033 0.1791 0.5414 99.98% 5120 11.5694 93.89% 5.8594
22 attn_q 916.79 0.0000 18.9033 0.1791 0.5414 99.98% 5120 11.5694 93.89% 5.8594
22 attn_k 916.79 0.0000 18.9033 0.1791 0.5414 99.98% 5120 11.5694 93.89% 5.8594
38 ffn_down 905.56 0.0059 75.8273 0.0655 0.7782 100.00% 13824 11.5526 83.99% 2.0255
19 attn_q 879.58 0.0100 28.6687 0.1718 0.8143 100.00% 5120 10.9550 88.91% 6.0547
19 attn_v 879.58 0.0100 28.6687 0.1718 0.8143 100.00% 5120 10.9550 88.91% 6.0547
19 attn_k 879.58 0.0100 28.6687 0.1718 0.8143 100.00% 5120 10.9550 88.91% 6.0547
36 ffn_up 870.19 0.0086 1.1614 0.1700 0.0388 100.00% 5120 12.2979 99.81% 38.4766
36 ffn_gate 870.19 0.0086 1.1614 0.1700 0.0388 100.00% 5120 12.2979 99.81% 38.4766
37 ffn_up 866.00 0.0098 1.3722 0.1691 0.0456 100.00% 5120 12.2901 99.74% 40.2344
37 ffn_gate 866.00 0.0098 1.3722 0.1691 0.0456 100.00% 5120 12.2901 99.74% 40.2344
21 attn_k 865.62 0.0092 22.5825 0.1691 0.7082 100.00% 5120 11.1497 90.49% 6.0547
21 attn_q 865.62 0.0092 22.5825 0.1691 0.7082 100.00% 5120 11.1497 90.49% 6.0547
21 attn_v 865.62 0.0092 22.5825 0.1691 0.7082 100.00% 5120 11.1497 90.49% 6.0547
13 attn_k 863.66 0.0136 41.3031 0.1687 1.1620 100.00% 5120 10.2387 83.09% 5.6641
13 attn_q 863.66 0.0136 41.3031 0.1687 1.1620 100.00% 5120 10.2387 83.09% 5.6641
13 attn_v 863.66 0.0136 41.3031 0.1687 1.1620 100.00% 5120 10.2387 83.09% 5.6641
3 ffn_down 863.54 0.0001 849.5108 0.0625 7.2252 100.00% 13824 0.2206 1.60% 0.0723
16 attn_v 860.58 0.0155 39.5863 0.1681 1.0040 100.00% 5120 10.5837 85.89% 6.0547
16 attn_q 860.58 0.0155 39.5863 0.1681 1.0040 100.00% 5120 10.5837 85.89% 6.0547
16 attn_k 860.58 0.0155 39.5863 0.1681 1.0040 100.00% 5120 10.5837 85.89% 6.0547
14 attn_q 859.59 0.0144 48.8121 0.1679 1.2058 100.00% 5120 10.1958 82.75% 5.4688
14 attn_v 859.59 0.0144 48.8121 0.1679 1.2058 100.00% 5120 10.1958 82.75% 5.4688
14 attn_k 859.59 0.0144 48.8121 0.1679 1.2058 100.00% 5120 10.1958 82.75% 5.4688
18 attn_k 843.95 0.0084 26.9360 0.1648 0.7675 100.00% 5120 10.9957 89.24% 6.0547
18 attn_v 843.95 0.0084 26.9360 0.1648 0.7675 100.00% 5120 10.9957 89.24% 6.0547
18 attn_q 843.95 0.0084 26.9360 0.1648 0.7675 100.00% 5120 10.9957 89.24% 6.0547
17 attn_k 842.77 0.0124 33.2876 0.1646 0.8841 100.00% 5120 10.7489 87.23% 5.8594
17 attn_v 842.77 0.0124 33.2876 0.1646 0.8841 100.00% 5120 10.7489 87.23% 5.8594
17 attn_q 842.77 0.0124 33.2876 0.1646 0.8841 100.00% 5120 10.7489 87.23% 5.8594
38 ffn_up 840.16 0.0088 2.6975 0.1641 0.0626 100.00% 5120 12.2701 99.58% 36.9141
38 ffn_gate 840.16 0.0088 2.6975 0.1641 0.0626 100.00% 5120 12.2701 99.58% 36.9141
35 ffn_up 835.32 0.0068 1.1382 0.1631 0.0333 100.00% 5120 12.3025 99.84% 40.2344
35 ffn_gate 835.32 0.0068 1.1382 0.1631 0.0333 100.00% 5120 12.3025 99.84% 40.2344
15 attn_q 820.47 0.0159 44.4388 0.1602 1.1185 100.00% 5120 10.2600 83.27% 5.2734
15 attn_v 820.47 0.0159 44.4388 0.1602 1.1185 100.00% 5120 10.2600 83.27% 5.2734
15 attn_k 820.47 0.0159 44.4388 0.1602 1.1185 100.00% 5120 10.2600 83.27% 5.2734
20 attn_k 810.73 0.0080 22.8515 0.1583 0.7303 100.00% 5120 10.9871 89.17% 6.0547
20 attn_v 810.73 0.0080 22.8515 0.1583 0.7303 100.00% 5120 10.9871 89.17% 6.0547
20 attn_q 810.73 0.0080 22.8515 0.1583 0.7303 100.00% 5120 10.9871 89.17% 6.0547
34 ffn_up 799.17 0.0067 1.0181 0.1561 0.0281 100.00% 5120 12.3064 99.87% 38.2812
34 ffn_gate 799.17 0.0067 1.0181 0.1561 0.0281 100.00% 5120 12.3064 99.87% 38.2812
12 attn_v 782.01 0.0126 46.9238 0.1527 1.2340 100.00% 5120 9.8808 80.19% 5.2734
12 attn_q 782.01 0.0126 46.9238 0.1527 1.2340 100.00% 5120 9.8808 80.19% 5.2734
12 attn_k 782.01 0.0126 46.9238 0.1527 1.2340 100.00% 5120 9.8808 80.19% 5.2734
33 ffn_up 764.58 0.0056 0.8259 0.1493 0.0239 100.00% 5120 12.3087 99.89% 46.4844
33 ffn_gate 764.58 0.0056 0.8259 0.1493 0.0239 100.00% 5120 12.3087 99.89% 46.4844
32 ffn_gate 736.26 0.0046 0.7709 0.1438 0.0227 100.00% 5120 12.3091 99.90% 45.8984
32 ffn_up 736.26 0.0046 0.7709 0.1438 0.0227 100.00% 5120 12.3091 99.90% 45.8984
10 attn_v 713.91 0.0092 39.3571 0.1394 1.0706 100.00% 5120 9.9807 81.00% 5.6641
10 attn_k 713.91 0.0092 39.3571 0.1394 1.0706 100.00% 5120 9.9807 81.00% 5.6641
10 attn_q 713.91 0.0092 39.3571 0.1394 1.0706 100.00% 5120 9.9807 81.00% 5.6641
9 attn_v 709.57 0.0059 35.1349 0.1386 0.9907 100.00% 5120 10.0564 81.61% 6.6406
9 attn_k 709.57 0.0059 35.1349 0.1386 0.9907 100.00% 5120 10.0564 81.61% 6.6406
9 attn_q 709.57 0.0059 35.1349 0.1386 0.9907 100.00% 5120 10.0564 81.61% 6.6406
31 ffn_gate 706.57 0.0035 0.5213 0.1380 0.0190 100.00% 5120 12.3114 99.91% 53.9062
31 ffn_up 706.57 0.0035 0.5213 0.1380 0.0190 100.00% 5120 12.3114 99.91% 53.9062
11 attn_k 695.69 0.0103 44.5534 0.1359 1.1356 100.00% 5120 9.7664 79.26% 5.4688
11 attn_q 695.69 0.0103 44.5534 0.1359 1.1356 100.00% 5120 9.7664 79.26% 5.4688
11 attn_v 695.69 0.0103 44.5534 0.1359 1.1356 100.00% 5120 9.7664 79.26% 5.4688
30 ffn_gate 678.07 0.0041 0.5778 0.1324 0.0203 100.00% 5120 12.3097 99.90% 47.6562
30 ffn_up 678.07 0.0041 0.5778 0.1324 0.0203 100.00% 5120 12.3097 99.90% 47.6562
39 ffn_gate 648.54 0.0191 5.6152 0.1267 0.0890 100.00% 5120 12.2396 99.33% 12.3047
39 ffn_up 648.54 0.0191 5.6152 0.1267 0.0890 100.00% 5120 12.2396 99.33% 12.3047
29 ffn_up 647.83 0.0048 0.4959 0.1265 0.0169 100.00% 5120 12.3115 99.92% 62.6953
29 ffn_gate 647.83 0.0048 0.4959 0.1265 0.0169 100.00% 5120 12.3115 99.92% 62.6953
28 ffn_up 621.34 0.0073 0.4593 0.1214 0.0171 100.00% 5120 12.3108 99.91% 59.5703
28 ffn_gate 621.34 0.0073 0.4593 0.1214 0.0171 100.00% 5120 12.3108 99.91% 59.5703
27 ffn_gate 596.51 0.0036 0.5035 0.1165 0.0176 100.00% 5120 12.3092 99.90% 63.4766
27 ffn_up 596.51 0.0036 0.5035 0.1165 0.0176 100.00% 5120 12.3092 99.90% 63.4766
8 attn_q 595.64 0.0067 34.9034 0.1163 0.8977 100.00% 5120 9.9023 80.36% 5.8594
8 attn_v 595.64 0.0067 34.9034 0.1163 0.8977 100.00% 5120 9.9023 80.36% 5.8594
8 attn_k 595.64 0.0067 34.9034 0.1163 0.8977 100.00% 5120 9.9023 80.36% 5.8594
37 ffn_down 592.02 0.0074 16.6926 0.0428 0.1790 100.00% 13824 12.6990 92.32% 25.3906
26 ffn_gate 568.09 0.0044 0.5478 0.1110 0.0182 100.00% 5120 12.3079 99.89% 53.3203
26 ffn_up 568.09 0.0044 0.5478 0.1110 0.0182 100.00% 5120 12.3079 99.89% 53.3203
25 ffn_gate 542.26 0.0052 0.5749 0.1059 0.0192 100.00% 5120 12.3055 99.87% 47.0703
25 ffn_up 542.26 0.0052 0.5749 0.1059 0.0192 100.00% 5120 12.3055 99.87% 47.0703
7 attn_k 536.38 0.0000 37.2838 0.1048 0.9200 99.98% 5120 9.3955 76.25% 6.6406
7 attn_q 536.38 0.0000 37.2838 0.1048 0.9200 99.98% 5120 9.3955 76.25% 6.6406
7 attn_v 536.38 0.0000 37.2838 0.1048 0.9200 99.98% 5120 9.3955 76.25% 6.6406
24 ffn_gate 513.76 0.0061 0.6509 0.1003 0.0216 100.00% 5120 12.3012 99.83% 37.5000
24 ffn_up 513.76 0.0061 0.6509 0.1003 0.0216 100.00% 5120 12.3012 99.83% 37.5000
6 attn_k 511.80 0.0000 34.5247 0.1000 0.7756 99.98% 5120 9.8035 79.56% 7.4219
6 attn_v 511.80 0.0000 34.5247 0.1000 0.7756 99.98% 5120 9.8035 79.56% 7.4219
6 attn_q 511.80 0.0000 34.5247 0.1000 0.7756 99.98% 5120 9.8035 79.56% 7.4219
36 ffn_down 493.83 0.0075 5.3032 0.0357 0.0743 100.00% 13824 13.0480 94.86% 44.4879
23 ffn_gate 488.15 0.0045 0.7809 0.0953 0.0255 100.00% 5120 12.2943 99.78% 17.9688
23 ffn_up 488.15 0.0045 0.7809 0.0953 0.0255 100.00% 5120 12.2943 99.78% 17.9688
22 ffn_up 461.78 0.0070 0.8592 0.0902 0.0298 100.00% 5120 12.2841 99.69% 12.8906
22 ffn_gate 461.78 0.0070 0.8592 0.0902 0.0298 100.00% 5120 12.2841 99.69% 12.8906
5 attn_k 461.03 0.0000 27.0042 0.0900 0.7100 99.96% 5120 9.4849 76.98% 8.9844
5 attn_v 461.03 0.0000 27.0042 0.0900 0.7100 99.96% 5120 9.4849 76.98% 8.9844
5 attn_q 461.03 0.0000 27.0042 0.0900 0.7100 99.96% 5120 9.4849 76.98% 8.9844
21 ffn_up 432.89 0.0068 1.0011 0.0845 0.0359 100.00% 5120 12.2675 99.56% 10.5469
21 ffn_gate 432.89 0.0068 1.0011 0.0845 0.0359 100.00% 5120 12.2675 99.56% 10.5469
4 attn_k 416.60 0.0000 25.1496 0.0814 0.6785 99.96% 5120 9.2580 75.13% 9.9609
4 attn_v 416.60 0.0000 25.1496 0.0814 0.6785 99.96% 5120 9.2580 75.13% 9.9609
4 attn_q 416.60 0.0000 25.1496 0.0814 0.6785 99.96% 5120 9.2580 75.13% 9.9609
35 ffn_down 411.85 0.0053 7.9751 0.0298 0.0819 100.00% 13824 13.0757 95.06% 28.2841
20 ffn_gate 403.55 0.0171 1.2925 0.0788 0.0435 100.00% 5120 12.2438 99.37% 8.7891
20 ffn_up 403.55 0.0171 1.2925 0.0788 0.0435 100.00% 5120 12.2438 99.37% 8.7891
19 ffn_gate 382.99 0.0103 1.2834 0.0748 0.0409 100.00% 5120 12.2452 99.38% 8.9844
19 ffn_up 382.99 0.0103 1.2834 0.0748 0.0409 100.00% 5120 12.2452 99.38% 8.9844
18 ffn_gate 360.11 0.0086 1.1621 0.0703 0.0419 100.00% 5120 12.2340 99.29% 9.1797
18 ffn_up 360.11 0.0086 1.1621 0.0703 0.0419 100.00% 5120 12.2340 99.29% 9.1797
34 ffn_down 343.68 0.0057 1.9176 0.0249 0.0342 100.00% 13824 13.3093 96.76% 43.4028
17 ffn_up 336.38 0.0122 1.4292 0.0657 0.0480 100.00% 5120 12.2045 99.05% 8.5938
17 ffn_gate 336.38 0.0122 1.4292 0.0657 0.0480 100.00% 5120 12.2045 99.05% 8.5938
16 ffn_gate 311.79 0.0122 1.7776 0.0609 0.0573 100.00% 5120 12.1552 98.65% 8.3984
16 ffn_up 311.79 0.0122 1.7776 0.0609 0.0573 100.00% 5120 12.1552 98.65% 8.3984
33 ffn_down 307.16 0.0097 7.3743 0.0222 0.0698 100.00% 13824 13.2318 96.20% 14.9740
15 ffn_up 288.24 0.0109 2.0467 0.0563 0.0615 100.00% 5120 12.1205 98.37% 8.0078
15 ffn_gate 288.24 0.0109 2.0467 0.0563 0.0615 100.00% 5120 12.1205 98.37% 8.0078
14 ffn_up 272.26 0.0103 2.6254 0.0532 0.0710 100.00% 5120 12.0645 97.91% 7.8125
14 ffn_gate 272.26 0.0103 2.6254 0.0532 0.0710 100.00% 5120 12.0645 97.91% 7.8125
32 ffn_down 270.24 0.0095 0.7403 0.0195 0.0193 100.00% 13824 13.4759 97.97% 46.8027
13 ffn_up 254.86 0.0113 2.6888 0.0498 0.0725 100.00% 5120 12.0363 97.68% 7.2266
13 ffn_gate 254.86 0.0113 2.6888 0.0498 0.0725 100.00% 5120 12.0363 97.68% 7.2266
31 ffn_down 250.66 0.0086 0.9231 0.0181 0.0188 100.00% 13824 13.4937 98.10% 43.7645
12 ffn_gate 239.95 0.0166 2.6666 0.0469 0.0752 100.00% 5120 11.9867 97.28% 7.2266
12 ffn_up 239.95 0.0166 2.6666 0.0469 0.0752 100.00% 5120 11.9867 97.28% 7.2266
30 ffn_down 237.44 0.0079 0.5803 0.0172 0.0149 100.00% 13824 13.5080 98.20% 50.7812
11 ffn_up 230.23 0.0148 2.8725 0.0450 0.0777 100.00% 5120 11.9567 97.04% 7.0312
11 ffn_gate 230.23 0.0148 2.8725 0.0450 0.0777 100.00% 5120 11.9567 97.04% 7.0312
29 ffn_down 227.64 0.0074 6.8119 0.0165 0.0593 100.00% 13824 13.3079 96.75% 7.5231
10 ffn_up 220.84 0.0059 2.3218 0.0431 0.0624 100.00% 5120 12.0437 97.74% 7.4219
10 ffn_gate 220.84 0.0059 2.3218 0.0431 0.0624 100.00% 5120 12.0437 97.74% 7.4219
39 attn_output 213.80 0.0049 1.7995 0.0418 0.0570 100.00% 5120 11.6992 94.95% 90.6250
3 attn_k 212.66 0.0000 17.1690 0.0415 0.4298 99.98% 5120 8.5517 69.40% 7.0312
3 attn_q 212.66 0.0000 17.1690 0.0415 0.4298 99.98% 5120 8.5517 69.40% 7.0312
3 attn_v 212.66 0.0000 17.1690 0.0415 0.4298 99.98% 5120 8.5517 69.40% 7.0312
9 ffn_gate 211.89 0.0064 1.9591 0.0414 0.0548 100.00% 5120 12.0596 97.87% 7.6172
9 ffn_up 211.89 0.0064 1.9591 0.0414 0.0548 100.00% 5120 12.0596 97.87% 7.6172
2 attn_v 211.81 0.0000 13.5470 0.0414 0.5105 99.86% 5120 7.5117 60.96% 5.0781
2 attn_q 211.81 0.0000 13.5470 0.0414 0.5105 99.86% 5120 7.5117 60.96% 5.0781
2 attn_k 211.81 0.0000 13.5470 0.0414 0.5105 99.86% 5120 7.5117 60.96% 5.0781
28 ffn_down 210.59 0.0071 0.7934 0.0152 0.0169 100.00% 13824 13.4661 97.90% 42.6794
27 ffn_down 204.54 0.0061 8.1876 0.0148 0.0705 100.00% 13824 13.2151 96.08% 4.0509
26 ffn_down 195.28 0.0058 3.9368 0.0141 0.0383 100.00% 13824 13.2929 96.64% 14.0336
8 ffn_gate 189.36 0.0115 1.6949 0.0370 0.0461 100.00% 5120 12.0880 98.10% 7.8125
8 ffn_up 189.36 0.0115 1.6949 0.0370 0.0461 100.00% 5120 12.0880 98.10% 7.8125
38 attn_output 185.57 0.0016 1.4583 0.0362 0.0547 100.00% 5120 11.5948 94.10% 53.1250
25 ffn_down 177.29 0.0051 0.8608 0.0128 0.0142 100.00% 13824 13.4412 97.72% 47.8877
24 ffn_down 167.83 0.0045 0.8385 0.0121 0.0184 100.00% 13824 13.3351 96.95% 32.1904
7 ffn_up 167.13 0.0085 1.2138 0.0326 0.0395 100.00% 5120 12.0921 98.13% 6.8359
7 ffn_gate 167.13 0.0085 1.2138 0.0326 0.0395 100.00% 5120 12.0921 98.13% 6.8359
23 ffn_down 161.22 0.0045 1.2035 0.0117 0.0192 100.00% 13824 13.3102 96.77% 31.1777
22 ffn_down 150.90 0.0038 0.8320 0.0109 0.0151 100.00% 13824 13.3489 97.05% 39.8582
1 attn_k 148.63 0.0000 22.4289 0.0290 0.5286 99.80% 5120 5.8192 47.23% 3.7109
1 attn_q 148.63 0.0000 22.4289 0.0290 0.5286 99.80% 5120 5.8192 47.23% 3.7109
1 attn_v 148.63 0.0000 22.4289 0.0290 0.5286 99.80% 5120 5.8192 47.23% 3.7109
21 ffn_down 147.96 0.0036 1.6641 0.0107 0.0245 100.00% 13824 13.1859 95.86% 19.8206
6 ffn_up 143.83 0.0134 0.7677 0.0281 0.0279 100.00% 5120 12.1471 98.58% 7.4219
6 ffn_gate 143.83 0.0134 0.7677 0.0281 0.0279 100.00% 5120 12.1471 98.58% 7.4219
37 attn_output 127.32 0.0007 1.2476 0.0249 0.0382 100.00% 5120 11.6690 94.70% 36.5234
36 attn_output 124.95 0.0022 0.7087 0.0244 0.0317 100.00% 5120 11.7572 95.42% 64.4531
20 ffn_down 119.81 0.0030 0.3580 0.0087 0.0095 100.00% 13824 13.4021 97.44% 53.0237
5 ffn_gate 114.26 0.0015 0.5836 0.0223 0.0180 100.00% 5120 12.1927 98.95% 8.2031
5 ffn_up 114.26 0.0015 0.5836 0.0223 0.0180 100.00% 5120 12.1927 98.95% 8.2031
19 ffn_down 110.82 0.0026 0.5981 0.0080 0.0117 100.00% 13824 13.3221 96.85% 37.1817
18 ffn_down 100.26 0.0026 1.6162 0.0073 0.0172 100.00% 13824 13.2686 96.46% 18.5185
17 ffn_down 91.33 0.0017 0.9219 0.0066 0.0102 100.00% 13824 13.3992 97.41% 30.8883
4 ffn_gate 87.21 0.0002 0.2963 0.0170 0.0101 100.00% 5120 12.2345 99.29% 10.5469
4 ffn_up 87.21 0.0002 0.2963 0.0170 0.0101 100.00% 5120 12.2345 99.29% 10.5469
16 ffn_down 83.68 0.0018 0.3795 0.0061 0.0068 100.00% 13824 13.4214 97.58% 46.2240
35 attn_output 80.93 0.0009 0.3628 0.0158 0.0178 100.00% 5120 11.8167 95.90% 67.3828
15 ffn_down 69.29 0.0015 0.4523 0.0050 0.0060 100.00% 13824 13.4392 97.70% 43.4028
34 attn_output 68.75 0.0018 0.3458 0.0134 0.0159 100.00% 5120 11.7593 95.43% 90.4297
3 ffn_gate 63.74 0.0000 0.9831 0.0124 0.0160 100.00% 5120 12.1360 98.49% 7.8125
3 ffn_up 63.74 0.0000 0.9831 0.0124 0.0160 100.00% 5120 12.1360 98.49% 7.8125
21 attn_output 63.53 0.0021 0.5559 0.0124 0.0145 100.00% 5120 11.8760 96.38% 53.7109
15 attn_output 63.25 0.0013 0.1506 0.0124 0.0118 100.00% 5120 11.9061 96.62% 81.6406
14 ffn_down 60.91 0.0014 0.3164 0.0044 0.0045 100.00% 13824 13.4907 98.08% 48.8281
32 attn_output 60.46 0.0005 0.4920 0.0118 0.0169 100.00% 5120 11.7173 95.09% 67.5781
14 attn_output 59.20 0.0033 0.2145 0.0116 0.0095 100.00% 5120 12.0477 97.77% 57.4219
31 attn_output 58.85 0.0005 0.4893 0.0115 0.0167 100.00% 5120 11.6401 94.47% 50.1953
16 attn_output 58.58 0.0012 0.1902 0.0114 0.0095 100.00% 5120 12.0063 97.44% 88.8672
17 attn_output 58.46 0.0005 0.2506 0.0114 0.0106 100.00% 5120 11.9494 96.98% 61.5234
33 attn_output 53.96 0.0014 0.2382 0.0105 0.0079 100.00% 5120 12.0467 97.77% 108.9844
24 attn_output 53.59 0.0005 0.5380 0.0105 0.0263 100.00% 5120 11.1589 90.56% 33.2031
13 ffn_down 53.16 0.0012 0.1572 0.0038 0.0035 100.00% 13824 13.5008 98.15% 50.1302
20 attn_output 52.53 0.0015 0.2461 0.0103 0.0114 100.00% 5120 11.8431 96.11% 75.1953
30 attn_output 50.85 0.0007 0.2020 0.0099 0.0085 100.00% 5120 11.9906 97.31% 95.5078
12 ffn_down 46.43 0.0004 0.0648 0.0034 0.0025 100.00% 13824 13.5358 98.41% 70.0231
11 ffn_down 44.24 0.0008 0.4759 0.0032 0.0049 100.00% 13824 13.4624 97.87% 23.6545
13 attn_output 43.56 0.0003 0.1377 0.0085 0.0073 100.00% 5120 11.9801 97.23% 63.0859
12 attn_output 43.40 0.0009 0.1860 0.0085 0.0078 100.00% 5120 11.9642 97.10% 72.8516
11 attn_output 42.74 0.0006 0.5558 0.0083 0.0176 100.00% 5120 11.4660 93.05% 50.1953
25 attn_output 42.61 0.0006 0.3259 0.0083 0.0095 100.00% 5120 11.8723 96.35% 69.9219
23 attn_output 42.58 0.0005 0.1831 0.0083 0.0095 100.00% 5120 11.7843 95.64% 62.6953
19 attn_output 42.16 0.0004 0.2335 0.0082 0.0076 100.00% 5120 12.0083 97.45% 41.7969
26 attn_output 41.73 0.0003 0.2064 0.0082 0.0076 100.00% 5120 11.9276 96.80% 79.4922
27 attn_output 41.03 0.0003 0.8884 0.0080 0.0141 100.00% 5120 11.8718 96.35% 25.7812
22 attn_output 40.76 0.0003 0.1580 0.0080 0.0071 100.00% 5120 11.8881 96.48% 99.6094
18 attn_output 40.68 0.0014 0.2471 0.0079 0.0069 100.00% 5120 12.0482 97.78% 57.2266
10 ffn_down 39.95 0.0006 0.1846 0.0029 0.0025 100.00% 13824 13.5468 98.49% 48.9728
2 ffn_up 38.98 0.0000 0.1812 0.0076 0.0036 100.00% 5120 12.2648 99.54% 7.4219
2 ffn_gate 38.98 0.0000 0.1812 0.0076 0.0036 100.00% 5120 12.2648 99.54% 7.4219
29 attn_output 38.72 0.0016 0.0977 0.0076 0.0053 100.00% 5120 12.0489 97.78% 130.2734
28 attn_output 38.28 0.0006 0.1802 0.0075 0.0064 100.00% 5120 11.9516 96.99% 131.0547
10 attn_output 36.31 0.0004 0.1589 0.0071 0.0085 100.00% 5120 11.7977 95.75% 60.7422
9 ffn_down 36.00 0.0006 0.7241 0.0026 0.0067 100.00% 13824 13.3678 97.19% 10.7784
8 ffn_down 30.51 0.0004 0.3576 0.0022 0.0042 100.00% 13824 13.3650 97.17% 20.4716
9 attn_output 25.89 0.0003 0.1683 0.0051 0.0074 100.00% 5120 11.6535 94.58% 51.5625
7 ffn_down 25.57 0.0002 0.3904 0.0018 0.0055 100.00% 13824 13.1784 95.81% 9.4763
6 ffn_down 18.29 0.0003 0.1456 0.0013 0.0018 100.00% 13824 13.4276 97.62% 35.3733
0 attn_q 18.29 0.0000 5.9196 0.0036 0.0950 94.32% 5120 4.4566 36.17% 4.8828
0 attn_k 18.29 0.0000 5.9196 0.0036 0.0950 94.32% 5120 4.4566 36.17% 4.8828
0 attn_v 18.29 0.0000 5.9196 0.0036 0.0950 94.32% 5120 4.4566 36.17% 4.8828
8 attn_output 17.56 0.0001 0.0978 0.0034 0.0039 100.00% 5120 11.8420 96.10% 55.8594
1 ffn_gate 17.11 0.0000 0.5277 0.0033 0.0083 100.00% 5120 11.9241 96.77% 5.0781
1 ffn_up 17.11 0.0000 0.5277 0.0033 0.0083 100.00% 5120 11.9241 96.77% 5.0781
7 attn_output 13.82 0.0001 0.0629 0.0027 0.0034 100.00% 5120 11.7857 95.65% 51.5625
5 ffn_down 12.69 0.0001 0.3858 0.0009 0.0034 100.00% 13824 13.2589 96.39% 7.2338
6 attn_output 9.60 0.0000 0.0566 0.0019 0.0026 100.00% 5120 11.6751 94.75% 54.8828
4 ffn_down 7.48 0.0001 0.0299 0.0005 0.0006 100.00% 13824 13.4405 97.71% 54.4705
0 ffn_gate 7.24 0.0000 0.3432 0.0014 0.0109 99.94% 5120 9.7065 78.77% 6.4453
0 ffn_up 7.24 0.0000 0.3432 0.0014 0.0109 99.94% 5120 9.7065 78.77% 6.4453
5 attn_output 6.31 0.0000 0.0573 0.0012 0.0018 100.00% 5120 11.7298 95.19% 33.3984
4 attn_output 4.28 0.0000 0.0411 0.0008 0.0016 100.00% 5120 11.5801 93.98% 32.4219
0 ffn_down 4.25 0.0000 3.6589 0.0003 0.0312 99.73% 13824 1.6508 12.00% 0.1447
3 attn_output 3.57 0.0000 0.0637 0.0007 0.0025 100.00% 5120 10.5307 85.46% 26.9531
2 ffn_down 2.67 0.0000 0.0087 0.0002 0.0002 100.00% 13824 13.3953 97.39% 44.5602
1 ffn_down 2.13 0.0000 0.6453 0.0002 0.0061 100.00% 13824 8.4307 61.29% 0.3617
2 attn_output 1.46 0.0000 0.0200 0.0003 0.0005 100.00% 5120 11.4702 93.09% 42.7734
1 attn_output 1.05 0.0000 0.0229 0.0002 0.0006 100.00% 5120 10.2723 83.37% 50.5859
0 attn_output 0.46 0.0000 0.0577 0.0001 0.0011 90.25% 5120 7.1328 57.89% 12.8906
Graph of Entropy & ZD Score by Layer and Tensor
Discussion
So I'm not sure how best to read these stats and interpret the graphs. According to the Layer-wise Quantization Paper, the top 3 most important layers according to their LIM Score are 1, 2, and 40, with the least important being 32, 33, and 34. However, I don't see a correlation in the graphs, at least with Entropy and what you are calling "ZD Score"*
*Just to confirm, what you are calling "ZD Score" is calculated using the imatrix activations, whereas in the paper it is defined over *all weights* in a given layer (my emphasis):
We examine the proportion of weights in a layer exhibiting a z-score greater than 1, where for layer $L_i$, $w_i$ represents an individual weight, µ the mean of the weights, and σ their standard deviation.
Anyway, just some observations. I didn't slice the data to look at the other metrics nor try to normalize all the tensors of a given layer together into a single "layer" score.
Fascinating stuff, hopefully I can dig in more later this week! Cheers!
Fascinating stuff indeed @ubergarm, and apparently not without controversy 🙃
In a room full of PhDs, I'd be Howard Wolowitz 🤣 so, dear reader, please take everything that follows with the proverbial pinch of salt, and do not pull back from pointing out errors or gaps in my logic.
The notion of determining the importance of a specific tensor in a specific layer by somehow measuring the degree of transformation of the hidden states (be it with importance scores, cosine similarity, etc.) as the tokens "flow" from that layer to the next seems, intuitively, reasonable to me and, as a few have correctly pointed out, having access to the weights during those transformations will yield significantly better measurements.
In my case however, and for the reasons explained above, I'm left with the next best option, which is the sum of the squared activations (imatrix importance scores) for specific tensors in specific layers. That's what I'm calling Σ(Bias), in reference to the total "power" of a vector of discrete signals (the sum of the squared elements in the vector). The intuition is that the more bias there is, the busier the tensor. That's as far as I dare to take the EE analogy 😉.
I'm emphasising specific tensor & specific layer to signify that the stats should be used to compare between tensors of the same type only. In other words, thinking that attn_k in layer X has more influence during inference than attn_k in layer Y because its Σ(Bias) is larger makes sense, whilst concluding the same between attn_k and ffn_down does not. I've just pushed a change in how the stats are displayed to better convey this.
To validate the hypothesis we of course need lots of tests, but so far, and based solely on layer-wise quantizing DeepSeek-R1-Distill-Qwen-7B, it seems to hold (approach and results in my previous comment 👆 and corresponding imatrix stats at the end 👇 ). Testing other models is needed, but so far so good.
I have indeed taken the paper's ZD concept and applied it to the activations. Their Z-score Distribution (a better name would be z-score density, IMO) is nothing more than the percentage of elements that have a z-score greater than 1 standard deviation from the mean.
I haven't had a chance to really grok the relevance of this metric, but suspect that in combination with the normalized entropy it may give insights into whole layer scoring, but that's a (pruning) story for another day...
Computing statistics for imatrix-DeepSeek-R1-Distill-Qwen-7B-small.dat (197 tensors)
Layer Tensor Σ(Bias) Min Max μ σ % Active N Entropy E (norm) ZD Score
==========================================================================================================================================================================
27 attn_k 5141.31 0.0578 405.6018 1.4345 8.5063 100.00% 3584 8.2161 69.58% 5.05%
26 attn_k 3514.78 0.0014 336.0238 0.9807 6.3577 100.00% 3584 8.6701 73.43% 4.77%
23 attn_k 2577.34 0.0711 107.3467 0.7191 2.8482 100.00% 3584 9.2976 78.74% 5.36%
25 attn_k 2416.49 0.0523 192.7465 0.6742 3.6958 100.00% 3584 9.4202 79.78% 4.85%
24 attn_k 2345.51 0.0433 235.1290 0.6544 4.3505 100.00% 3584 9.3335 79.05% 2.68%
22 attn_k 2341.42 0.0616 106.0560 0.6533 2.9773 100.00% 3584 9.3443 79.14% 2.87%
21 attn_k 1465.48 0.0488 65.1086 0.4089 1.8415 100.00% 3584 9.7659 82.71% 1.95%
19 attn_k 1354.92 0.0160 64.9419 0.3780 2.0088 100.00% 3584 9.4633 80.15% 1.79%
20 attn_k 1271.46 0.0245 58.6785 0.3548 1.7495 100.00% 3584 9.6939 82.10% 1.84%
16 attn_k 1217.92 0.0000 68.7396 0.3398 1.8574 100.00% 3584 9.2844 78.63% 1.81%
17 attn_k 1193.92 0.0139 50.0219 0.3331 1.5332 100.00% 3584 9.6450 81.69% 1.90%
14 attn_k 1188.44 0.0079 48.7036 0.3316 1.4011 100.00% 3584 9.6869 82.04% 2.37%
18 attn_k 1001.68 0.0072 54.0705 0.2795 1.4768 100.00% 3584 9.6582 81.80% 1.48%
15 attn_k 923.17 0.0020 32.2622 0.2576 1.1821 100.00% 3584 9.4031 79.64% 2.46%
8 attn_k 784.03 0.0082 12.9517 0.2188 0.6849 100.00% 3584 10.1589 86.04% 2.85%
13 attn_k 752.92 0.0000 25.2086 0.2101 0.7649 99.97% 3584 10.2496 86.81% 1.87%
12 attn_k 738.25 0.0061 24.0529 0.2060 0.7757 100.00% 3584 10.1182 85.69% 1.90%
9 attn_k 733.39 0.0000 16.4946 0.2046 0.6262 100.00% 3584 10.5356 89.23% 2.20%
4 attn_k 689.25 0.0000 26.4802 0.1923 1.1755 98.80% 3584 8.4224 71.33% 1.76%
5 attn_k 687.23 0.0000 31.9846 0.1917 0.7180 99.89% 3584 10.1248 85.75% 2.54%
11 attn_k 685.48 0.0080 17.6951 0.1913 0.7004 100.00% 3584 10.0526 85.14% 2.20%
10 attn_k 630.31 0.0076 16.3245 0.1759 0.6634 100.00% 3584 10.1971 86.36% 2.01%
7 attn_k 615.92 0.0000 12.5285 0.1719 0.5429 100.00% 3584 10.4200 88.25% 1.87%
6 attn_k 499.66 0.0000 16.2125 0.1394 0.6909 99.89% 3584 9.6434 81.67% 1.31%
3 attn_k 308.74 0.0000 11.9797 0.0861 0.3259 98.07% 3584 9.5947 81.26% 4.94%
2 attn_k 258.92 0.0000 7.6345 0.0722 0.2554 94.81% 3584 9.8862 83.73% 3.26%
0 attn_k 120.98 0.0000 11.3855 0.0338 0.1961 99.97% 3584 10.8332 91.75% 0.39%
1 attn_k 68.39 0.0000 7.4842 0.0191 0.1749 86.05% 3584 7.8550 66.53% 1.34%
27 attn_output 5664.79 0.1570 47.1631 1.5806 2.8290 100.00% 3584 10.9222 92.50% 5.97%
26 attn_output 1455.48 0.0136 36.9886 0.4061 1.5633 100.00% 3584 10.6218 89.96% 0.67%
23 attn_output 1162.73 0.0292 28.5696 0.3244 1.2175 100.00% 3584 10.4851 88.80% 0.78%
25 attn_output 1087.16 0.0556 39.0104 0.3033 1.6812 100.00% 3584 10.1333 85.82% 0.25%
24 attn_output 802.42 0.0178 12.8809 0.2239 0.5729 100.00% 3584 10.9313 92.58% 1.53%
21 attn_output 583.25 0.0091 3.4697 0.1627 0.2657 100.00% 3584 10.8242 91.67% 7.00%
19 attn_output 574.93 0.0103 4.3428 0.1604 0.3092 100.00% 3584 10.6549 90.24% 7.37%
18 attn_output 498.09 0.0091 5.5657 0.1390 0.2735 100.00% 3584 10.7222 90.81% 7.34%
22 attn_output 394.58 0.0023 3.4242 0.1101 0.1788 100.00% 3584 11.0570 93.65% 4.05%
20 attn_output 387.68 0.0086 6.0710 0.1082 0.2653 100.00% 3584 10.8025 91.49% 2.59%
16 attn_output 313.86 0.0044 4.4249 0.0876 0.1933 100.00% 3584 10.7883 91.37% 3.93%
15 attn_output 297.66 0.0015 2.4456 0.0831 0.1524 100.00% 3584 10.8274 91.70% 5.41%
13 attn_output 272.14 0.0090 4.0031 0.0759 0.1406 100.00% 3584 10.8771 92.12% 6.70%
17 attn_output 267.64 0.0045 5.3183 0.0747 0.2063 100.00% 3584 10.5521 89.37% 2.93%
14 attn_output 259.32 0.0005 12.2898 0.0724 0.2893 100.00% 3584 10.1023 85.56% 2.73%
12 attn_output 201.57 0.0050 3.6905 0.0562 0.1336 100.00% 3584 10.6677 90.35% 5.22%
11 attn_output 184.43 0.0049 2.6849 0.0515 0.0968 100.00% 3584 11.0717 93.77% 3.71%
7 attn_output 169.21 0.0022 0.4015 0.0472 0.0414 100.00% 3584 11.3066 95.76% 14.56%
9 attn_output 166.98 0.0021 1.5864 0.0466 0.0605 100.00% 3584 11.1723 94.62% 5.69%
10 attn_output 165.81 0.0026 0.9828 0.0463 0.0536 100.00% 3584 11.3118 95.80% 5.94%
8 attn_output 159.54 0.0019 1.1831 0.0445 0.0583 100.00% 3584 11.1678 94.58% 7.00%
0 attn_output 131.48 0.0005 6.6774 0.0367 0.2584 100.00% 3584 8.9836 76.08% 0.98%
6 attn_output 86.10 0.0007 0.3468 0.0240 0.0258 100.00% 3584 11.2370 95.17% 7.65%
3 attn_output 74.09 0.0010 0.5955 0.0207 0.0225 100.00% 3584 11.2807 95.54% 8.45%
4 attn_output 51.35 0.0002 0.9319 0.0143 0.0335 100.00% 3584 10.8659 92.03% 2.20%
5 attn_output 46.97 0.0011 0.4940 0.0131 0.0244 100.00% 3584 10.9951 93.12% 4.19%
2 attn_output 36.31 0.0010 0.9631 0.0101 0.0260 100.00% 3584 10.8809 92.15% 3.10%
1 attn_output 23.60 0.0001 0.4081 0.0066 0.0181 100.00% 3584 10.5325 89.20% 3.18%
27 attn_q 5141.31 0.0578 405.6018 1.4345 8.5063 100.00% 3584 8.2161 69.58% 5.05%
26 attn_q 3514.78 0.0014 336.0238 0.9807 6.3577 100.00% 3584 8.6701 73.43% 4.77%
23 attn_q 2577.34 0.0711 107.3467 0.7191 2.8482 100.00% 3584 9.2976 78.74% 5.36%
25 attn_q 2416.49 0.0523 192.7465 0.6742 3.6958 100.00% 3584 9.4202 79.78% 4.85%
24 attn_q 2345.51 0.0433 235.1290 0.6544 4.3505 100.00% 3584 9.3335 79.05% 2.68%
22 attn_q 2341.42 0.0616 106.0560 0.6533 2.9773 100.00% 3584 9.3443 79.14% 2.87%
21 attn_q 1465.48 0.0488 65.1086 0.4089 1.8415 100.00% 3584 9.7659 82.71% 1.95%
19 attn_q 1354.92 0.0160 64.9419 0.3780 2.0088 100.00% 3584 9.4633 80.15% 1.79%
20 attn_q 1271.46 0.0245 58.6785 0.3548 1.7495 100.00% 3584 9.6939 82.10% 1.84%
16 attn_q 1217.92 0.0000 68.7396 0.3398 1.8574 100.00% 3584 9.2844 78.63% 1.81%
17 attn_q 1193.92 0.0139 50.0219 0.3331 1.5332 100.00% 3584 9.6450 81.69% 1.90%
14 attn_q 1188.44 0.0079 48.7036 0.3316 1.4011 100.00% 3584 9.6869 82.04% 2.37%
18 attn_q 1001.68 0.0072 54.0705 0.2795 1.4768 100.00% 3584 9.6582 81.80% 1.48%
15 attn_q 923.17 0.0020 32.2622 0.2576 1.1821 100.00% 3584 9.4031 79.64% 2.46%
8 attn_q 784.03 0.0082 12.9517 0.2188 0.6849 100.00% 3584 10.1589 86.04% 2.85%
13 attn_q 752.92 0.0000 25.2086 0.2101 0.7649 99.97% 3584 10.2496 86.81% 1.87%
12 attn_q 738.25 0.0061 24.0529 0.2060 0.7757 100.00% 3584 10.1182 85.69% 1.90%
9 attn_q 733.39 0.0000 16.4946 0.2046 0.6262 100.00% 3584 10.5356 89.23% 2.20%
4 attn_q 689.25 0.0000 26.4802 0.1923 1.1755 98.80% 3584 8.4224 71.33% 1.76%
5 attn_q 687.23 0.0000 31.9846 0.1917 0.7180 99.89% 3584 10.1248 85.75% 2.54%
11 attn_q 685.48 0.0080 17.6951 0.1913 0.7004 100.00% 3584 10.0526 85.14% 2.20%
10 attn_q 630.31 0.0076 16.3245 0.1759 0.6634 100.00% 3584 10.1971 86.36% 2.01%
7 attn_q 615.92 0.0000 12.5285 0.1719 0.5429 100.00% 3584 10.4200 88.25% 1.87%
6 attn_q 499.66 0.0000 16.2125 0.1394 0.6909 99.89% 3584 9.6434 81.67% 1.31%
3 attn_q 308.74 0.0000 11.9797 0.0861 0.3259 98.07% 3584 9.5947 81.26% 4.94%
2 attn_q 258.92 0.0000 7.6345 0.0722 0.2554 94.81% 3584 9.8862 83.73% 3.26%
0 attn_q 120.98 0.0000 11.3855 0.0338 0.1961 99.97% 3584 10.8332 91.75% 0.39%
1 attn_q 68.39 0.0000 7.4842 0.0191 0.1749 86.05% 3584 7.8550 66.53% 1.34%
27 attn_v 5141.31 0.0578 405.6018 1.4345 8.5063 100.00% 3584 8.2161 69.58% 5.05%
26 attn_v 3514.78 0.0014 336.0238 0.9807 6.3577 100.00% 3584 8.6701 73.43% 4.77%
23 attn_v 2577.34 0.0711 107.3467 0.7191 2.8482 100.00% 3584 9.2976 78.74% 5.36%
25 attn_v 2416.49 0.0523 192.7465 0.6742 3.6958 100.00% 3584 9.4202 79.78% 4.85%
24 attn_v 2345.51 0.0433 235.1290 0.6544 4.3505 100.00% 3584 9.3335 79.05% 2.68%
22 attn_v 2341.42 0.0616 106.0560 0.6533 2.9773 100.00% 3584 9.3443 79.14% 2.87%
21 attn_v 1465.48 0.0488 65.1086 0.4089 1.8415 100.00% 3584 9.7659 82.71% 1.95%
19 attn_v 1354.92 0.0160 64.9419 0.3780 2.0088 100.00% 3584 9.4633 80.15% 1.79%
20 attn_v 1271.46 0.0245 58.6785 0.3548 1.7495 100.00% 3584 9.6939 82.10% 1.84%
16 attn_v 1217.92 0.0000 68.7396 0.3398 1.8574 100.00% 3584 9.2844 78.63% 1.81%
17 attn_v 1193.92 0.0139 50.0219 0.3331 1.5332 100.00% 3584 9.6450 81.69% 1.90%
14 attn_v 1188.44 0.0079 48.7036 0.3316 1.4011 100.00% 3584 9.6869 82.04% 2.37%
18 attn_v 1001.68 0.0072 54.0705 0.2795 1.4768 100.00% 3584 9.6582 81.80% 1.48%
15 attn_v 923.17 0.0020 32.2622 0.2576 1.1821 100.00% 3584 9.4031 79.64% 2.46%
8 attn_v 784.03 0.0082 12.9517 0.2188 0.6849 100.00% 3584 10.1589 86.04% 2.85%
13 attn_v 752.92 0.0000 25.2086 0.2101 0.7649 99.97% 3584 10.2496 86.81% 1.87%
12 attn_v 738.25 0.0061 24.0529 0.2060 0.7757 100.00% 3584 10.1182 85.69% 1.90%
9 attn_v 733.39 0.0000 16.4946 0.2046 0.6262 100.00% 3584 10.5356 89.23% 2.20%
4 attn_v 689.25 0.0000 26.4802 0.1923 1.1755 98.80% 3584 8.4224 71.33% 1.76%
5 attn_v 687.23 0.0000 31.9846 0.1917 0.7180 99.89% 3584 10.1248 85.75% 2.54%
11 attn_v 685.48 0.0080 17.6951 0.1913 0.7004 100.00% 3584 10.0526 85.14% 2.20%
10 attn_v 630.31 0.0076 16.3245 0.1759 0.6634 100.00% 3584 10.1971 86.36% 2.01%
7 attn_v 615.92 0.0000 12.5285 0.1719 0.5429 100.00% 3584 10.4200 88.25% 1.87%
6 attn_v 499.66 0.0000 16.2125 0.1394 0.6909 99.89% 3584 9.6434 81.67% 1.31%
3 attn_v 308.74 0.0000 11.9797 0.0861 0.3259 98.07% 3584 9.5947 81.26% 4.94%
2 attn_v 258.92 0.0000 7.6345 0.0722 0.2554 94.81% 3584 9.8862 83.73% 3.26%
0 attn_v 120.98 0.0000 11.3855 0.0338 0.1961 99.97% 3584 10.8332 91.75% 0.39%
1 attn_v 68.39 0.0000 7.4842 0.0191 0.1749 86.05% 3584 7.8550 66.53% 1.34%
27 ffn_down 355884.75 0.0159 6837.1255 18.7861 148.5242 100.00% 18944 10.7816 75.88% 1.45%
26 ffn_down 181419.47 0.0260 43328.5547 9.5766 321.8018 100.00% 18944 9.1996 64.74% 0.10%
25 ffn_down 38754.11 0.0107 2872.8489 2.0457 36.8919 100.00% 18944 10.0465 70.70% 0.26%
24 ffn_down 19443.91 0.0114 2827.7163 1.0264 21.8617 100.00% 18944 10.4168 73.31% 0.28%
23 ffn_down 12473.19 0.0139 1799.1183 0.6584 13.9010 100.00% 18944 10.7399 75.58% 0.31%
3 ffn_down 10822.42 0.0001 989.6157 0.5713 12.3155 100.00% 18944 6.5990 46.44% 0.57%
22 ffn_down 8961.94 0.0151 933.6822 0.4731 7.0275 100.00% 18944 11.4126 80.32% 0.62%
21 ffn_down 3950.82 0.0160 84.4493 0.2086 0.8990 100.00% 18944 12.4962 87.94% 2.68%
4 ffn_down 3913.25 0.0001 1316.8596 0.2066 13.8787 100.00% 18944 3.8574 27.15% 0.07%
20 ffn_down 2835.57 0.0176 104.7299 0.1497 1.0732 100.00% 18944 12.2692 86.35% 1.29%
11 ffn_down 1457.54 0.0101 889.8758 0.0769 6.4658 100.00% 18944 6.2602 44.06% 0.01%
19 ffn_down 1415.36 0.0098 18.9129 0.0747 0.2602 100.00% 18944 13.0607 91.92% 2.28%
18 ffn_down 1172.48 0.0037 47.6772 0.0619 0.3838 100.00% 18944 12.8918 90.73% 1.00%
9 ffn_down 984.12 0.0029 16.6916 0.0519 0.1486 100.00% 18944 13.4853 94.90% 1.73%
17 ffn_down 937.13 0.0120 47.2292 0.0495 0.3552 100.00% 18944 13.1493 92.54% 0.52%
7 ffn_down 741.61 0.0056 5.7790 0.0391 0.0622 100.00% 18944 13.7068 96.46% 4.46%
8 ffn_down 733.18 0.0076 10.2930 0.0387 0.0886 100.00% 18944 13.7211 96.56% 2.00%
15 ffn_down 711.79 0.0076 13.4870 0.0376 0.1184 100.00% 18944 13.4602 94.73% 1.81%
16 ffn_down 711.00 0.0110 8.7637 0.0375 0.0839 100.00% 18944 13.6264 95.90% 2.70%
6 ffn_down 693.73 0.0018 3.3237 0.0366 0.0686 100.00% 18944 13.4328 94.53% 4.30%
14 ffn_down 674.16 0.0091 4.7583 0.0356 0.0729 100.00% 18944 13.5277 95.20% 3.30%
12 ffn_down 628.72 0.0093 11.2445 0.0332 0.1058 100.00% 18944 13.4942 94.97% 1.56%
10 ffn_down 628.51 0.0083 6.9205 0.0332 0.0651 100.00% 18944 13.7703 96.91% 2.26%
13 ffn_down 623.54 0.0070 14.6682 0.0329 0.1219 100.00% 18944 13.4610 94.73% 1.36%
5 ffn_down 425.43 0.0001 65.9802 0.0225 0.4873 100.00% 18944 11.1274 78.31% 0.18%
2 ffn_down 362.44 0.0000 1.6931 0.0191 0.0493 83.49% 18944 12.4262 87.45% 6.37%
1 ffn_down 161.42 0.0000 1.9775 0.0085 0.0446 61.76% 18944 10.9874 77.32% 2.76%
0 ffn_down 93.17 0.0000 1.3730 0.0049 0.0183 100.00% 18944 12.3459 86.88% 3.40%
27 ffn_gate 8203.51 0.0000 728.1832 2.2889 15.0930 99.97% 3584 10.3009 87.24% 0.70%
1 ffn_gate 7649.28 0.0000 4250.2856 2.1343 73.6208 100.00% 3584 3.2319 27.37% 0.22%
5 ffn_gate 5793.46 0.2630 1696.2799 1.6165 30.9683 100.00% 3584 6.4787 54.87% 0.39%
26 ffn_gate 4977.79 0.0001 346.2318 1.3889 7.1514 100.00% 3584 10.3352 87.53% 1.03%
3 ffn_gate 4928.84 0.1158 1178.4656 1.3752 24.0211 100.00% 3584 6.1368 51.97% 0.36%
25 ffn_gate 4345.41 0.0000 391.9680 1.2124 7.5277 100.00% 3584 10.3049 87.28% 0.78%
2 ffn_gate 4145.53 0.0000 1567.8757 1.1567 28.9319 99.97% 3584 4.7073 39.87% 0.28%
4 ffn_gate 3605.02 0.0000 501.6380 1.0059 13.3321 100.00% 3584 7.2867 61.71% 0.45%
24 ffn_gate 3309.81 0.0000 221.9663 0.9235 5.1778 100.00% 3584 10.4013 88.09% 0.92%
23 ffn_gate 2978.69 0.0000 253.4090 0.8311 4.8293 100.00% 3584 10.3654 87.79% 0.73%
22 ffn_gate 2140.05 0.0000 152.6495 0.5971 3.2064 99.97% 3584 10.3133 87.35% 0.78%
9 ffn_gate 1605.21 0.0000 138.4068 0.4479 2.8957 100.00% 3584 10.2616 86.91% 0.45%
21 ffn_gate 1491.98 0.0000 89.1156 0.4163 1.9106 100.00% 3584 10.4835 88.79% 1.00%
20 ffn_gate 1104.55 0.0000 61.6396 0.3082 1.3331 100.00% 3584 10.6024 89.79% 1.23%
19 ffn_gate 923.42 0.0000 54.8742 0.2577 1.1880 100.00% 3584 10.5703 89.52% 1.12%
6 ffn_gate 795.71 0.0000 179.4834 0.2220 3.0785 100.00% 3584 9.2320 78.19% 0.20%
18 ffn_gate 764.25 0.0000 53.1881 0.2132 1.0228 99.97% 3584 10.6846 90.49% 0.81%
17 ffn_gate 696.13 0.0000 44.8044 0.1942 0.8129 99.97% 3584 10.9804 93.00% 0.73%
10 ffn_gate 627.04 0.0000 32.8056 0.1750 0.6096 100.00% 3584 11.1592 94.51% 0.64%
8 ffn_gate 614.92 0.0000 19.9203 0.1716 0.4671 99.97% 3584 11.2903 95.62% 0.50%
16 ffn_gate 612.27 0.0000 32.4457 0.1708 0.6095 99.97% 3584 11.0999 94.01% 0.73%
14 ffn_gate 605.78 0.0000 30.1453 0.1690 0.6111 100.00% 3584 11.0724 93.78% 0.70%
15 ffn_gate 584.58 0.0000 27.9312 0.1631 0.5630 99.97% 3584 11.0423 93.52% 1.00%
7 ffn_gate 581.64 0.0000 21.0149 0.1623 0.5479 99.92% 3584 11.1018 94.02% 0.47%
13 ffn_gate 561.19 0.0000 22.6935 0.1566 0.4936 99.97% 3584 11.1464 94.40% 0.73%
11 ffn_gate 552.21 0.0000 22.1247 0.1541 0.4128 99.97% 3584 11.3085 95.78% 0.67%
12 ffn_gate 531.12 0.0000 16.9325 0.1482 0.3588 99.97% 3584 11.3057 95.75% 0.81%
0 ffn_gate 113.10 0.0000 45.3427 0.0316 0.7576 99.58% 3584 7.6704 64.96% 0.06%
27 ffn_up 8203.51 0.0000 728.1832 2.2889 15.0930 99.97% 3584 10.3009 87.24% 0.70%
1 ffn_up 7649.28 0.0000 4250.2856 2.1343 73.6208 100.00% 3584 3.2319 27.37% 0.22%
5 ffn_up 5793.46 0.2630 1696.2799 1.6165 30.9683 100.00% 3584 6.4787 54.87% 0.39%
26 ffn_up 4977.79 0.0001 346.2318 1.3889 7.1514 100.00% 3584 10.3352 87.53% 1.03%
3 ffn_up 4928.84 0.1158 1178.4656 1.3752 24.0211 100.00% 3584 6.1368 51.97% 0.36%
25 ffn_up 4345.41 0.0000 391.9680 1.2124 7.5277 100.00% 3584 10.3049 87.28% 0.78%
2 ffn_up 4145.53 0.0000 1567.8757 1.1567 28.9319 99.97% 3584 4.7073 39.87% 0.28%
4 ffn_up 3605.02 0.0000 501.6380 1.0059 13.3321 100.00% 3584 7.2867 61.71% 0.45%
24 ffn_up 3309.81 0.0000 221.9663 0.9235 5.1778 100.00% 3584 10.4013 88.09% 0.92%
23 ffn_up 2978.69 0.0000 253.4090 0.8311 4.8293 100.00% 3584 10.3654 87.79% 0.73%
22 ffn_up 2140.05 0.0000 152.6495 0.5971 3.2064 99.97% 3584 10.3133 87.35% 0.78%
9 ffn_up 1605.21 0.0000 138.4068 0.4479 2.8957 100.00% 3584 10.2616 86.91% 0.45%
21 ffn_up 1491.98 0.0000 89.1156 0.4163 1.9106 100.00% 3584 10.4835 88.79% 1.00%
20 ffn_up 1104.55 0.0000 61.6396 0.3082 1.3331 100.00% 3584 10.6024 89.79% 1.23%
19 ffn_up 923.42 0.0000 54.8742 0.2577 1.1880 100.00% 3584 10.5703 89.52% 1.12%
6 ffn_up 795.71 0.0000 179.4834 0.2220 3.0785 100.00% 3584 9.2320 78.19% 0.20%
18 ffn_up 764.25 0.0000 53.1881 0.2132 1.0228 99.97% 3584 10.6846 90.49% 0.81%
17 ffn_up 696.13 0.0000 44.8044 0.1942 0.8129 99.97% 3584 10.9804 93.00% 0.73%
10 ffn_up 627.04 0.0000 32.8056 0.1750 0.6096 100.00% 3584 11.1592 94.51% 0.64%
8 ffn_up 614.92 0.0000 19.9203 0.1716 0.4671 99.97% 3584 11.2903 95.62% 0.50%
16 ffn_up 612.27 0.0000 32.4457 0.1708 0.6095 99.97% 3584 11.0999 94.01% 0.73%
14 ffn_up 605.78 0.0000 30.1453 0.1690 0.6111 100.00% 3584 11.0724 93.78% 0.70%
15 ffn_up 584.58 0.0000 27.9312 0.1631 0.5630 99.97% 3584 11.0423 93.52% 1.00%
7 ffn_up 581.64 0.0000 21.0149 0.1623 0.5479 99.92% 3584 11.1018 94.02% 0.47%
13 ffn_up 561.19 0.0000 22.6935 0.1566 0.4936 99.97% 3584 11.1464 94.40% 0.73%
11 ffn_up 552.21 0.0000 22.1247 0.1541 0.4128 99.97% 3584 11.3085 95.78% 0.67%
12 ffn_up 531.12 0.0000 16.9325 0.1482 0.3588 99.97% 3584 11.3057 95.75% 0.81%
0 ffn_up 113.10 0.0000 45.3427 0.0316 0.7576 99.58% 3584 7.6704 64.96% 0.06%
- output 37753.27 2.9640 3264.6670 10.5338 70.4707 100.00% 3584 9.4367 79.92% 1.42%
Added cosine similarity between same-type tensors with respect to the previous layer (e.g. blk.7.attn_k vs blk.6.attn_k)
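For reference, a minimal NumPy sketch of this kind of metric, assuming `a` and `b` hold the importance scores of the same tensor type in two consecutive layers; the function and the dummy vectors are illustrative only, not the PR's actual implementation:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two importance-score vectors.

    Returns 0.0 when either vector is all zeros, to avoid division by zero.
    """
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0

# e.g. scores of blk.6.attn_k vs blk.7.attn_k (dummy data for illustration)
prev_scores = np.random.rand(4096)
curr_scores = np.random.rand(4096)
print(f"CosSim: {cosine_similarity(prev_scores, curr_scores):.4f}")
```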
Apologies for the shotgun approach @ngxson / @jukofyork / @compilade, I'm not sure what the proper process is to request a review. Happy to close or move to draft if this isn't suitable for merging
Thanks @EAddario for keeping this line of research open. One of the (too many) things I'm interested in checking out is how your stats compare across a few competitive imatrix / quant providers, e.g. per this discussion https://github.com/ikawrakow/ik_llama.cpp/discussions/359#discussioncomment-13021815, as folks are digging into the latest quantization trends, how much they differ, and how to meaningfully compare them.
Some way to visualize the results side by side would probably be easier on my brain than staring at the giant tables of stats... I'll noodle on that.
Anyway, much thanks from a fellow hacker engineer! :)
@EAddario
I vibe-coded some python/ImageMagick scripts to visualize the output of your --show-statistics and compare three imatrix files for Qwen3-30B-A3B from myself, unsloth, and bartowski.
I'm not really sure how to read it, but for the most part they show similar patterns, though with some discrepancies. They are not normalized to each other; it's just a stacked mosaic, which was the easiest way to quickly "visually diff" them.
https://gist.github.com/ubergarm/2aa9327f7b98a9b16fef62b4941c7e76
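The gist above has the full scripts; as a much smaller sketch of the same idea, something like the following could parse the per-tensor table printed by `llama-imatrix --show-statistics` (redirected to one text file per provider) and plot the Σ(Bias) column side by side. The column layout is assumed to match the reports shown in this thread, and matplotlib plus the file names are my own choices rather than what the gist actually does:

```python
# imatrix_stats.py -- rough visual diff of several --show-statistics reports
import re
import sys

import matplotlib.pyplot as plt


def load_stats(path: str) -> dict[str, float]:
    """Parse the per-tensor table of a `llama-imatrix --show-statistics` dump.

    Returns {"<layer>.<tensor>": sum_of_activations}, taking the first numeric
    column after the tensor name (Σ(Bias) in the reports shown above).
    """
    row = re.compile(r"^\s*(\S+)\s+(\S+)\s+([0-9]+\.[0-9]+)\s")
    stats = {}
    with open(path) as f:
        for line in f:
            m = row.match(line)
            if m:
                layer, tensor, bias = m.groups()
                stats[f"{layer}.{tensor}"] = float(bias)
    return stats


if __name__ == "__main__":
    # e.g. python imatrix_stats.py ubergarm.txt unsloth.txt bartowski.txt
    per_file = {path: load_stats(path) for path in sys.argv[1:]}
    keys = sorted(set().union(*per_file.values()))
    for name, stats in per_file.items():
        plt.plot([stats.get(k, 0.0) for k in keys], label=name)
    plt.xlabel("tensor (sorted by name)")
    plt.ylabel("Σ(Bias)")
    plt.legend()
    plt.show()
```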
@ubergarm, sorry for the delayed reply, it was a hectic week at work. Love the visualizations and your Reddit post btw.
The discrepancies are due to using different calibration files to generate the respective imatrices. On quick inspection, your imatrix seems to have "exercised" more weights (it has stronger/larger activations), and its mean of means is considerably larger (Bartowski: 1.94, Ubergarm: 2.31, Unsloth: 2.0)
Ignoring the basic stats (min, max, mean, std dev, etc), I find that the sum of activations (bias) is the most useful metric to select which layers to up/down quantize, as it yields the lowest PPL compared to using ZD or CosSim, or at least that's how I'm reading the tea leaves. All the models in my HF repo have now been generated that way.
For larger models (30B+) however, I'm looking at a different approach that combines layer-wise quantization with pruning. That PR is in draft while I test that it works as expected with split gguf files, but give it a try if you have some spare time.
For pruning, it's looking like CosSim would be the better way to identify layers to remove. I'll push a new version of this PR with added functionality in a few days.
Since you have a good set of test results for Qwen3-30B-A3B, I'll produce layer-wise and layer-wise+pruned versions for an apples-to-apples comparison.
@ubergarm, just finished uploading Qwen3-30B-A3B-GGUF. Summary of scores in the model card, and actual results in the scores folder.
A few things to consider:
- got a case of stubborn tensors 🙂: block 43 ffn_down_exps, ffn_gate_exps and ffn_up_exps refused to be activated by my calibration file, hence the somewhat smaller imatrix size compared to the ones used in your tests (114MB vs 116MB). I'll improve the calibration and will retry.
- for reference, the mix used to generate Q4_K_M was `--token-embedding-type q3_k --output-tensor-type q4_k --tensor-type "\.([0-9]|1[0-9]|2[0-3])\.attn_k=q3_k" --tensor-type "\.([0-9]|1[0-9]|2[0-3])\.attn_q=q3_k" --tensor-type "\.([0-9]|1[0-9]|2[0-3])\.attn_v=q4_k" --tensor-type attn_v=q5_k --tensor-type "\.([0-9]|1[0124]|1[6-9]|2[0-4]|26)\.ffn_gate_exps=q3_k" --tensor-type "\.([0-9]|1[0124]|1[6-9]|2[0-4]|26)\.ffn_up_exps=q3_k" --tensor-type ffn_down_exps=q5_k`, which corresponds to down-quantizing the blocks with the lowest Σ(Bias) per tensor type (see the sketch after this list)
- dump of the gguf structure is in the scores folder as well
- the above mix resulted in a ~8% smaller file compared to naive, with a PPL of 99.07%
- I'll upload the Q4_K_M scores for Bartowski and Unsloth when I get a chance
- for large param models (30B+), it seems the best size reduction, compared to an equivalent naive quant, is below 10%, hence exploring the pruning route. Will try to upload pruned models in the next couple of weeks
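To make the Σ(Bias)-driven selection concrete, here is a rough sketch of how a `--tensor-type` argument like the one quoted above could be derived from a saved `--show-statistics` report. It reuses the hypothetical `load_stats` parser from the earlier visualization sketch, emits a plain block alternation instead of the compact `[0-9]`-style ranges above, and the file name, tensor type and block count are placeholders rather than the exact recipe used for this upload:

```python
from imatrix_stats import load_stats  # hypothetical parser sketched earlier in this thread


def tensor_type_regex(stats: dict[str, float], tensor: str, n_blocks: int) -> str:
    """Build a llama-quantize --tensor-type block pattern covering the
    n_blocks blocks with the lowest Σ(Bias) for the given tensor type."""
    # keep only "<block>.<tensor>" rows with a numeric block index
    rows = [(int(k.split(".")[0]), v) for k, v in stats.items()
            if k.endswith("." + tensor) and k.split(".")[0].isdigit()]
    rows.sort(key=lambda kv: kv[1])                 # lowest Σ(Bias) first
    blocks = [str(b) for b in sorted(b for b, _ in rows[:n_blocks])]
    return r"\.(" + "|".join(blocks) + r")\." + tensor


stats = load_stats("Qwen3-30B-A3B-stats.txt")       # hypothetical report dump
# e.g. down-quantize the 24 ffn_gate_exps blocks with the lowest Σ(Bias) to q3_k
print(f'--tensor-type "{tensor_type_regex(stats, "ffn_gate_exps", 24)}=q3_k"')
```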
@EAddario
I've had the same issue with almost all the Qwen 30B-A3B models (and fine-tunes, though it's less of an issue with abliterated/uncensored ones); the Qwen3 16B-A3B is also much easier to work with.
To resolve it: increase the number of active experts, then run the calibration / imatrix.
This issue seems to affect Qwen3 MoEs, Llama 4 MoEs, and maybe the new Granite 4 MoE(s). It seems to stem from one or more of:
- the number of experts vs active experts
- the size of the experts (sub 1B)?
- the imatrix calibration file -> even my best ones need a lot of experts active, or a huge imatrix dataset/files
Added weighted statistics per layer (as opposed to per tensor) for Σ(Bias), ZD and CosSim.
Whereas per-tensor statistics are helpful to identify which tensors to up/down quantize when using the llama-quantize --tensor-type option, the per-layer statistics seem useful to guide pruning (PR #13037).
For example, running llama-imatrix --show-statistics on Bartowski's Qwen_Qwen3-30B-A3B.imatrix, the report will also include the following:
Computing weighted statistics per layer (48 layers)
Layer Σ(Bias) ZD CosSim
===============================================
0 6061.60 1.7410% 0.0000
1 11640.92 0.5245% 0.5026
2 28831.86 0.0572% 0.0268
3 23922.71 0.1373% 0.0164
4 24552.72 0.7992% 0.1391
5 26268.89 1.0561% 0.1228
6 31291.89 0.6889% 0.0804
7 31656.05 0.7584% 0.0716
8 31301.72 0.5691% 0.1416
9 32720.18 0.6998% 0.1249
10 32996.34 0.8920% 0.0973
11 35464.12 1.1090% 0.0915
12 36960.77 1.0656% 0.1651
13 41183.71 1.1070% 0.1192
14 42311.34 0.8827% 0.0878
15 47201.43 0.7962% 0.1710
16 48093.70 1.2141% 0.1126
17 48655.36 1.0467% 0.1230
18 61740.04 0.6507% 0.0841
19 56956.36 0.8035% 0.0679
20 53476.46 0.9887% 0.2341
21 51980.02 0.7850% 0.1135
22 57266.32 0.2736% 0.0519
23 60166.23 1.3932% 0.0503
24 61621.80 1.1692% 0.0996
25 67485.11 0.3240% 0.0499
26 62086.84 1.4648% 0.0541
27 68444.67 1.1134% 0.1000
28 72305.86 1.0958% 0.2497
29 73116.39 0.8709% 0.0691
30 85119.15 0.9191% 0.0798
31 79558.44 1.0303% 0.1371
32 76364.62 0.6333% 0.2268
33 77971.35 0.7549% 0.1053
34 91862.55 0.8468% 0.2912
35 91125.50 1.3788% 0.0777
36 93407.48 1.1753% 0.1083
37 108214.91 1.3324% 0.1090
38 114657.11 1.4874% 0.1249
39 135057.31 1.0944% 0.1257
40 157113.47 1.2505% 0.2380
41 179461.16 1.1735% 0.1400
42 209527.08 0.9848% 0.0949
43 231601.23 1.5293% 0.1263
44 253612.80 1.6118% 0.2824
45 284678.88 1.4165% 0.2286
46 315767.62 1.2535% 0.2478
47 461904.62 0.7146% 0.2353
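As an illustration of how this per-layer table could feed the pruning workflow mentioned above, the sketch below parses the three columns from a saved report and ranks layers by their CosSim with the previous layer. The parser and the file name are assumptions, not functionality included in this PR:

```python
import re


def load_layer_stats(path: str) -> list[tuple[int, float, float, float]]:
    """Parse the 'weighted statistics per layer' report into
    (layer, sum_bias, zd_percent, cos_sim) tuples."""
    row = re.compile(r"^\s*(\d+)\s+([0-9.]+)\s+([0-9.]+)%\s+([0-9.]+)\s*$")
    rows = []
    with open(path) as f:
        for line in f:
            m = row.match(line)
            if m:
                layer, bias, zd, cos = m.groups()
                rows.append((int(layer), float(bias), float(zd), float(cos)))
    return rows


layers = load_layer_stats("Qwen3-30B-A3B-layer-stats.txt")  # hypothetical report dump
# rank layers by their CosSim with the previous layer; the thread's working
# hypothesis is that this ranking helps shortlist layers to prune
for layer, bias, zd, cos in sorted(layers, key=lambda r: r[3]):
    print(f"layer {layer:3d}  CosSim {cos:.4f}  Σ(Bias) {bias:12.2f}  ZD {zd:.4f}%")
```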
Apologies for the shotgun approach @slaren / @CISC / @ggerganov, I'm not sure what the proper process is to request a review. Happy to close or move to draft if this isn't suitable for merging
@compilade, would love to see #9400 being merged into master! It will open some really interesting possibilities, like being able to store the tensors' state alongside the activations. That would allow for more powerful stats, a clean way to test different imatrices, etc. My coding skills are probably not up to scratch, but I'd be happy to lend a hand
Hi @compilade, good to go? or should I change something else?
The only thing that's bothering me is that the stats aren't calculated on the actual activations
Agree 100%, and no doubt #9400 will provide a neat way to address this. In the meantime, I'll update the README.md file over the weekend to document what the new option is actually doing and what the calculated stats really mean, and will then re-request a review
@compilade, I've updated the README file to reflect the current limitations when calculating the stats, and made a note to reimplement/improve the functionality once #9400 is merged. It's completely up to you whether to merge now or wait until #9400 is in place.
@compilade / @CISC, I have finished merging and testing. As far as I can tell, it is good to go. I don't plan to make any more changes to this PR
Thank you @CISC and @compilade