mergekit
Idea: Downscaling the K and/or Q matrices for repeated layers in franken-merges?
Has anyone tried downscaling the K and/or Q projection matrices for repeated layers in franken-merges? Since the attention logits are QKᵀ / √d, scaling Q or K by a factor s < 1 scales the logits by s, which should act like raising the softmax temperature and effectively smooth the attention distribution (a quick numerical check follows the references below):
Hopfield Networks is All You Need https://arxiv.org/abs/2008.02217 https://ml-jku.github.io/hopfield-layers/
- The paper and blog post have a lot of interesting discussion about the effect of beta (the inverse temperature), metastable states, and multiple update steps (in the paper's appendix); see the sketch just below.
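
For intuition, here's a minimal PyTorch sketch (not mergekit code; the dimensions, seed, and beta values are arbitrary) of the paper's retrieval update ξ_new = X·softmax(β·Xᵀξ), showing how β controls whether a query is pulled toward a single stored pattern or a metastable blend of patterns:

```python
# Modern Hopfield retrieval update: xi_new = X @ softmax(beta * X.T @ xi).
# Low beta blends stored patterns into a metastable average; high beta
# sharpens retrieval toward the single nearest pattern.
import torch

torch.manual_seed(0)
d, n = 64, 8
X = torch.randn(d, n)                 # n stored patterns as columns
xi = X[:, 0] + 0.1 * torch.randn(d)   # noisy query near pattern 0

cos = torch.nn.functional.cosine_similarity
for beta in (0.01, 1.0, 100.0):
    p = torch.softmax(beta * (X.T @ xi), dim=0)  # attention over patterns
    xi_new = X @ p
    print(f"beta={beta:>6}: sim(target)={cos(xi_new, X[:, 0], dim=0):.3f}  "
          f"sim(mean pattern)={cos(xi_new, X.mean(dim=1), dim=0):.3f}")
```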
Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models https://arxiv.org/abs/2310.17086
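
Here's the numerical check mentioned above: a toy PyTorch sketch (shapes and the scale factor s = 0.5 are arbitrary) confirming that scaling Q by s < 1 is identical to raising the softmax temperature to 1/s, and that it raises the entropy of (i.e. smooths) the attention distribution:

```python
# Scaling Q (or K) by s rescales the attention logits, since
# (sQ)K^T / sqrt(d) = s * (QK^T / sqrt(d)) -- i.e. temperature 1/s.
import math
import torch

torch.manual_seed(0)
seq, d = 10, 64
Q, K = torch.randn(seq, d), torch.randn(seq, d)
s = 0.5  # hypothetical downscaling factor for a repeated layer

scaled_q = torch.softmax((s * Q) @ K.T / math.sqrt(d), dim=-1)
temp = torch.softmax((Q @ K.T / math.sqrt(d)) * s, dim=-1)
assert torch.allclose(scaled_q, temp)  # same distribution either way

# s < 1 flattens the distribution (higher mean entropy per row):
base = torch.softmax(Q @ K.T / math.sqrt(d), dim=-1)
entropy = lambda p: -(p * p.log()).sum(-1).mean()
print(f"entropy: scaled={entropy(scaled_q):.3f}  baseline={entropy(base):.3f}")
```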
Empirically I've found repeating large blocks does seem to make models "confidently wrong": stacking two full copies of deepseek-coder or miqu-1 shows this phenomenon really well.
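
For anyone who wants to experiment, here's a rough post-hoc sketch, assuming a Llama-style checkpoint layout (`model.model.layers[i].self_attn.q_proj`); the layer range, scale factor, and paths are hypothetical placeholders, and this is hand-rolled rather than a mergekit feature:

```python
# Post-hoc experiment: after a passthrough franken-merge, downscale the
# Q projections of the repeated layers to smooth their attention.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("path/to/franken-merge")

DUPLICATED_LAYERS = range(40, 60)  # hypothetical: where the repeated block landed
SCALE = 0.7                        # s < 1 => higher effective softmax temperature

with torch.no_grad():
    for i in DUPLICATED_LAYERS:
        attn = model.model.layers[i].self_attn  # Llama-style module path
        attn.q_proj.weight.mul_(SCALE)          # logits now scaled by s

model.save_pretrained("path/to/franken-merge-qscaled")
```

Note that scaling only Q (not both Q and K) keeps the logit scaling linear in s; scaling both by s would scale the logits by s².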