mergekit
Idea: Downscaling the K and/or Q matrices for repeated layers in franken-merges?
Has anyone tried downscaling the K and/or Q projection matrices for repeated layers in franken-merges? Since the attention logits are QKᵀ / √d, scaling Q or K by a factor s < 1 scales the logits by s, which should act like raising the softmax temperature and effectively smooth the attention distribution (a quick numerical check follows the references below):
Hopfield Networks is All You Need https://arxiv.org/abs/2008.02217 https://ml-jku.github.io/hopfield-layers/
- The paper and blog post have a lot of interesting discussion about the effect of beta (the inverse temperature), metastable states, and multiple update steps (in the paper's appendix); see the sketch just below.
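
For intuition, here's a minimal PyTorch sketch (not mergekit code; the dimensions, seed, and beta values are arbitrary) of the paper's retrieval update ξ_new = X·softmax(β·Xᵀξ), showing how β controls whether a query is pulled toward a single stored pattern or a metastable blend of patterns:

```python
# Modern Hopfield retrieval update: xi_new = X @ softmax(beta * X.T @ xi).
# Low beta blends stored patterns into a metastable average; high beta
# sharpens retrieval toward the single nearest pattern.
import torch

torch.manual_seed(0)
d, n = 64, 8
X = torch.randn(d, n)                 # n stored patterns as columns
xi = X[:, 0] + 0.1 * torch.randn(d)   # noisy query near pattern 0

cos = torch.nn.functional.cosine_similarity
for beta in (0.01, 1.0, 100.0):
    p = torch.softmax(beta * (X.T @ xi), dim=0)  # attention over patterns
    xi_new = X @ p
    print(f"beta={beta:>6}: sim(target)={cos(xi_new, X[:, 0], dim=0):.3f}  "
          f"sim(mean pattern)={cos(xi_new, X.mean(dim=1), dim=0):.3f}")
```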
Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models https://arxiv.org/abs/2310.17086
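
Here's the numerical check mentioned above: a toy PyTorch sketch (shapes and the scale factor s = 0.5 are arbitrary) confirming that scaling Q by s < 1 is identical to raising the softmax temperature to 1/s, and that it raises the entropy of (i.e. smooths) the attention distribution:

```python
# Scaling Q (or K) by s rescales the attention logits, since
# (sQ)K^T / sqrt(d) = s * (QK^T / sqrt(d)) -- i.e. temperature 1/s.
import math
import torch

torch.manual_seed(0)
seq, d = 10, 64
Q, K = torch.randn(seq, d), torch.randn(seq, d)
s = 0.5  # hypothetical downscaling factor for a repeated layer

scaled_q = torch.softmax((s * Q) @ K.T / math.sqrt(d), dim=-1)
temp = torch.softmax((Q @ K.T / math.sqrt(d)) * s, dim=-1)
assert torch.allclose(scaled_q, temp)  # same distribution either way

# s < 1 flattens the distribution (higher mean entropy per row):
base = torch.softmax(Q @ K.T / math.sqrt(d), dim=-1)
entropy = lambda p: -(p * p.log()).sum(-1).mean()
print(f"entropy: scaled={entropy(scaled_q):.3f}  baseline={entropy(base):.3f}")
```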
Empirically I've found repeating large blocks does seem to make models "confidently wrong": stacking two full copies of deepseek-coder or miqu-1 shows this phenomenon really well.
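
For anyone who wants to experiment, here's a rough post-hoc sketch, assuming a Llama-style checkpoint layout (`model.model.layers[i].self_attn.q_proj`); the layer range, scale factor, and paths are hypothetical placeholders, and this is hand-rolled rather than a mergekit feature:

```python
# Post-hoc experiment: after a passthrough franken-merge, downscale the
# Q projections of the repeated layers to smooth their attention.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("path/to/franken-merge")

DUPLICATED_LAYERS = range(40, 60)  # hypothetical: where the repeated block landed
SCALE = 0.7                        # s < 1 => higher effective softmax temperature

with torch.no_grad():
    for i in DUPLICATED_LAYERS:
        attn = model.model.layers[i].self_attn  # Llama-style module path
        attn.q_proj.weight.mul_(SCALE)          # logits now scaled by s

model.save_pretrained("path/to/franken-merge-qscaled")
```

Note that scaling only Q (not both Q and K) keeps the logit scaling linear in s; scaling both by s would scale the logits by s².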