grimulkan
From my limited understanding, the authors claim that NTK-alpha scaling effectively extrapolates some dimensions, unlike linear scaling, which never does. This, they say, is why it is...
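To make that contrast concrete, here is a minimal NumPy sketch of the two scalings (the head dimension, context lengths, and scale factor are illustrative, not from any particular model). Linear interpolation compresses every dimension's rotation angle by the same factor, so no angle ever exceeds what was seen in training; the NTK-alpha base correction `alpha ** (dim / (dim - 2))` barely touches the high-frequency dimensions, so at long positions their angles run past the trained range, i.e. those dimensions are extrapolated:

```python
import numpy as np

def rope_freqs(dim: int, base: float = 10000.0) -> np.ndarray:
    # Per-pair rotary frequencies: theta_i = base^(-2i/dim)
    return base ** (-np.arange(0, dim, 2) / dim)

def linear_scaled_angles(pos: int, dim: int, scale: float) -> np.ndarray:
    # Linear (position) interpolation: every dimension's angle is
    # compressed by the same factor -- pure interpolation, no angle
    # ever exceeds the range covered during training.
    return (pos / scale) * rope_freqs(dim)

def ntk_scaled_angles(pos: int, dim: int, alpha: float) -> np.ndarray:
    # NTK-alpha: stretch the base instead of the positions, using the
    # alpha^(dim/(dim-2)) correction from the original NTK-aware RoPE
    # post. High-frequency (low-i) dimensions are barely changed, so
    # at long positions their angles exceed the trained range.
    base = 10000.0 * alpha ** (dim / (dim - 2))
    return pos * (base ** (-np.arange(0, dim, 2) / dim))

dim, trained_ctx, pos = 128, 2048, 8192        # illustrative values
trained_max = trained_ctx * rope_freqs(dim)    # max angles seen in training
lin = linear_scaled_angles(pos, dim, scale=4)
ntk = ntk_scaled_angles(pos, dim, alpha=4)
print("linear dims beyond trained range:", (lin > trained_max).sum())  # 0
print("NTK dims beyond trained range:  ", (ntk > trained_max).sum())  # > 0
```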
With Llama 405B there are many layers, and with ring sizes of 4 or 8 the numerical errors become catastrophic in the backward pass. The errors actually originate in the forward pass...
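For what it's worth, the forward-side accumulation is easy to reproduce in isolation. Below is a single-process toy sketch (PyTorch, my own construction, not the actual ring-attention kernel) of the online-softmax merge that ring attention performs once per ring step; running the merges in bf16 against an fp64 reference shows the rounding error that more ring steps tend to introduce, before the backward pass ever amplifies it:

```python
import torch

def online_softmax_attn(q, k, v, ring_size, dtype):
    # One query block attending to key/value chunks one at a time,
    # merging running (max, denominator, output) statistics after
    # each chunk -- the same merge ring attention does per ring step.
    q = q.to(dtype)
    m = torch.full((q.shape[0],), float("-inf"), dtype=dtype)
    l = torch.zeros(q.shape[0], dtype=dtype)
    o = torch.zeros(q.shape[0], v.shape[1], dtype=dtype)
    for kc, vc in zip(k.to(dtype).chunk(ring_size), v.to(dtype).chunk(ring_size)):
        s = q @ kc.T                        # block of attention scores
        m_new = torch.maximum(m, s.max(dim=1).values)
        scale = (m - m_new).exp()           # rescale the old statistics
        p = (s - m_new[:, None]).exp()
        l = l * scale + p.sum(dim=1)
        o = o * scale[:, None] + p @ vc
        m = m_new
    return o / l[:, None]

torch.manual_seed(0)
q, k, v = (torch.randn(64, 64, dtype=torch.float64) for _ in range(3))
ref = online_softmax_attn(q, k, v, 1, torch.float64)  # exact reference
for ring_size in (1, 4, 8):
    out = online_softmax_attn(q, k, v, ring_size, torch.bfloat16)
    err = (out.double() - ref).abs().max().item()
    print(f"ring_size={ring_size}: max forward error {err:.3e}")
```

Stacking many layers on top of a forward error like this is presumably what makes the backward pass blow up at scale.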
Any interest in re-opening, now that we have DS-R1?
https://github.com/ggml-org/llama.cpp/issues/7343 is what is going on here, I think.