candle
Extract RotaryEmbedding code for reuse across models.
Most models use identical or almost identical copies of RotaryEmbedding, differing only in details such as cfg.rope_theta vs a hardcoded 10000, rope_theta being f32 or f64, and chunk() vs two calls to narrow(). A few others (mixformer, phi, chatglm) have somewhat more different implementations.
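To illustrate what a shared helper might look like, here is a minimal sketch using plain `Vec<f32>` rather than candle tensors; the names (`RotaryEmbedding`, `rope_theta`, `apply`) and the exact shape of the API are assumptions, not the actual extracted code. It shows the two points of variation mentioned above: taking `rope_theta` from the config instead of hardcoding 10000, and splitting the input into two halves (the `chunk()` vs two `narrow()` calls in the tensor versions).

```rust
// Hypothetical shared rotary-embedding helper, sketched with plain
// Vec<f32>; in candle this would operate on Tensors instead.
struct RotaryEmbedding {
    cos: Vec<Vec<f32>>, // [max_seq_len][head_dim / 2]
    sin: Vec<Vec<f32>>,
}

impl RotaryEmbedding {
    // rope_theta comes from the model config (cfg.rope_theta) rather
    // than being hardcoded to 10000.
    fn new(head_dim: usize, max_seq_len: usize, rope_theta: f32) -> Self {
        // Inverse frequencies for each pair of dimensions.
        let inv_freq: Vec<f32> = (0..head_dim / 2)
            .map(|i| 1.0 / rope_theta.powf(2.0 * i as f32 / head_dim as f32))
            .collect();
        let mut cos = Vec::with_capacity(max_seq_len);
        let mut sin = Vec::with_capacity(max_seq_len);
        for pos in 0..max_seq_len {
            cos.push(inv_freq.iter().map(|f| (pos as f32 * f).cos()).collect());
            sin.push(inv_freq.iter().map(|f| (pos as f32 * f).sin()).collect());
        }
        Self { cos, sin }
    }

    // Rotate-half application: split x into two halves (what the tensor
    // versions do with chunk() or two narrow() calls) and rotate pairwise.
    fn apply(&self, x: &[f32], pos: usize) -> Vec<f32> {
        let half = x.len() / 2;
        let (x1, x2) = x.split_at(half);
        let (cos, sin) = (&self.cos[pos], &self.sin[pos]);
        let mut out = Vec::with_capacity(x.len());
        for i in 0..half {
            out.push(x1[i] * cos[i] - x2[i] * sin[i]);
        }
        for i in 0..half {
            out.push(x2[i] * cos[i] + x1[i] * sin[i]);
        }
        out
    }
}
```

At position 0 the rotation is the identity (cos = 1, sin = 0), which makes the sketch easy to sanity-check.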
I did not change Yi to use cfg.rope_theta: it had a hardcoded 10,000 while the config has 5,000,000, and I cannot test this larger model.
Would this make sense elsewhere, like transformers/utils?