flash-linear-attention
[RFC] Use each model's official initialization instead of a unified initialization
Proposal
Use each model's official initialization instead of a unified initialization
Rationale
Related issues: https://github.com/fla-org/flash-linear-attention/issues/220 and https://github.com/fla-org/flash-linear-attention/issues/266
Consider restructuring the code: introduce base classes FLAModel and FLAForCausalLM that provide the default initialization and forward pass. Downstream models inherit from these base classes and override these methods only when necessary, as in the sketch below.
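A minimal sketch of what this structure could look like, assuming a Hugging Face-style `PreTrainedModel` hierarchy. The class names (`FLAPreTrainedModel`, `FLAModel`, `FLAForCausalLM`, `RWKV7Model`) and the specific init rules shown are illustrative assumptions, not the actual fla codebase:

```python
import torch.nn as nn
from transformers import PreTrainedModel


class FLAPreTrainedModel(PreTrainedModel):
    """Shared base: provides the default (unified) weight initialization."""

    def _init_weights(self, module):
        # Default initialization, used by any model that does not override it.
        if isinstance(module, nn.Linear):
            nn.init.normal_(module.weight, mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=self.config.initializer_range)


class FLAModel(FLAPreTrainedModel):
    """Base backbone: provides the default forward; downstream models inherit it."""
    ...


class FLAForCausalLM(FLAPreTrainedModel):
    """Base causal-LM wrapper: provides the default forward and LM head plumbing."""
    ...


class RWKV7Model(FLAModel):
    """Downstream model that overrides _init_weights with its official scheme."""

    def _init_weights(self, module):
        # Hypothetical: dispatch to the RWKV-7 reference initialization here
        # instead of the unified default (the exact rules live in the official repo).
        if isinstance(module, nn.Linear):
            nn.init.orthogonal_(module.weight)  # placeholder for the official rule
            if module.bias is not None:
                nn.init.zeros_(module.bias)
        else:
            super()._init_weights(module)
```

With this layout, most models get the default behavior for free, and only models whose papers prescribe a specific initialization (e.g. RWKV-7) need to carry an override.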
RWKV7 has already been changed to use its official initialization.