Is there any plain PyTorch-based Differential Transformer code?
Hi,
I'm looking into the Differential Transformer paper and code, and I found that the GitHub version is based on FlashAttention and rotary embeddings.
Is there any plan to upload a simple example of a transformer using Diff attention, together with example arguments (e.g., how to adjust num_heads relative to the original transformer's, or how to use other positional embeddings)?
Thanks
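
For reference, here is a minimal sketch of differential attention in plain PyTorch, written from Eq. (1)-(2) of the paper, with no FlashAttention and no rotary embeddings. The module layout, the 0-based `layer_idx` convention, and the non-learnable per-head RMS normalization are my simplifications, not the official implementation (which uses GroupNorm with a learnable scale):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffAttention(nn.Module):
    """Single differential-attention layer in plain PyTorch.

    Sketch written from Eq. (1)-(2) of the Diff Transformer paper;
    names and defaults are illustrative, not the official code.
    """

    def __init__(self, d_model: int, num_heads: int, layer_idx: int = 0):
        super().__init__()
        assert d_model % (2 * num_heads) == 0
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads // 2  # Q/K split into two groups
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)
        # lambda_init = 0.8 - 0.6 * exp(-0.3 * l), using a 0-based layer index
        self.lambda_init = 0.8 - 0.6 * math.exp(-0.3 * layer_idx)
        # Reparameterized lambda (Eq. 2): exp(lq1.lk1) - exp(lq2.lk2) + lambda_init
        self.lambda_q1 = nn.Parameter(torch.randn(self.head_dim) * 0.1)
        self.lambda_k1 = nn.Parameter(torch.randn(self.head_dim) * 0.1)
        self.lambda_q2 = nn.Parameter(torch.randn(self.head_dim) * 0.1)
        self.lambda_k2 = nn.Parameter(torch.randn(self.head_dim) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        h, d = self.num_heads, self.head_dim
        q = self.q_proj(x).view(b, t, h, 2, d)
        k = self.k_proj(x).view(b, t, h, 2, d)
        v = self.v_proj(x).view(b, t, h, 2 * d)
        q1, q2 = q.unbind(dim=3)
        k1, k2 = k.unbind(dim=3)

        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), 1)

        def softmax_attn(qh, kh):
            scores = torch.einsum("bthd,bshd->bhts", qh, kh) / math.sqrt(d)
            return F.softmax(scores.masked_fill(causal, float("-inf")), dim=-1)

        lam = (torch.exp(self.lambda_q1 @ self.lambda_k1)
               - torch.exp(self.lambda_q2 @ self.lambda_k2)
               + self.lambda_init)
        # Differential attention map: difference of two softmax maps (Eq. 1)
        a = softmax_attn(q1, k1) - lam * softmax_attn(q2, k2)
        out = torch.einsum("bhts,bshe->bthe", a, v)
        # Per-head normalization; plain RMS stands in for the paper's GroupNorm
        out = out * torch.rsqrt(out.pow(2).mean(-1, keepdim=True) + 1e-5)
        out = out * (1.0 - self.lambda_init)
        return self.out_proj(out.reshape(b, t, h * 2 * d))
```

For example, `DiffAttention(d_model=512, num_heads=4)(torch.randn(2, 16, 512))` returns a `(2, 16, 512)` tensor.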
I found several implementations on GitHub by searching for Differential Transformer, and I'm looking for an implementation with a static kv_cache and torch.compile for faster inference.
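
In case it helps, a "static kv_cache" in plain PyTorch usually means pre-allocating fixed-size key/value buffers and writing new entries in place at the current position, so tensor shapes stay constant across decode steps and torch.compile does not recompile. A minimal sketch under those assumptions (single layer; the class and method names are hypothetical):

```python
import torch

class StaticKVCache:
    """Pre-allocated KV buffers with fixed shapes, friendly to torch.compile.

    Illustrative sketch only: one layer, layout (batch, heads, max_len, head_dim).
    """

    def __init__(self, batch, heads, max_len, head_dim,
                 device="cpu", dtype=torch.float32):
        self.k = torch.zeros(batch, heads, max_len, head_dim,
                             device=device, dtype=dtype)
        self.v = torch.zeros(batch, heads, max_len, head_dim,
                             device=device, dtype=dtype)

    def update(self, pos: torch.Tensor, k_new: torch.Tensor, v_new: torch.Tensor):
        # In-place scatter along the sequence dim keeps buffer shapes static.
        self.k.index_copy_(2, pos, k_new)
        self.v.index_copy_(2, pos, v_new)
        return self.k, self.v

cache = StaticKVCache(batch=1, heads=8, max_len=2048, head_dim=64)
pos = torch.tensor([0])  # write position for the current token
k, v = cache.update(pos, torch.randn(1, 8, 1, 64), torch.randn(1, 8, 1, 64))
```

A decode step that reads from this cache can then be wrapped with `torch.compile(step, mode="reduce-overhead")`, since all tensor shapes stay fixed across steps.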
Hi @AnticPan,
Could you share your findings?
Thanks.
Hi @DevKiHyun, you can refer to Section 3.1 and Appendix D in our paper for detailed configurations of our models. You can also directly use the configs of open-sourced LLMs and change their model code to turn them into the Diff architecture.
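
For anyone converting an existing config, here is my reading of Section 3.1 (an assumption on my part, not an official recipe): each Diff head consumes two Q/K sub-projections, so the paper keeps the per-head dimension matched to the baseline Transformer and halves the number of heads, keeping the parameter count comparable:

```python
# Hypothetical conversion of a baseline decoder config to the Diff arch,
# following Section 3.1: halve num_heads, keep head_dim, so
# d_model == num_heads * 2 * head_dim still holds.
baseline = {"d_model": 2048, "num_heads": 16, "head_dim": 128}
diff_cfg = dict(baseline, num_heads=baseline["num_heads"] // 2)
assert diff_cfg["d_model"] == diff_cfg["num_heads"] * 2 * diff_cfg["head_dim"]
```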