Tianzhu Ye
Hi @DevKiHyun, you can refer to Section 3.1 and Appendix D in our paper for detailed configurations of our models. You can also directly use configs of open-source LLMs and...
Hi, our training corpus follows StableLM: https://aka.ms/StableLM-3B-4E1T. You can also use any dataset you like to train and compare Diff with the baseline Transformer; the results should be similar.
Hi @mucunxie, the basic training code for DIFF is similar to the code provided at https://aka.ms/yoco; you can make a few changes and merge the DIFF code into it...
@Adamyangs Hi, it seems you use half the head dimension for Diff. We suggest using the same head dimension but half the number of heads. For example, a Transformer has 16 heads and head...
@Adamyangs Hi, 1. We still take a Transformer with 16 heads and a head dimension of 128 as an example. It seems your setting uses a 64 head dimension for q1, k1, q2,...
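To make the head configuration concrete, here is a minimal sketch of the "same head dimension, half the number of heads" setting described above. The `d_model` value and the dict names are only illustrative, not from the original comments:

```python
# Baseline Transformer: 16 heads, head_dim 128 (16 * 128 = 2048 = d_model, assumed here).
transformer_cfg = dict(
    d_model=2048,
    num_heads=16,
    head_dim=128,
)

# DIFF: keep head_dim at 128, halve the number of heads.
# Each DIFF head holds q1/q2 and k1/k2 of head_dim 128 each,
# so the total projection width still matches the 16-head baseline.
diff_cfg = dict(
    d_model=2048,
    num_heads=8,
    head_dim=128,
)
```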
Hi @fasil-saidalavi, sorry for the late response. Do you use the same FFN implementation for both Diff and Transformer? Would you post a code snippet of the attention implementation of both...
@fasil-saidalavi Hi, in DIFF, q1k1 and q2k2 share the same value within a DIFF attention head. For example, if q1, q2, k1, k2 have a head dimension of 64, then the v should...
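For illustration, a minimal single-head sketch of the shapes described above, assuming q1/q2/k1/k2 each have head dimension 64 and both attention maps share one 128-dim value. The function name, the fixed lambda, and the d_model are placeholders, and the per-head normalization and the lambda reparameterization used in the full model are omitted:

```python
import torch
import torch.nn.functional as F

def diff_attention_head(x, wq, wk, wv, lam=0.8):
    # x: (batch, seq_len, d_model); lam is a plain float here for simplicity.
    q1, q2 = (x @ wq).chunk(2, dim=-1)   # each (batch, seq_len, 64)
    k1, k2 = (x @ wk).chunk(2, dim=-1)   # each (batch, seq_len, 64)
    v = x @ wv                           # (batch, seq_len, 128), shared by both maps

    scale = 64 ** -0.5
    a1 = F.softmax(q1 @ k1.transpose(-1, -2) * scale, dim=-1)  # (batch, N, N)
    a2 = F.softmax(q2 @ k2.transpose(-1, -2) * scale, dim=-1)  # (batch, N, N)

    # Differential attention: subtract the second map, apply the result to the shared value.
    return (a1 - lam * a2) @ v           # (batch, seq_len, 128)

# Shape check with toy tensors (d_model = 1024 assumed for illustration):
x = torch.randn(2, 16, 1024)
wq = torch.randn(1024, 128)              # projects to q1 and q2 (64 each)
wk = torch.randn(1024, 128)              # projects to k1 and k2 (64 each)
wv = torch.randn(1024, 128)              # single shared value, head_dim 128
out = diff_attention_head(x, wq, wk, wv) # (2, 16, 128)
```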
Hi @RuiWang1998, sorry for the late response. Please refer to our ablation studies in Table 6 of our paper, where in the second row we report the result of...
@RuiWang1998 Our observation is that as the number of heads and head_dim get larger, different combinations of them do not make much of an impact. Currently, the integration of fa3...
Hello, intuitively this doesn't align with the original intention of differential attention, as we aim to achieve denoising while keeping the N-dim unchanged. However, this approach might inspire new denoising...