Tianzhu Ye
Hi @DevKiHyun, you can refer to Section 3.1 and Appendix D in our paper for detailed configurations of our models. You can also directly use configs of open-source LLMs and...
Hi, our training corpus follows StableLM: https://aka.ms/StableLM-3B-4E1T. You can also use any dataset you like to train and compare Diff with the baseline Transformer; the results should be similar.
Hi @mucunxie, the basic training code for DIFF is similar to the code provided at https://aka.ms/yoco; you can make a few changes and merge the DIFF code into it...
@Adamyangs Hi, it seems you use half the head dimension for Diff. We suggest using the same head dimension but half the number of heads. For example, a Transformer has 16 heads and head...
@Adamyangs Hi, 1. We still take a Transformer with 16 heads and a head dimension of 128 as an example. It seems your setting uses a 64 head dimension for q1, k1, q2,...
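To make the head configuration concrete, here is a minimal sketch of the "same head dimension, half the number of heads" setting described above. The `d_model` value and the dict names are only illustrative, not from the original comments:

```python
# Baseline Transformer: 16 heads, head_dim 128 (16 * 128 = 2048 = d_model, assumed here).
transformer_cfg = dict(
    d_model=2048,
    num_heads=16,
    head_dim=128,
)

# DIFF: keep head_dim at 128, halve the number of heads.
# Each DIFF head holds q1/q2 and k1/k2 of head_dim 128 each,
# so the total projection width still matches the 16-head baseline.
diff_cfg = dict(
    d_model=2048,
    num_heads=8,
    head_dim=128,
)
```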
Hi @fasil-saidalavi, sorry for the late response. Do you use the same FFN implementation for both Diff and Transformer? Would you post a code snippet of the attention implementation of both...
@fasil-saidalavi Hi, in DIFF, q1k1 and q2k2 share the same value within a DIFF attention head. For example, if q1, q2, k1, k2 have a head dimension of 64, then the v should...
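For illustration, a minimal single-head sketch of the shapes described above, assuming q1/q2/k1/k2 each have head dimension 64 and both attention maps share one 128-dim value. The function name, the fixed lambda, and the d_model are placeholders, and the per-head normalization and the lambda reparameterization used in the full model are omitted:

```python
import torch
import torch.nn.functional as F

def diff_attention_head(x, wq, wk, wv, lam=0.8):
    # x: (batch, seq_len, d_model); lam is a plain float here for simplicity.
    q1, q2 = (x @ wq).chunk(2, dim=-1)   # each (batch, seq_len, 64)
    k1, k2 = (x @ wk).chunk(2, dim=-1)   # each (batch, seq_len, 64)
    v = x @ wv                           # (batch, seq_len, 128), shared by both maps

    scale = 64 ** -0.5
    a1 = F.softmax(q1 @ k1.transpose(-1, -2) * scale, dim=-1)  # (batch, N, N)
    a2 = F.softmax(q2 @ k2.transpose(-1, -2) * scale, dim=-1)  # (batch, N, N)

    # Differential attention: subtract the second map, apply the result to the shared value.
    return (a1 - lam * a2) @ v           # (batch, seq_len, 128)

# Shape check with toy tensors (d_model = 1024 assumed for illustration):
x = torch.randn(2, 16, 1024)
wq = torch.randn(1024, 128)              # projects to q1 and q2 (64 each)
wk = torch.randn(1024, 128)              # projects to k1 and k2 (64 each)
wv = torch.randn(1024, 128)              # single shared value, head_dim 128
out = diff_attention_head(x, wq, wk, wv) # (2, 16, 128)
```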
Hi @RuiWang1998, sorry for the late response. Please refer to our ablation studies in Table 6 of our paper, where in the second row we report the result of...
@RuiWang1998 Our observation is that as the number of heads and head_dim get larger, different combinations of them do not make much of an impact. Currently, the integration of fa3...
Hello, intuitively this doesn't align with the original intention of differential attention, as we aim to achieve denoising while keeping the N-dim unchanged. However, this approach might inspire new denoising...