
Ablation tests with the same headdim as v with differential transformers

Open RuiWang1998 opened this issue 7 months ago • 3 comments

As I understand it, headdim matters more than the number of heads, and the diff transformer chooses to halve the number of heads and double the vdim compared to normal transformers.

However, wouldn't it also make sense to compare against a baseline with double the headdim and half the number of heads? In this case, the baseline should be much stronger and would, hopefully, make an even stronger argument for the diff transformer.

Concretely, the authors of the paper compared the diff transformer (128x12) against a 128x24 vanilla transformer (headdim x numheads), while I would be curious about a 256x12 baseline. In practice these models should consume around the same computational resources, while the 256x12 model should give a better result.
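To make the "same computational resources" point concrete, here is a rough sketch of the count for the two vanilla baselines. It assumes a model width of 3072 (so that 128 x 24 = 256 x 12 = d_model), bias-free Q/K/V/output projections, and an illustrative sequence length of 4096; the function names are hypothetical and none of this comes from the paper's code.

```python
def attn_projection_params(d_model: int, num_heads: int, head_dim: int) -> int:
    """Weights in the Q, K, V and output projections (biases ignored)."""
    inner = num_heads * head_dim
    return 4 * d_model * inner


def attn_matmul_flops(seq_len: int, num_heads: int, head_dim: int) -> int:
    """FLOPs for QK^T and (scores @ V), summed over all heads.

    Each of the two matmuls costs 2 * seq_len^2 * head_dim per head
    (one multiply and one add per term).
    """
    per_head = 2 * (2 * seq_len * seq_len * head_dim)
    return num_heads * per_head


D_MODEL = 3072  # assumed width: 128 * 24 == 256 * 12
SEQ_LEN = 4096  # assumed, for illustration only

# name -> (num_heads, head_dim)
configs = {"128x24": (24, 128), "256x12": (12, 256)}

for name, (heads, dim) in configs.items():
    params = attn_projection_params(D_MODEL, heads, dim)
    flops = attn_matmul_flops(SEQ_LEN, heads, dim)
    print(f"{name}: projection params={params}, attention matmul FLOPs={flops}")
```

Both quantities depend only on the product num_heads x head_dim, so the two configurations match under these assumptions. The softmax itself is applied to one seq_len x seq_len score matrix per head, so the 256x12 layout does about half as much of that elementwise work.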

RuiWang1998, May 13 '25 05:05

Hi @RuiWang1998, sorry for the late response. Please refer to the ablation studies in Table 6 of our paper, where the second row reports the result of a half-heads-double-headdim Transformer. There is not much difference between it and the standard Transformer.

YTianZHU, May 30 '25 03:05

Hi @YTianZHU,

Thanks for the response! My bad, I somehow skipped that row.

If I understood correctly, this could mean that a headdim of 256 is somewhat redundant in this case?

Also, is it possible to integrate fa3 backward into flex_head_fa?

Best

RuiWang1998, Jun 04 '25 06:06

@RuiWang1998 Our observation is that as the number of heads and head_dim get larger, different combinations of them do not make much of a difference. Currently, integrating the fa3 backward pass for a large head_dim is not feasible. We will wait for updates to fa3 and try to integrate it once that becomes possible.

YTianZHU, Jun 12 '25 06:06