fasil-saidalavi

Results 3 comments of fasil-saidalavi

Hi @YTianZHU,
- I used the same FFN implementation for both the Diff Transformer and the standard Transformer.
- Attention implementation of the Diff Transformer:

    class TEDotProductDiffAttention(te.pytorch.DotProductAttention):
        cp_stream: torch.cuda.Stream = None

        def __init__(
            self,
            config: TransformerConfig,
            ...
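(The excerpt only shows the opening of the commenter's TransformerEngine subclass. For readers unfamiliar with the mechanism, the following is a minimal plain-PyTorch sketch of differential attention as described in the Diff Transformer paper, arXiv:2410.05258: two softmax attention maps are computed from split Q/K halves and their weighted difference is taken. The class name DiffAttention, the causal-mask handling, and the use of nn.RMSNorm (PyTorch >= 2.4) are illustrative assumptions, not the commenter's actual TEDotProductDiffAttention code.)

    import math
    import torch
    import torch.nn as nn

    class DiffAttention(nn.Module):
        """Sketch of differential attention: the difference of two softmax
        attention maps, weighted by a learned scalar lambda."""

        def __init__(self, embed_dim: int, num_heads: int, depth: int = 1):
            super().__init__()
            assert embed_dim % (2 * num_heads) == 0
            self.num_heads = num_heads
            self.head_dim = embed_dim // num_heads // 2  # each softmax map uses half
            self.q_proj = nn.Linear(embed_dim, embed_dim, bias=False)
            self.k_proj = nn.Linear(embed_dim, embed_dim, bias=False)
            self.v_proj = nn.Linear(embed_dim, embed_dim, bias=False)
            self.out_proj = nn.Linear(embed_dim, embed_dim, bias=False)
            # Depth-dependent lambda initialisation from the paper.
            self.lambda_init = 0.8 - 0.6 * math.exp(-0.3 * (depth - 1))
            self.lambda_q1 = nn.Parameter(torch.randn(self.head_dim) * 0.1)
            self.lambda_k1 = nn.Parameter(torch.randn(self.head_dim) * 0.1)
            self.lambda_q2 = nn.Parameter(torch.randn(self.head_dim) * 0.1)
            self.lambda_k2 = nn.Parameter(torch.randn(self.head_dim) * 0.1)
            # Head-wise normalisation of the attention output.
            self.subln = nn.RMSNorm(2 * self.head_dim, elementwise_affine=False)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            bsz, seq, _ = x.shape
            # Split Q/K into two halves per head; V keeps the full 2*head_dim.
            q = self.q_proj(x).view(bsz, seq, self.num_heads, 2 * self.head_dim).transpose(1, 2)
            k = self.k_proj(x).view(bsz, seq, self.num_heads, 2 * self.head_dim).transpose(1, 2)
            v = self.v_proj(x).view(bsz, seq, self.num_heads, 2 * self.head_dim).transpose(1, 2)
            q1, q2 = q.chunk(2, dim=-1)
            k1, k2 = k.chunk(2, dim=-1)

            scale = self.head_dim ** -0.5
            causal = torch.triu(
                torch.full((seq, seq), float("-inf"), device=x.device), diagonal=1
            )
            a1 = torch.softmax(q1 @ k1.transpose(-1, -2) * scale + causal, dim=-1)
            a2 = torch.softmax(q2 @ k2.transpose(-1, -2) * scale + causal, dim=-1)

            # lambda = exp(lq1 . lk1) - exp(lq2 . lk2) + lambda_init
            lam = (
                torch.exp(torch.dot(self.lambda_q1, self.lambda_k1))
                - torch.exp(torch.dot(self.lambda_q2, self.lambda_k2))
                + self.lambda_init
            )
            attn = a1 - lam * a2              # differential attention map
            out = attn @ v                    # (bsz, heads, seq, 2*head_dim)
            out = self.subln(out) * (1 - self.lambda_init)
            out = out.transpose(1, 2).reshape(bsz, seq, -1)
            return self.out_proj(out)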

@YTianZHU Hi, I just used this code from the official implementation and made changes so that the diff attention works in my training framework. Before training, I verified that both...
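(The excerpt is cut off before saying what was verified. A common way to do this kind of pre-training sanity check is to compare two implementations on the same input after copying shared weights between them; the helper below is a hypothetical sketch of that idea, not the commenter's actual check.)

    import torch

    @torch.no_grad()
    def outputs_match(mod_a: torch.nn.Module, mod_b: torch.nn.Module,
                      x: torch.Tensor, atol: float = 1e-5) -> bool:
        """Hypothetical sanity check: run the same input through two
        implementations and compare parameter counts and outputs.
        Assumes weights have already been copied so a match is expected."""
        n_a = sum(p.numel() for p in mod_a.parameters())
        n_b = sum(p.numel() for p in mod_b.parameters())
        print(f"param count: {n_a} vs {n_b}")
        y_a, y_b = mod_a(x), mod_b(x)
        print(f"max abs diff: {(y_a - y_b).abs().max().item():.3e}")
        return n_a == n_b and torch.allclose(y_a, y_b, atol=atol)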

@YTianZHU Hi, what precision did you use for training, especially for the 3B model with 1T tokens?