fasil-saidalavi

Results 3 comments of fasil-saidalavi

Hi @YTianZHU,
- I used the same FFN implementation for both the Diff Transformer and the standard Transformer.
- Attention implementation of the Diff Transformer:

    class TEDotProductDiffAttention(te.pytorch.DotProductAttention):
        cp_stream: torch.cuda.Stream = None

        def __init__(
            self,
            config: TransformerConfig,
            ...
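(The excerpt only shows the opening of the commenter's TransformerEngine subclass. For readers unfamiliar with the mechanism, the following is a minimal plain-PyTorch sketch of differential attention as described in the Diff Transformer paper, arXiv:2410.05258: two softmax attention maps are computed from split Q/K halves and their weighted difference is taken. The class name DiffAttention, the causal-mask handling, and the use of nn.RMSNorm (PyTorch >= 2.4) are illustrative assumptions, not the commenter's actual TEDotProductDiffAttention code.)

    import math
    import torch
    import torch.nn as nn

    class DiffAttention(nn.Module):
        """Sketch of differential attention: the difference of two softmax
        attention maps, weighted by a learned scalar lambda."""

        def __init__(self, embed_dim: int, num_heads: int, depth: int = 1):
            super().__init__()
            assert embed_dim % (2 * num_heads) == 0
            self.num_heads = num_heads
            self.head_dim = embed_dim // num_heads // 2  # each softmax map uses half
            self.q_proj = nn.Linear(embed_dim, embed_dim, bias=False)
            self.k_proj = nn.Linear(embed_dim, embed_dim, bias=False)
            self.v_proj = nn.Linear(embed_dim, embed_dim, bias=False)
            self.out_proj = nn.Linear(embed_dim, embed_dim, bias=False)
            # Depth-dependent lambda initialisation from the paper.
            self.lambda_init = 0.8 - 0.6 * math.exp(-0.3 * (depth - 1))
            self.lambda_q1 = nn.Parameter(torch.randn(self.head_dim) * 0.1)
            self.lambda_k1 = nn.Parameter(torch.randn(self.head_dim) * 0.1)
            self.lambda_q2 = nn.Parameter(torch.randn(self.head_dim) * 0.1)
            self.lambda_k2 = nn.Parameter(torch.randn(self.head_dim) * 0.1)
            # Head-wise normalisation of the attention output.
            self.subln = nn.RMSNorm(2 * self.head_dim, elementwise_affine=False)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            bsz, seq, _ = x.shape
            # Split Q/K into two halves per head; V keeps the full 2*head_dim.
            q = self.q_proj(x).view(bsz, seq, self.num_heads, 2 * self.head_dim).transpose(1, 2)
            k = self.k_proj(x).view(bsz, seq, self.num_heads, 2 * self.head_dim).transpose(1, 2)
            v = self.v_proj(x).view(bsz, seq, self.num_heads, 2 * self.head_dim).transpose(1, 2)
            q1, q2 = q.chunk(2, dim=-1)
            k1, k2 = k.chunk(2, dim=-1)

            scale = self.head_dim ** -0.5
            causal = torch.triu(
                torch.full((seq, seq), float("-inf"), device=x.device), diagonal=1
            )
            a1 = torch.softmax(q1 @ k1.transpose(-1, -2) * scale + causal, dim=-1)
            a2 = torch.softmax(q2 @ k2.transpose(-1, -2) * scale + causal, dim=-1)

            # lambda = exp(lq1 . lk1) - exp(lq2 . lk2) + lambda_init
            lam = (
                torch.exp(torch.dot(self.lambda_q1, self.lambda_k1))
                - torch.exp(torch.dot(self.lambda_q2, self.lambda_k2))
                + self.lambda_init
            )
            attn = a1 - lam * a2              # differential attention map
            out = attn @ v                    # (bsz, heads, seq, 2*head_dim)
            out = self.subln(out) * (1 - self.lambda_init)
            out = out.transpose(1, 2).reshape(bsz, seq, -1)
            return self.out_proj(out)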

@YTianZHU Hi, I just used this code from the official implementation and made changes so that the diff attention works in my training framework. Before training, I verified that both...
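(The excerpt is cut off before saying what was verified. A common way to do this kind of pre-training sanity check is to compare two implementations on the same input after copying shared weights between them; the helper below is a hypothetical sketch of that idea, not the commenter's actual check.)

    import torch

    @torch.no_grad()
    def outputs_match(mod_a: torch.nn.Module, mod_b: torch.nn.Module,
                      x: torch.Tensor, atol: float = 1e-5) -> bool:
        """Hypothetical sanity check: run the same input through two
        implementations and compare parameter counts and outputs.
        Assumes weights have already been copied so a match is expected."""
        n_a = sum(p.numel() for p in mod_a.parameters())
        n_b = sum(p.numel() for p in mod_b.parameters())
        print(f"param count: {n_a} vs {n_b}")
        y_a, y_b = mod_a(x), mod_b(x)
        print(f"max abs diff: {(y_a - y_b).abs().max().item():.3e}")
        return n_a == n_b and torch.allclose(y_a, y_b, atol=atol)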

@YTianZHU Hi, what precision did you use for training, especially for the 3B model with 1T tokens?