Yunsheng Ni
Same question.
I have implemented all the parts and experiments. The strange thing is that the loss of this xray is infinity; I don't know what the problem is.
This code of mine is from quite a while ago; I haven't managed to reproduce the results, and I may need to clean it up.

> On Dec 27, 2020, at 8:58 PM, ZuoyanL wrote:
>
> I have implemented all the parts and experiments, the strange thing is that the loss...
Why do you think that `FMHA_ENABLE` stands for FlashAttention?
I don't think `FMHA_ENABLE` stands for FlashAttention; it stands for `fused multi-head attention`. You can see [gpt_guide.md](https://github.com/NVIDIA/FasterTransformer/blob/main/docs/gpt_guide.md) for more information.
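For context, a minimal sketch of toggling the fused multi-head attention path, assuming `FMHA_ENABLE` is read as an environment variable at launch time as described in the linked guide (the binary name below is illustrative, not from this thread):

```shell
# Hedged sketch: turn on the fused multi-head attention kernel path
# before launching an example binary (binary name is hypothetical).
export FMHA_ENABLE=ON
./bin/multi_gpu_gpt_example
```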
During the annealing phase, is the learning-rate schedule linear or exponential? If it is exponential, what is the relationship between T and N?
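To make the question concrete, here is a small sketch of the two candidate schedules, assuming the rate decays from `lr_max` to `lr_min` over `T` annealing steps (all names and values are illustrative, not taken from the paper):

```python
def linear_anneal(step, T, lr_max, lr_min):
    # Linear: the LR drops by a constant amount per step.
    return lr_max - (lr_max - lr_min) * (step / T)

def exponential_anneal(step, T, lr_max, lr_min):
    # Exponential: the LR is scaled by a constant factor per step,
    # reaching lr_min exactly at step T.
    return lr_max * (lr_min / lr_max) ** (step / T)

# Both schedules start at lr_max and end at lr_min; in between,
# the exponential schedule sits below the linear one.
for step in (0, 50, 100):
    print(step,
          linear_anneal(step, 100, 1e-3, 1e-5),
          exponential_anneal(step, 100, 1e-3, 1e-5))
```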
Same question.
> I published some at https://huggingface.co/datasets/malaysia-ai/Flash-Attention3-wheel
>
> ## Flash-Attention3-wheel
>
> Flash Attention 3 wheels on commit [0e60e39473e8df549a20fb5353760f7a65b30e2d](https://github.com/Dao-AILab/flash-attention/commit/0e60e39473e8df549a20fb5353760f7a65b30e2d).
>
> ### Build using H100
>
> For PyTorch 2.6.0 12.6, 2.7.0...