Mayank Mishra
aah, yikes @ani300 I had started working on the same thing https://github.com/Dao-AILab/flash-attention/pull/1145 😓 I'll let you handle this 😃
@GLivshits I don't think it can be handled in older versions of torch
@tridao @ani300 are there any progress updates on this? It's a pretty neat feature to have Flash Attention fully end-to-end traceable natively.
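For context, a minimal sketch of what native traceability would enable, assuming the standard `flash_attn_func` interface from the `flash_attn` package (shapes and hyperparameters here are illustrative, not from this thread):

```python
import torch
from flash_attn import flash_attn_func

# fullgraph=True only succeeds if every op in the function is traceable,
# which is exactly what end-to-end traceability of flash-attn would provide.
@torch.compile(fullgraph=True)
def attention_block(q, k, v):
    return flash_attn_func(q, k, v, causal=True)

# (batch, seqlen, nheads, headdim), bf16 on GPU — placeholder sizes
q = k = v = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.bfloat16)
out = attention_block(q, k, v)
```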
Hey, this is expected behaviour. FSDP-1 only allows accumulation in 16-bit precision. This is not the case for FSDP-2, which allows accumulation in both 16-bit and 32-bit precision.
documentation for FSDP-1:
documentation for FSDP-2:
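For reference, a minimal sketch of where the accumulation/reduction precision is selected in FSDP-2, assuming the `fully_shard` / `MixedPrecisionPolicy` API from recent PyTorch releases (the exact import path varies by version; the module below is a placeholder):

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import fully_shard, MixedPrecisionPolicy

model = nn.TransformerEncoderLayer(d_model=1024, nhead=16)  # placeholder module

# FSDP-2: parameters are cast to bf16 for compute, while gradients can be
# reduced and accumulated in fp32, since reduce_dtype is independent of param_dtype.
policy = MixedPrecisionPolicy(
    param_dtype=torch.bfloat16,   # 16-bit compute
    reduce_dtype=torch.float32,   # 32-bit gradient reduction/accumulation
)
fully_shard(model, mp_policy=policy)
```

Setting `reduce_dtype=torch.bfloat16` instead gives the 16-bit accumulation behaviour that FSDP-1 is limited to.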
this project is not really maintained anymore; I suggest looking at other alternatives
Aah, here we go. Is Flash Attention merged into the original repo? I saw Tri Dao had opened a PR.