Kevin Yin


FlashAttention exploits the causal mask to do only half the work, so one of my friends got >100% MFU when using the factor of 12 rather than 7. Common options...
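For concreteness, here is roughly where the 12 and the 7 come from (my own back-of-the-envelope sketch, not code from any particular repo):

```python
# Per-layer attention matmul FLOPs for one training step (fwd + bwd),
# with sequence length T and model width d, counting a multiply-add as 2 FLOPs.
def attn_flops_per_layer(T: int, d: int, causal_aware: bool = False) -> float:
    fwd = 4 * T * T * d      # QK^T and attn @ V matmuls
    bwd = 2 * fwd            # backward is ~2x the forward
    total = fwd + bwd        # the "12 * T^2 * d" accounting
    if causal_aware:
        # FlashAttention skips the masked half of the score matrix, but its
        # backward recomputes the forward, so the hardware does ~3.5x fwd * 0.5.
        total = 3.5 * fwd * 0.5
    return total

T, d = 2048, 1536
print(attn_flops_per_layer(T, d) / (T * T * d))                      # 12.0
print(attn_flops_per_layer(T, d, causal_aware=True) / (T * T * d))   # 7.0
```

Counting 12 units of "model FLOPs" while the kernel only has to execute the ~7-factor amount is how the reported MFU can end up above 100%.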

> I checked some other mainstream repos on how MFU is computed. From what I can tell, most (if not all) of them are using 12. For example: If you...

> https://borgbackup.readthedocs.io/en/stable/changes.html#pre-1-2-5-archives-spoofing-vulnerability-cve-2023-36811

On this page, the period at the end of the `BORG_WORKAROUNDS=ignore_invalid_archive_tam` command should be removed.

```
n_layer: 12
n_head: 12
kv_heads: 6 (GQA)
hidden_dim: 1536
n_tokens: 2048 (context length)
vocab_dim: 65536
activation: "swiglu"
```

No AMP/gradscaler. If a profile would help, I can produce one.

These `trace.json` files were gigantic (multi-GB), so here are smaller versions without stack information and with only two steps: https://drive.google.com/file/d/1c2CQST_U_Qf6O1qgSXr0DKVorytoNgp5/view?usp=sharing https://drive.google.com/file/d/1NCi6frLbXfVhL0pzdmwoRzmYLtFK-qeQ/view?usp=sharing The torch.compile mode is the default (not "reduce-overhead") and my TFLOPS...
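Roughly the profiler setup used to shrink the traces (a sketch; `train_step` stands in for the real loop body):

```python
import torch
from torch.profiler import profile, schedule, ProfilerActivity

def train_step():
    ...  # placeholder for the real forward/backward/optimizer step

# No python stacks and only two active steps, so the chrome trace stays small.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=2),
    with_stack=False,
    on_trace_ready=lambda p: p.export_chrome_trace("trace.json"),
) as prof:
    for _ in range(4):   # wait + warmup + 2 active steps
        train_step()
        prof.step()
```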

Turns out the step count isn't an issue once stack information is disabled. Here are 3 steps:
Unfused: https://drive.google.com/file/d/1OKuJVC3PPK5vn6-TWdi1vPD8bAoHP7h4/view?usp=sharing
Fused: https://drive.google.com/file/d/1CUeKkwiOMEHRJaNZlC65utq6WuS7Fxh4/view?usp=sharing

No: torch.compile wraps the forward pass and loss, but the backward pass and AdamW are outside the compile. So the comparison is: torch.compile(forward+loss but not backward+AdamW) + unfused AdamW vs torch.compile(forward+loss but not backward+AdamW) +...
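In other words, something like this (a sketch with a tiny stand-in model so it is self-contained; the real runs use the GPT config above):

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(1536, 65536, device="cuda")  # stand-in for the real model

@torch.compile  # only forward + loss live inside the compiled region
def forward_and_loss(x, y):
    return F.cross_entropy(model(x), y)

def make_step(fused: bool):
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4, fused=fused)
    def step(x, y):
        loss = forward_and_loss(x, y)
        loss.backward()                   # backward and AdamW are not explicitly wrapped
        opt.step()
        opt.zero_grad(set_to_none=True)
    return step

step_unfused = make_step(fused=False)  # run A
step_fused = make_step(fused=True)     # run B
```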

My CUDA version is 12.2, and PyTorch only lists 12.1 in its nightlies. Is that still ok for me to install?

It broke when trying to run with PyTorch Nightly due to mismatched CUDA versions.

```
/home/kevin/.local/lib/python3.10/site-packages/bitsandbytes/cextension.py:31: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and...
```
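For reference, the kind of version sanity check involved here (illustrative, not output from the issue):

```python
import torch

print(torch.__version__)          # e.g. a nightly build string
print(torch.version.cuda)         # CUDA version the wheel was built against
print(torch.cuda.is_available())  # whether the runtime can actually see the GPU
```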

Here are PyTorch nightly profiles (CUDA 12.2, 2.3.0a0+3eb322f). torch.compile broke, so I turned it off for forward+loss. Performance is much closer; unfused is only a little bit faster than fused. https://drive.google.com/file/d/1k8zoSGeK7Pr5jst_MU4u6HJS6lhikUcl/view?usp=sharing...