Kevin Yin


FlashAttention exploits the causal mask to do only half the work, so one of my friends got >100% MFU when using the factor of 12 rather than 7. Common options...
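For concreteness, here is roughly where the 12 and the 7 come from (my own back-of-the-envelope sketch, not code from any particular repo):

```python
# Per-layer attention matmul FLOPs for one training step (fwd + bwd),
# with sequence length T and model width d, counting a multiply-add as 2 FLOPs.
def attn_flops_per_layer(T: int, d: int, causal_aware: bool = False) -> float:
    fwd = 4 * T * T * d      # QK^T and attn @ V matmuls
    bwd = 2 * fwd            # backward is ~2x the forward
    total = fwd + bwd        # the "12 * T^2 * d" accounting
    if causal_aware:
        # FlashAttention skips the masked half of the score matrix, but its
        # backward recomputes the forward, so the hardware does ~3.5x fwd * 0.5.
        total = 3.5 * fwd * 0.5
    return total

T, d = 2048, 1536
print(attn_flops_per_layer(T, d) / (T * T * d))                      # 12.0
print(attn_flops_per_layer(T, d, causal_aware=True) / (T * T * d))   # 7.0
```

Counting 12 units of "model FLOPs" while the kernel only has to execute the ~7-factor amount is how the reported MFU can end up above 100%.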

> I checked some other mainstream repos on how MFU is computed. From what I can tell, most (if not all) of them are using 12. For example: If you...

> https://borgbackup.readthedocs.io/en/stable/changes.html#pre-1-2-5-archives-spoofing-vulnerability-cve-2023-36811

On this page, the period at the end of the `BORG_WORKAROUNDS=ignore_invalid_archive_tam` command should be removed.

```
n_layer: 12
n_head: 12
kv_heads: 6 (GQA)
hidden_dim: 1536
n_tokens: 2048 (context length)
vocab_dim: 65536
activation: "swiglu"
```

No AMP/gradscaler. If a profile would help, I can produce one.

These `trace.json` files were gigantic (multi-GB), so here are smaller versions without stack information and with only two steps: https://drive.google.com/file/d/1c2CQST_U_Qf6O1qgSXr0DKVorytoNgp5/view?usp=sharing https://drive.google.com/file/d/1NCi6frLbXfVhL0pzdmwoRzmYLtFK-qeQ/view?usp=sharing The torch.compile mode is the default (not "reduce-overhead") and my TFLOPS...
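Roughly the profiler setup used to shrink the traces (a sketch; `train_step` stands in for the real loop body):

```python
import torch
from torch.profiler import profile, schedule, ProfilerActivity

def train_step():
    ...  # placeholder for the real forward/backward/optimizer step

# No python stacks and only two active steps, so the chrome trace stays small.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=2),
    with_stack=False,
    on_trace_ready=lambda p: p.export_chrome_trace("trace.json"),
) as prof:
    for _ in range(4):   # wait + warmup + 2 active steps
        train_step()
        prof.step()
```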

Turns out the step count isn't an issue once stack information is disabled. Here are 3 steps:
Unfused: https://drive.google.com/file/d/1OKuJVC3PPK5vn6-TWdi1vPD8bAoHP7h4/view?usp=sharing
Fused: https://drive.google.com/file/d/1CUeKkwiOMEHRJaNZlC65utq6WuS7Fxh4/view?usp=sharing

No: torch.compile wraps the forward pass and loss, but the backward pass and AdamW are outside the compile. So the comparison is: torch.compile(forward+loss but not backward+AdamW) + unfused AdamW vs torch.compile(forward+loss but not backward+AdamW) +...
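In other words, something like this (a sketch with a tiny stand-in model so it is self-contained; the real runs use the GPT config above):

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(1536, 65536, device="cuda")  # stand-in for the real model

@torch.compile  # only forward + loss live inside the compiled region
def forward_and_loss(x, y):
    return F.cross_entropy(model(x), y)

def make_step(fused: bool):
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4, fused=fused)
    def step(x, y):
        loss = forward_and_loss(x, y)
        loss.backward()                   # backward and AdamW are not explicitly wrapped
        opt.step()
        opt.zero_grad(set_to_none=True)
    return step

step_unfused = make_step(fused=False)  # run A
step_fused = make_step(fused=True)     # run B
```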

My CUDA version is 12.2, and PyTorch only lists 12.1 in its nightlies. Is that still ok for me to install?

It broke when trying to run with PyTorch Nightly due to mismatched CUDA versions.

```
/home/kevin/.local/lib/python3.10/site-packages/bitsandbytes/cextension.py:31: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and...
```
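For reference, the kind of version sanity check involved here (illustrative, not output from the issue):

```python
import torch

print(torch.__version__)          # e.g. a nightly build string
print(torch.version.cuda)         # CUDA version the wheel was built against
print(torch.cuda.is_available())  # whether the runtime can actually see the GPU
```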

Here are PyTorch nightly profiles (CUDA 12.2, 2.3.0a0+3eb322f). torch.compile broke, so I turned it off for forward+loss. Performance is much closer; unfused is only a little bit faster than fused. https://drive.google.com/file/d/1k8zoSGeK7Pr5jst_MU4u6HJS6lhikUcl/view?usp=sharing...