Aroun Demeure
This improves performance on my local RTX 4090 from ~65ms to ~34ms (while PyTorch takes ~36ms!) **ORIGINAL**: step 1: train loss 4.406481 (took 64.890952 ms) **OPTIMISED**: step 1: train loss...
This replaces the memory allocation for activations to use compressible memory (available on Hopper and Ada Lovelace only). It improves performance from 42.1ms to 39.2ms on RTX 4090, but has...
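As a rough illustration (not the PR's actual code, which is truncated above): compressible memory cannot be requested through plain `cudaMalloc`; it goes through the CUDA driver's virtual memory management API, with `allocFlags.compressionType` set on the allocation properties. A hedged sketch, with the helper name `alloc_compressible` being illustrative:

```cuda
#include <cuda.h>

// Sketch: allocate device memory eligible for generic (L2) compression via the
// CUDA virtual memory management API. Error checking omitted for brevity.
CUdeviceptr alloc_compressible(size_t size, CUdevice device) {
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = device;
    prop.allocFlags.compressionType = CU_MEM_ALLOCATION_COMP_GENERIC;

    // sizes must be a multiple of the allocation granularity
    size_t granularity = 0;
    cuMemGetAllocationGranularity(&granularity, &prop,
                                  CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    size = (size + granularity - 1) & ~(granularity - 1);

    CUmemGenericAllocationHandle handle;
    cuMemCreate(&handle, size, &prop, 0);

    CUdeviceptr ptr;
    cuMemAddressReserve(&ptr, size, 0, 0, 0);  // reserve VA range
    cuMemMap(ptr, size, 0, handle, 0);         // map physical allocation

    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(ptr, size, &access, 1);
    return ptr;
}
```

Whether compression is actually applied can be queried beforehand via `cuDeviceGetAttribute` with `CU_DEVICE_ATTRIBUTE_GENERIC_COMPRESSION_SUPPORTED`, which is presumably why the PR is gated to Hopper and Ada Lovelace.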
This is a faster version of the cool new kernel from #117 (still /dev/cuda/ only). The biggest difference is that it is optimised for doing one row per 1024-wide block rather...
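Since the description is cut off, here is a generic hedged sketch of the "one row per 1024-wide block" pattern it refers to (a simple row reduction, not the actual kernel from #117): each block owns one row, strides its 1024 threads across the columns, then reduces within warps via shuffles and across warps via shared memory.

```cuda
// Sketch: one row per block. Each thread accumulates a strided partial sum,
// warps reduce via __shfl_down_sync, then warp results combine in shared memory.
__global__ void row_sum_kernel(float* out, const float* inp, int C) {
    const float* row = inp + (size_t)blockIdx.x * C;  // this block's row
    float sum = 0.0f;
    for (int i = threadIdx.x; i < C; i += blockDim.x) sum += row[i];

    for (int offset = 16; offset > 0; offset >>= 1)   // intra-warp reduction
        sum += __shfl_down_sync(0xFFFFFFFFu, sum, offset);

    __shared__ float warp_sums[32];
    int warp = threadIdx.x / 32, lane = threadIdx.x % 32;
    if (lane == 0) warp_sums[warp] = sum;
    __syncthreads();

    if (warp == 0) {                                  // final inter-warp reduction
        sum = (lane < blockDim.x / 32) ? warp_sums[lane] : 0.0f;
        for (int offset = 16; offset > 0; offset >>= 1)
            sum += __shfl_down_sync(0xFFFFFFFFu, sum, offset);
        if (lane == 0) out[blockIdx.x] = sum;
    }
}
// launch: row_sum_kernel<<<num_rows, 1024>>>(d_out, d_inp, C);
```

The appeal of this layout is that the row's data stays resident in one SM's registers/shared memory, avoiding the cross-block synchronisation a multi-block-per-row scheme would need.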
Refactoring and removing unused functions to reduce the number of lines of code and make everything slightly more consistent (while still having space for the code to breathe). Also updates...
It turns out that not only is cuBLASLt not able to fuse BF16 GELU (or RELU) into a BF16 matmul, it also ends up with a strange kernel that is...
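For context, this is how the GELU epilogue fusion would normally be requested from cuBLASLt; per the observation above, with BF16 inputs and outputs the library does not actually produce a properly fused kernel. A hedged host-side sketch (layout/transpose setup elided):

```cuda
#include <cublasLt.h>

// Sketch: request a GELU epilogue on a matmul descriptor. For BF16 matmuls
// (CUDA_R_16BF layouts), the PR found this path falls back to an odd,
// effectively unfused kernel rather than a true fusion.
cublasLtMatmulDesc_t desc;
cublasLtMatmulDescCreate(&desc, CUBLAS_COMPUTE_32F, CUDA_R_32F);

cublasLtEpilogue_t epilogue = CUBLASLT_EPILOGUE_GELU;
cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_EPILOGUE,
                               &epilogue, sizeof(epilogue));
// ... create CUDA_R_16BF matrix layouts, pick an algo with
//     cublasLtMatmulAlgoGetHeuristic, then call cublasLtMatmul as usual ...
```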
These are fairly difficult optimisations to describe, hopefully the comments are helpful/enough! I'd focus on the changes in train_gpt2.cu rather than the similar ones in /dev/cuda/ (I didn't include a...
This causes a small ~0.3% performance loss on my RTX 4090, possibly worse on an A100 since that kernel might be a slightly larger % of total runtime. It does...
This is a complete rewrite of the encoder backward pass, splitting it into two kernels (wte and wpe) which are both fully deterministic as they do not use atomics (assuming...
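The wpe half of this split is straightforward to illustrate. A hedged sketch (not the PR's actual kernel): each thread owns one `(t, c)` element of the position-embedding gradient and sums over the batch dimension in a fixed order, so no atomics are needed and the result is bit-identical across runs.

```cuda
// Sketch: deterministic wpe backward. dout is (B, T, C); dwpe is (T, C).
// One thread per (t, c) element; the sequential loop over b fixes the
// accumulation order, which is what makes it deterministic without atomics.
__global__ void wpe_backward_kernel(float* dwpe, const float* dout,
                                    int B, int T, int C) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= T * C) return;
    int t = idx / C, c = idx % C;
    float sum = 0.0f;
    for (int b = 0; b < B; b++)
        sum += dout[((size_t)b * T + t) * C + c];
    dwpe[(size_t)t * C + c] += sum;
}
```

The wte half is the harder part, since the same token id can appear many times; making it deterministic presumably requires grouping duplicate occurrences (e.g. a bucketing/sort step) so each embedding row is accumulated by exactly one block in a fixed order.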
This is the current state of my FP8 branch, it's far from ready, but it's at the point where you could take a look if you're curious! The last version...
This adds a '-rg' parameter to manually set the RNG seed. This is useful to see if a change is beneficial or not when the difference is potentially real but...