Aroun Demeure
This improves performance on my local RTX 4090 from ~65ms to ~34ms (while PyTorch takes ~36ms!) **ORIGINAL**: step 1: train loss 4.406481 (took 64.890952 ms) **OPTIMISED**: step 1: train loss...
This replaces the memory allocation for activations to use compressible memory (available on Hopper and Ada Lovelace only). It improves performance from 42.1ms to 39.2ms on RTX 4090, but has...
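As a rough illustration (not the PR's actual code, which is truncated above): compressible memory cannot be requested through plain `cudaMalloc`; it goes through the CUDA driver's virtual memory management API, with `allocFlags.compressionType` set on the allocation properties. A hedged sketch, with the helper name `alloc_compressible` being illustrative:

```cuda
#include <cuda.h>

// Sketch: allocate device memory eligible for generic (L2) compression via the
// CUDA virtual memory management API. Error checking omitted for brevity.
CUdeviceptr alloc_compressible(size_t size, CUdevice device) {
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = device;
    prop.allocFlags.compressionType = CU_MEM_ALLOCATION_COMP_GENERIC;

    // sizes must be a multiple of the allocation granularity
    size_t granularity = 0;
    cuMemGetAllocationGranularity(&granularity, &prop,
                                  CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    size = (size + granularity - 1) & ~(granularity - 1);

    CUmemGenericAllocationHandle handle;
    cuMemCreate(&handle, size, &prop, 0);

    CUdeviceptr ptr;
    cuMemAddressReserve(&ptr, size, 0, 0, 0);  // reserve VA range
    cuMemMap(ptr, size, 0, handle, 0);         // map physical allocation

    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(ptr, size, &access, 1);
    return ptr;
}
```

Whether compression is actually applied can be queried beforehand via `cuDeviceGetAttribute` with `CU_DEVICE_ATTRIBUTE_GENERIC_COMPRESSION_SUPPORTED`, which is presumably why the PR is gated to Hopper and Ada Lovelace.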
This is a faster version of the cool new kernel from #117 (still /dev/cuda/ only). The biggest difference is that it is optimised for doing one row per 1024-wide block rather...
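Since the description is cut off, here is a generic hedged sketch of the "one row per 1024-wide block" pattern it refers to (a simple row reduction, not the actual kernel from #117): each block owns one row, strides its 1024 threads across the columns, then reduces within warps via shuffles and across warps via shared memory.

```cuda
// Sketch: one row per block. Each thread accumulates a strided partial sum,
// warps reduce via __shfl_down_sync, then warp results combine in shared memory.
__global__ void row_sum_kernel(float* out, const float* inp, int C) {
    const float* row = inp + (size_t)blockIdx.x * C;  // this block's row
    float sum = 0.0f;
    for (int i = threadIdx.x; i < C; i += blockDim.x) sum += row[i];

    for (int offset = 16; offset > 0; offset >>= 1)   // intra-warp reduction
        sum += __shfl_down_sync(0xFFFFFFFFu, sum, offset);

    __shared__ float warp_sums[32];
    int warp = threadIdx.x / 32, lane = threadIdx.x % 32;
    if (lane == 0) warp_sums[warp] = sum;
    __syncthreads();

    if (warp == 0) {                                  // final inter-warp reduction
        sum = (lane < blockDim.x / 32) ? warp_sums[lane] : 0.0f;
        for (int offset = 16; offset > 0; offset >>= 1)
            sum += __shfl_down_sync(0xFFFFFFFFu, sum, offset);
        if (lane == 0) out[blockIdx.x] = sum;
    }
}
// launch: row_sum_kernel<<<num_rows, 1024>>>(d_out, d_inp, C);
```

The appeal of this layout is that the row's data stays resident in one SM's registers/shared memory, avoiding the cross-block synchronisation a multi-block-per-row scheme would need.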
Refactoring and removing unused functions to reduce the number of lines of code and make everything slightly more consistent (while still having space for the code to breathe). Also updates...
It turns out that not only is cuBLASLt not able to fuse BF16 GELU (or RELU) into a BF16 matmul, it also ends up with a strange kernel that is...
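For context, this is how the GELU epilogue fusion would normally be requested from cuBLASLt; per the observation above, with BF16 inputs and outputs the library does not actually produce a properly fused kernel. A hedged host-side sketch (layout/transpose setup elided):

```cuda
#include <cublasLt.h>

// Sketch: request a GELU epilogue on a matmul descriptor. For BF16 matmuls
// (CUDA_R_16BF layouts), the PR found this path falls back to an odd,
// effectively unfused kernel rather than a true fusion.
cublasLtMatmulDesc_t desc;
cublasLtMatmulDescCreate(&desc, CUBLAS_COMPUTE_32F, CUDA_R_32F);

cublasLtEpilogue_t epilogue = CUBLASLT_EPILOGUE_GELU;
cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_EPILOGUE,
                               &epilogue, sizeof(epilogue));
// ... create CUDA_R_16BF matrix layouts, pick an algo with
//     cublasLtMatmulAlgoGetHeuristic, then call cublasLtMatmul as usual ...
```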
These are fairly difficult optimisations to describe, hopefully the comments are helpful/enough! I'd focus on the changes in train_gpt2.cu rather than the similar ones in /dev/cuda/ (I didn't include a...
This causes a small ~0.3% performance loss on my RTX 4090, possibly worse on an A100 since that kernel might be a slightly larger % of total runtime. It does...
This is a complete rewrite of the encoder backward pass, splitting it into two kernels (wte and wpe) which are both fully deterministic as they do not use atomics (assuming...
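The wpe half of this split is straightforward to illustrate. A hedged sketch (not the PR's actual kernel): each thread owns one `(t, c)` element of the position-embedding gradient and sums over the batch dimension in a fixed order, so no atomics are needed and the result is bit-identical across runs.

```cuda
// Sketch: deterministic wpe backward. dout is (B, T, C); dwpe is (T, C).
// One thread per (t, c) element; the sequential loop over b fixes the
// accumulation order, which is what makes it deterministic without atomics.
__global__ void wpe_backward_kernel(float* dwpe, const float* dout,
                                    int B, int T, int C) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= T * C) return;
    int t = idx / C, c = idx % C;
    float sum = 0.0f;
    for (int b = 0; b < B; b++)
        sum += dout[((size_t)b * T + t) * C + c];
    dwpe[(size_t)t * C + c] += sum;
}
```

The wte half is the harder part, since the same token id can appear many times; making it deterministic presumably requires grouping duplicate occurrences (e.g. a bucketing/sort step) so each embedding row is accumulated by exactly one block in a fixed order.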
This is the current state of my FP8 branch, it's far from ready, but it's at the point where you could take a look if you're curious! The last version...
This adds a '-rg' parameter to manually set the RNG seed. This is useful to see if a change is beneficial or not when the difference is potentially real but...