Andrej
Confirmed this fixes the issue, stepping at ~460K tok/s on 4X A100 GPUs.
Hello, ty for the PR! I'm not an expert in cuDNN use; do you have a short explanation for some of these changes? Also I noticed you edited the dev/cuda...
Running the test with this PR (`make test_gpt2cu USE_CUDNN=1 && ./test_gpt2cu`) actually fails; specifically, the error on the `qkvw` tensor grows from 1.1e-1 to 1.4e-1. So we'd have to dumb...
This was not flagged by our CI because I think it does not turn on `USE_CUDNN=1` in the `make` command.
Sorry for the spam. I noticed that it's not this PR that is "flipping" the test from FAIL to PASS; it's the way we compile, without the use of `USE_CUDNN=1`. Master...
Very cool! I'll take a look and also see if I can find a slurm cluster to play with this on. Do you by any chance have a PyTorch baseline...
Thank you for posting @chinthysl , very cool. We had a small discussion about it on our Discord with the core devs, please join us sometime on the [CUDA MODE](https://github.com/cuda-mode)...
[dev/cuda] Include a matmul_backward_bias kernel based on PMPP CoarsenedSumReduction kernel in 10.15
Why delete the 768 block size?
Nice! This is actually super convenient because it may mean that we could have tests for our training matching that of PyTorch from scratch, without having to save/load checkpoints. We...
Hi @jart it's nice to see you stop by! I don't think I can merge this because (for educational and historic reasons) I am trying to be compatible with GPT-2...