Andrej
Confirmed this fixes the issue, stepping at ~460K tok/s on 4X A100 GPUs.
Hello, ty for the PR! I'm not an expert in cuDNN use; do you have a short explanation for some of these changes? Also I noticed you edited the dev/cuda...
Running the test with this PR (`make test_gpt2cu USE_CUDNN=1 && ./test_gpt2cu`) actually fails; specifically, the error on the `qkvw` tensor grows from 1.1e-1 to 1.4e-1. So we'd have to dumb...
This was not flagged by our CI because I think it does not turn on `USE_CUDNN=1` in the `make` command.
Sorry for the spam. I noticed that it's not this PR that is "flipping" the test from FAIL to PASS; it's the way we compile, without the use of `USE_CUDNN=1`. Master...
Very cool! I'll take a look and also see if I can find a slurm cluster to play with this on. Do you by any chance have a PyTorch baseline...
Thank you for posting @chinthysl , very cool. We had a small discussion about it on our Discord with the core devs, please join us sometime on the [CUDA MODE](https://github.com/cuda-mode)...
[dev/cuda] Include a matmul_backward_bias kernel based on PMPP CoarsenedSumReduction kernel in 10.15
Why delete the 768 block size?
Nice! This is actually super convenient because it may mean that we could have tests for our training matching that of PyTorch from scratch, without having to save/load checkpoints. We...
Hi @jart it's nice to see you stop by! I don't think I can merge this because (for educational and historic reasons) I am trying to be compatible with GPT-2...