Chris Dryden

Results 10 issues of Chris Dryden

Went through all of the blocks on the A100 to see if any were not tuned correctly for an A100 and found that this block size had a stunning 2%...

I was able to see a speed increase. The code needs substantial cleanup and refactoring, but it is good to see what the potential speedup would be.

Following the same pattern described in https://github.com/karpathy/llm.c/issues/292, I had to map `__syncthreads() == block.sync()` and `warp.thread_rank() == warpId * warpSize + laneId`. Performance before: block_size 32...
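The index arithmetic on the right-hand side of the second mapping can be checked on the host. This is a minimal C sketch, assuming a 1D thread block and a warp size of 32; `WARP_SIZE`, `warp_id`, `lane_id`, and `linear_rank` are illustrative names and are not taken from the PR.

```c
#define WARP_SIZE 32 /* assumption: 32 threads per warp */

/* Index of the warp that the thread with linear index tid belongs to. */
static int warp_id(int tid) { return tid / WARP_SIZE; }

/* Position of the thread inside its warp. */
static int lane_id(int tid) { return tid % WARP_SIZE; }

/* Rebuild the linear thread index from its warp and lane components. */
static int linear_rank(int warp, int lane) { return warp * WARP_SIZE + lane; }
```

Splitting an index with `warp_id` and `lane_id` and recombining it with `linear_rank` returns the original index for every thread in the block.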

Before runtime: total average iteration time: 38.547570 ms
After runtime: total average iteration time: 37.901735 ms
Kernel development file specs: barely noticeable with the current test suite. Before: time gpu...

This CR implements a CUDA dev file for gelu_backwards and adds a second kernel that uses the float4 dtype. The session to implement this was recorded here: https://youtu.be/eOOjKTlLY-s This was...

Trying to come up with some examples of beginner-friendly issues that would be helpful to the development effort: Currently, when profiling the CUDA kernels, the first step is to...

I was experimenting with the fused GELU kernel, combining it with the previously built code path for non-GELU matmuls, and when running it locally it...

Supposedly the permutation kernels, even though they are mostly memory-bound, can reduce the number of divisions and do thread coarsening by using a 2D or 3D grid and not...

The website is blocked on my work network, so mobile support is imperative.

This PR precalculates all of the multiplications of constants with one another and reduces the number of parameters that need to be passed into the AdamW kernel. This PR also...