Chris Dryden issues

Results: 10 issues of Chris Dryden

Updated Matmul block size for a 0.6 ms speedup

I went through all of the blocks on the A100 to see if any were not tuned correctly for it, and found that this block size had a stunning 2%...

Experimenting with global instantiation for the layouts

I was able to see a speed increase. This needs to be cleaned up and refactored substantially, but it is good to see what the potential speedup would be.

Removed cooperative groups in softmax_autoregressive_backward_kernel

Following the same pattern described in: https://github.com/karpathy/llm.c/issues/292

Had to map that:
```
__syncthreads() == block.sync()
```
and
```
warp.thread_rank() == warpId * warpSize + laneId
```
Performance before: block_size 32...

Updated adamw to use packed data types

Before runtime: total average iteration time: 38.547570 ms
After runtime: total average iteration time: 37.901735 ms
Kernel development file specs: barely noticeable with the current test suite. Before: time gpu...

gelu_backwards CUDA dev file and float4 dtype for parallel memory read

This CR implements a CUDA dev file for gelu_backwards and adds a second kernel that uses the float4 dtype. The session to implement this was recorded here: https://youtu.be/eOOjKTlLY-s This was...

Splitting CUDA dev files to use smaller sizes for CPU validation than for profiling

Trying to come up with some examples of beginner-friendly issues that would be helpful to the development effort: Currently, when profiling the CUDA kernels, the first step is to...

Modified version of ademeure's fused gelu_forward kernel

I was experimenting with the fused gelu kernel, combining it with the previously built code for non-gelu matmuls, and when running it locally it...

2D and 3D tile divisions so that permutation coordinates can be read from threadIdx and blockIdx

Supposedly the permutation kernels, even though they are mostly memory bound, can reduce the amount of division and do thread coarsening by having a 2D or 3D grid and not...
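To illustrate the idea of reading coordinates directly instead of deriving them with division, here is a minimal CPU-side sketch in C++. It is not the kernel from this issue: the tensor shape `(B, N, D)`, the transpose of the last two dimensions, and the names `Coord3`, `coords_from_flat_index`, `permute_flat`, and `permute_grid` are all hypothetical. In `permute_flat` each loop iteration stands in for one thread of a 1D launch, while in `permute_grid` the three loop variables stand in for coordinates that a 3D launch would presumably read from `blockIdx` and `threadIdx`.

```cpp
#include <cassert>
#include <vector>

// Hypothetical coordinates into a tensor of shape (B, N, D).
struct Coord3 { int b, n, d; };

// 1D-launch style: recover coordinates from a flat index.
// This costs integer divisions and modulos per element.
Coord3 coords_from_flat_index(int idx, int N, int D) {
    Coord3 c;
    c.d = idx % D;
    c.n = (idx / D) % N;
    c.b = idx / (N * D);
    return c;
}

// Permute (B, N, D) -> (B, D, N); each idx plays the role of one thread.
void permute_flat(const std::vector<float>& in, std::vector<float>& out,
                  int B, int N, int D) {
    for (int idx = 0; idx < B * N * D; ++idx) {
        Coord3 c = coords_from_flat_index(idx, N, D);
        out[(c.b * D + c.d) * N + c.n] = in[idx];
    }
}

// 3D-grid style: b, n, d stand in for values read from blockIdx/threadIdx,
// so no division or modulo is needed to find the coordinates.
void permute_grid(const std::vector<float>& in, std::vector<float>& out,
                  int B, int N, int D) {
    for (int b = 0; b < B; ++b)
        for (int n = 0; n < N; ++n)
            for (int d = 0; d < D; ++d)
                out[(b * D + d) * N + n] = in[(b * N + n) * D + d];
}
```

Both functions write the same output; the difference is only in how each element's coordinates are obtained. Whether this helps a memory-bound kernel on a real GPU would need to be measured.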

Add Mobile Support

The website is blocked on my work network, so mobile support is imperative.

AdamW constant pre-calculation outside of kernel

This PR precalculates all of the multiplications of the constants with one another and reduces the number of parameters that must be passed into the adamw kernel. This PR also...
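A minimal CPU-side sketch in C++ of what precalculating the constants could look like, assuming the textbook AdamW update. The struct `AdamWConstants`, the helper `precompute_adamw`, and the choice of which products to fold together are assumptions for illustration, not the PR's actual code. The reference function recomputes everything per element so the two versions can be compared.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical bundle of values computed once per optimizer step on the host.
struct AdamWConstants {
    float one_minus_beta1;       // 1 - beta1
    float one_minus_beta2;       // 1 - beta2
    float inv_bias_correction1;  // 1 / (1 - beta1^t)
    float inv_bias_correction2;  // 1 / (1 - beta2^t)
    float lr_weight_decay;       // learning_rate * weight_decay
};

AdamWConstants precompute_adamw(float lr, float beta1, float beta2,
                                float weight_decay, int t) {
    AdamWConstants c;
    c.one_minus_beta1 = 1.0f - beta1;
    c.one_minus_beta2 = 1.0f - beta2;
    c.inv_bias_correction1 = 1.0f / (1.0f - std::pow(beta1, static_cast<float>(t)));
    c.inv_bias_correction2 = 1.0f / (1.0f - std::pow(beta2, static_cast<float>(t)));
    c.lr_weight_decay = lr * weight_decay;
    return c;
}

// Per-element update using the precomputed constants (no pow, no division
// by the bias-correction terms inside the hot loop).
inline void adamw_update(float& param, float& m, float& v, float grad,
                         float lr, float beta1, float beta2, float eps,
                         const AdamWConstants& c) {
    m = beta1 * m + c.one_minus_beta1 * grad;
    v = beta2 * v + c.one_minus_beta2 * grad * grad;
    float m_hat = m * c.inv_bias_correction1;
    float v_hat = v * c.inv_bias_correction2;
    param -= lr * m_hat / (std::sqrt(v_hat) + eps) + c.lr_weight_decay * param;
}

// Reference: textbook AdamW with every constant recomputed per element.
inline void adamw_update_reference(float& param, float& m, float& v, float grad,
                                   float lr, float beta1, float beta2, float eps,
                                   float weight_decay, int t) {
    m = beta1 * m + (1.0f - beta1) * grad;
    v = beta2 * v + (1.0f - beta2) * grad * grad;
    float m_hat = m / (1.0f - std::pow(beta1, static_cast<float>(t)));
    float v_hat = v / (1.0f - std::pow(beta2, static_cast<float>(t)));
    param -= lr * (m_hat / (std::sqrt(v_hat) + eps) + weight_decay * param);
}
```

In a GPU setting, `precompute_adamw` would run once on the host per step and the resulting values would be passed to the kernel, so each thread would only do multiplies and adds with them.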