Chris Dryden
A group of students from a university course was hoping to work on this for an open-source assignment. Could we take this on?
Putting in here a summary of the discussion of the result of this CR: https://github.com/karpathy/llm.c/pull/221 There is a significant slowdown, on the order of 32x, when using doubles...
Just posting some notes here from my research into removing all of the cooperative-groups (CG) code to drop the dependency: ``` sum = cg::reduce(warp, sum, cg::plus<float>{}); ``` Can...
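For context, one common way to replace that `cg::reduce` call without cooperative groups is a plain warp shuffle reduction. This is a sketch of that pattern, not necessarily the exact replacement that landed in llm.c:

```cuda
#include <cuda_runtime.h>

// A possible drop-in for cg::reduce(warp, sum, cg::plus<float>{}):
// a tree reduction over the 32 lanes of a warp using __shfl_down_sync,
// which needs no cooperative_groups header at all.
__device__ float warp_reduce_sum(float sum) {
    // halve the stride each step: 16, 8, 4, 2, 1
    for (int offset = 16; offset > 0; offset /= 2) {
        sum += __shfl_down_sync(0xFFFFFFFFu, sum, offset);
    }
    return sum;  // lane 0 of the warp now holds the full sum
}
```

Note that unlike `cg::reduce`, only lane 0 ends up with the complete result; if every lane needs it, `__shfl_xor_sync` can be used instead of `__shfl_down_sync`.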
Updated the PR to show the new kernel. It does give a speedup in the train loop for me: total average iteration time: 38.287047 ms to total average iteration...
Was trying to work out the theory of why this is the case and saw the comment: ``` // and atomically add everything together. atomics within one block are conflict-free!...
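The pattern that comment is describing, as I understand it, looks roughly like this (shapes and names here are assumptions for illustration, not the actual llm.c kernel): threads accumulate into a shared-memory slot, where the hardware resolves block-local atomic conflicts cheaply, and then only one global atomic is issued per block.

```cuda
#include <cuda_runtime.h>

// Sketch: per-block accumulation via a shared-memory atomic, followed by a
// single global atomicAdd per block. Block-local atomics stay on-chip and
// are far cheaper than having every thread hit global memory.
__global__ void block_sum(const float* x, float* out, int n) {
    __shared__ float block_acc;
    if (threadIdx.x == 0) block_acc = 0.0f;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&block_acc, x[i]);  // conflict resolved within the block
    __syncthreads();

    if (threadIdx.x == 0) atomicAdd(out, block_acc);  // one global atomic per block
}
```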
All of the cooperative-groups code was removed in another PR.
Would it be possible to also add the commands, with the params used in the profiling script, to do the comparison? I have access to run it on an H100...
I was running on the A100 in the dev kernel, comparing the 4th kernel to the 6th kernel with the default params: B=1, T=8192, C=768, NH...
Where this came up in discussion was the possibility of passing all of the constants directly into the kernel, such as the following values: https://github.com/karpathy/llm.c/blob/master/train_gpt2.cu#L689...
Created an example implementation here: https://github.com/karpathy/llm.c/pull/459, but it doesn't seem to be working properly yet.