Chris Dryden
A group of students from a university course was hoping to work on this for an open-source assignment. Could we take this on?
Putting in here a summary of the discussion of the result of this CR: https://github.com/karpathy/llm.c/pull/221 There is a significant slowdown, on the order of 32x, when using doubles...
Just posting some notes here from my research into removing all of the cooperative-groups (CG) code to drop the dependency: ``` sum = cg::reduce(warp, sum, cg::plus<float>{}); ``` Can...
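For context, one common way to replace that `cg::reduce` call without cooperative groups is a plain warp shuffle reduction. This is a sketch of that pattern, not necessarily the exact replacement that landed in llm.c:

```cuda
#include <cuda_runtime.h>

// A possible drop-in for cg::reduce(warp, sum, cg::plus<float>{}):
// a tree reduction over the 32 lanes of a warp using __shfl_down_sync,
// which needs no cooperative_groups header at all.
__device__ float warp_reduce_sum(float sum) {
    // halve the stride each step: 16, 8, 4, 2, 1
    for (int offset = 16; offset > 0; offset /= 2) {
        sum += __shfl_down_sync(0xFFFFFFFFu, sum, offset);
    }
    return sum;  // lane 0 of the warp now holds the full sum
}
```

Note that unlike `cg::reduce`, only lane 0 ends up with the complete result; if every lane needs it, `__shfl_xor_sync` can be used instead of `__shfl_down_sync`.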
Updated the PR to show the new kernel. It does give a speedup in the train loop for me: total average iteration time: 38.287047 ms to total average iteration...
Was trying to work out the theory of why this is the case and saw the comment: ``` // and atomically add everything together. atomics within one block are conflict-free!...
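The pattern that comment is describing, as I understand it, looks roughly like this (shapes and names here are assumptions for illustration, not the actual llm.c kernel): threads accumulate into a shared-memory slot, where the hardware resolves block-local atomic conflicts cheaply, and then only one global atomic is issued per block.

```cuda
#include <cuda_runtime.h>

// Sketch: per-block accumulation via a shared-memory atomic, followed by a
// single global atomicAdd per block. Block-local atomics stay on-chip and
// are far cheaper than having every thread hit global memory.
__global__ void block_sum(const float* x, float* out, int n) {
    __shared__ float block_acc;
    if (threadIdx.x == 0) block_acc = 0.0f;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&block_acc, x[i]);  // conflict resolved within the block
    __syncthreads();

    if (threadIdx.x == 0) atomicAdd(out, block_acc);  // one global atomic per block
}
```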
All of the cooperative-groups code was removed in another PR.
Would it be possible to also add the commands, with the params used in the profiling script, to do the comparison? I have access to run it on an H100...
I was running on the A100 in the dev kernel, comparing the 4th kernel to the 6th kernel with the default params: B=1, T=8192, C=768, NH...
Where this came up in discussion was the possibility of passing all of the constants directly into the kernel, such as the following values: https://github.com/karpathy/llm.c/blob/master/train_gpt2.cu#L689...
Created an example implementation here: https://github.com/karpathy/llm.c/pull/459, but it doesn't seem to be working properly yet.