Andrej

373 comments by Andrej

Hi @chinthysl, heads up that I just merged a PR to (optionally) keep master weights in fp32. I think this impacts this PR: https://github.com/karpathy/llm.c/pull/328. Eager to merge this one, though!

Some notes on this PR from exploration on my GPU box. My current default "go-to" run is this 124M model configuration:

```bash
make train_gpt2cu USE_CUDNN=1
mpirun -np 4 ./train_gpt2cu...
```

OK, finally had time to step through this in detail. LGTM, ty.

@FeSens, can you post what kind of perf you're seeing for this?

I like the allocations fix, but I'm not sure about the types fix.

This issue is about that: https://github.com/karpathy/llm.c/issues/146. Right now we always forward B * T tokens in a single, fixed batch configuration that never changes. In principle you can dynamically lower...

I think I'm missing a bit of context on this PR. Is this following some paper / approach?

Sorry, to clarify: I want to remove the need for Python in this repo. It's a nice-to-have for correctness checks but shouldn't be required. Right now it outputs...

This is very cool work!! Questions:

- There are mallocs inside the kernel launch. I'm guessing in the actual implementation we'd treat these as buffers and make them part of...

Hi @kilianhae & @simonguozirui, note that we merged the cuDNN flash attention to master today: https://github.com/karpathy/llm.c/pull/323, so this becomes the baseline to beat!