Andrej
Hi @chinthysl, heads up that I just merged a PR to (optionally) keep master weights in fp32. I think this impacts this PR: https://github.com/karpathy/llm.c/pull/328. Eager to merge this one though!
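(A minimal CPU-side sketch of why fp32 master weights matter; all names and numbers here are illustrative, not llm.c's actual code. The optimizer updates a full-precision master copy and re-derives the low-precision working copy each step, so tiny updates accumulate instead of rounding away.)

```c
#include <stdint.h>
#include <string.h>

// round a float to the nearest bf16-representable value
// (round-half-up on the low 16 bits; good enough for a sketch)
static float to_bf16(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof(bits));
    bits = (bits + 0x8000u) & 0xFFFF0000u;
    float y;
    memcpy(&y, &bits, sizeof(y));
    return y;
}

// master weight lives in fp32; the bf16 working copy is re-derived
// from it every step, so small updates are not lost
float sgd_fp32_master(int steps) {
    float master = 1.0f, lr = 1e-4f, grad = 1.0f;
    for (int i = 0; i < steps; i++) {
        master -= lr * grad;              // update accumulates in fp32
        float working = to_bf16(master);  // what the kernels would see
        (void)working;
    }
    return master;
}

// for contrast: storing the weight itself in bf16 stalls entirely,
// because 1.0 - 1e-4 rounds back to 1.0 at bf16 precision
float sgd_bf16_only(int steps) {
    float w = 1.0f, lr = 1e-4f, grad = 1.0f;
    for (int i = 0; i < steps; i++) {
        w = to_bf16(w - lr * grad);
    }
    return w;
}
```

After 100 steps the fp32-master version has moved to roughly 0.99, while the bf16-only version is still exactly 1.0.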
Some notes on this PR from exploration on my GPU box. My current default "go to" run is this 124M model configuration:

```bash
make train_gpt2cu USE_CUDNN=1
mpirun -np 4 ./train_gpt2cu...
```
Ok finally had time to step through in detail, LGTM ty.
@FeSens can you post what kind of perf you're seeing for this?
I like the allocations fix, but I'm not sure about the types fix.
This issue is about that: https://github.com/karpathy/llm.c/issues/146. Right now we always forward B * T tokens in a single, fixed batch configuration that never changes. In principle you can dynamically lower...
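(A hypothetical sketch of the dynamic batch-configuration idea, with invented names rather than llm.c's actual API: size the activation buffers once for the maximum B*T, then let each step forward any b <= B, t <= T without reallocating.)

```c
#include <assert.h>
#include <stdlib.h>

// buffers are sized once for the maximum B*T "tokens"
typedef struct {
    int B, T;       // maximum batch size and sequence length
    float *acts;    // activation scratch sized for B*T tokens
} Model;

void model_init(Model *m, int B, int T) {
    m->B = B;
    m->T = T;
    m->acts = calloc((size_t)B * T, sizeof(float));
}

// forward b*t tokens this step; returns how many tokens were processed
int model_forward(Model *m, int b, int t) {
    assert(b <= m->B && t <= m->T);  // must fit the preallocated buffers
    int n = b * t;
    for (int i = 0; i < n; i++) m->acts[i] = 1.0f;  // stand-in for compute
    return n;
}

void model_free(Model *m) { free(m->acts); }
```

Since allocation only ever happens at the maximum configuration, a step that forwards fewer tokens costs proportionally less compute but needs no buffer churn.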
I think I'm missing a bit of context on this PR. Is this following some paper / approach?
Sorry, to clarify: I want to delete the need for Python in this repo. It's a nice-to-have for correctness checks but shouldn't be required. Right now it outputs...
This is very cool work!! Questions:
- there are mallocs inside the kernel launch; I'm guessing in the actual implementation we'd treat these as buffers and make them part of...
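(A sketch of the buffer suggestion above, with invented names and plain C in place of device allocations: hoist the per-launch mallocs into a grow-only scratch that is allocated once and reused across steps.)

```c
#include <stdlib.h>

// grow-only scratch buffer: allocate on first use, reuse afterwards
typedef struct {
    float *data;
    size_t capacity;  // in floats
} Scratch;

// return a buffer of at least n floats, reallocating only when it grows
float *scratch_get(Scratch *s, size_t n) {
    if (n > s->capacity) {
        free(s->data);
        s->data = malloc(n * sizeof(float));
        s->capacity = n;
    }
    return s->data;
}

void scratch_free(Scratch *s) {
    free(s->data);
    s->data = NULL;
    s->capacity = 0;
}
```

In a training loop the first step pays the allocation and every later step of the same (or smaller) size reuses the buffer, which is the usual way to keep malloc/free out of the hot launch path.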
Hi @kilianhae & @simonguozirui, note that we merged the cudnn flash attention to master today: https://github.com/karpathy/llm.c/pull/323 so this becomes the baseline to beat!