Andrej
Hi @chinthysl, heads up that I just merged a PR to (optionally) keep master weights in fp32. I think this impacts this PR: https://github.com/karpathy/llm.c/pull/328. Eager to merge this one though!
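(A minimal CPU-side sketch of why fp32 master weights matter; all names and numbers here are illustrative, not llm.c's actual code. The optimizer updates a full-precision master copy and re-derives the low-precision working copy each step, so tiny updates accumulate instead of rounding away.)

```c
#include <stdint.h>
#include <string.h>

// round a float to the nearest bf16-representable value
// (round-half-up on the low 16 bits; good enough for a sketch)
static float to_bf16(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof(bits));
    bits = (bits + 0x8000u) & 0xFFFF0000u;
    float y;
    memcpy(&y, &bits, sizeof(y));
    return y;
}

// master weight lives in fp32; the bf16 working copy is re-derived
// from it every step, so small updates are not lost
float sgd_fp32_master(int steps) {
    float master = 1.0f, lr = 1e-4f, grad = 1.0f;
    for (int i = 0; i < steps; i++) {
        master -= lr * grad;              // update accumulates in fp32
        float working = to_bf16(master);  // what the kernels would see
        (void)working;
    }
    return master;
}

// for contrast: storing the weight itself in bf16 stalls entirely,
// because 1.0 - 1e-4 rounds back to 1.0 at bf16 precision
float sgd_bf16_only(int steps) {
    float w = 1.0f, lr = 1e-4f, grad = 1.0f;
    for (int i = 0; i < steps; i++) {
        w = to_bf16(w - lr * grad);
    }
    return w;
}
```

After 100 steps the fp32-master version has moved to roughly 0.99, while the bf16-only version is still exactly 1.0.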
Some notes on this PR from exploration on my GPU box. My current default "go to" run is this 124M model configuration:

```bash
make train_gpt2cu USE_CUDNN=1
mpirun -np 4 ./train_gpt2cu...
```
Ok finally had time to step through in detail, LGTM ty.
@FeSens can you post what kind of perf you're seeing for this?
I like the allocations fix, but I'm not sure about the types fix.
This issue is about that: https://github.com/karpathy/llm.c/issues/146. Right now we always forward B * T tokens in a single, fixed batch configuration that never changes. In principle you can dynamically lower...
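(A hypothetical sketch of the dynamic batch-configuration idea, with invented names rather than llm.c's actual API: size the activation buffers once for the maximum B*T, then let each step forward any b <= B, t <= T without reallocating.)

```c
#include <assert.h>
#include <stdlib.h>

// buffers are sized once for the maximum B*T "tokens"
typedef struct {
    int B, T;       // maximum batch size and sequence length
    float *acts;    // activation scratch sized for B*T tokens
} Model;

void model_init(Model *m, int B, int T) {
    m->B = B;
    m->T = T;
    m->acts = calloc((size_t)B * T, sizeof(float));
}

// forward b*t tokens this step; returns how many tokens were processed
int model_forward(Model *m, int b, int t) {
    assert(b <= m->B && t <= m->T);  // must fit the preallocated buffers
    int n = b * t;
    for (int i = 0; i < n; i++) m->acts[i] = 1.0f;  // stand-in for compute
    return n;
}

void model_free(Model *m) { free(m->acts); }
```

Since allocation only ever happens at the maximum configuration, a step that forwards fewer tokens costs proportionally less compute but needs no buffer churn.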
I think I'm missing a bit of context on this PR. Is this following some paper / approach?
Sorry, to clarify: I want to delete the need for Python in this repo. It's a nice-to-have for correctness checks but shouldn't be required. Right now it outputs...
This is very cool work!! Questions:
- there are mallocs inside the kernel launch; I'm guessing in the actual implementation we'd treat these as buffers and make them part of...
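(A sketch of the buffer suggestion above, with invented names and plain C in place of device allocations: hoist the per-launch mallocs into a grow-only scratch that is allocated once and reused across steps.)

```c
#include <stdlib.h>

// grow-only scratch buffer: allocate on first use, reuse afterwards
typedef struct {
    float *data;
    size_t capacity;  // in floats
} Scratch;

// return a buffer of at least n floats, reallocating only when it grows
float *scratch_get(Scratch *s, size_t n) {
    if (n > s->capacity) {
        free(s->data);
        s->data = malloc(n * sizeof(float));
        s->capacity = n;
    }
    return s->data;
}

void scratch_free(Scratch *s) {
    free(s->data);
    s->data = NULL;
    s->capacity = 0;
}
```

In a training loop the first step pays the allocation and every later step of the same (or smaller) size reuses the buffer, which is the usual way to keep malloc/free out of the hot launch path.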
Hi @kilianhae & @simonguozirui, note that we merged the cudnn flash attention to master today: https://github.com/karpathy/llm.c/pull/323 so this becomes the baseline to beat!