Andrej
Also, another reason this code will fail is that it hardcodes the max context length to 2048:
```python
while input_ids.shape[-1] > 2048:
```
but e.g. GPT-2 has max context length...
I noticed one more bug. The issue is that there is a double space in the intro line, right before the subject. This is because ```python def format_subject(subject): l =...
It's def on my todo list to incorporate FSDP into nanoGPT but I haven't looked into it in detail just yet. I also know that FSDP internals are being actively...
In lines like `const size_t N = (size_t)(B) * T * V;`, is the explicit cast needed?
Sorry I meant the casts look ugly to my eye. Maybe we could make the individual params `size_t` in the function declarations 🤔, so their products will come out `size_t`...
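To make the two options concrete, here is a minimal sketch. The function names and the shapes in the comments are made up for illustration and are not taken from the PR. With `int` parameters, the cast on the first operand is what forces the whole product to be evaluated in `size_t`. With `size_t` parameters, no cast is needed at the multiplication site.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper: int parameters, so the first operand is cast.
// The cast widens B to size_t, and T and V are then converted to size_t
// by the usual arithmetic conversions, so the product is computed in size_t.
// Without the cast, B * T * V would be evaluated in int and could overflow,
// e.g. for B=64, T=1024, V=50257 (illustrative shapes).
size_t count_with_cast(int B, int T, int V) {
    assert(B >= 0 && T >= 0 && V >= 0);  // negative sizes make no sense here
    return (size_t)B * T * V;
}

// Hypothetical alternative: size_t parameters, no cast at the call site.
size_t count_size_t_params(size_t B, size_t T, size_t V) {
    return B * T * V;
}
```

Both return the same count. The second form moves the conversion to the function boundary, which is the trade-off being discussed above.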
Oops, this PR now conflicts because I merged the other one. Sounds good, agree it is ok to skip `+=` here, but I think it should come with a comment...
We can't just `malloc` on repeat without a matching `free`. Maybe `memset` to zero if needed?
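A rough sketch of what that could look like: allocate once, then zero the same buffer in place on later calls. The `Scratch` struct and function names are hypothetical and not from the PR. It also assumes the requested size stays the same between calls.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical scratch buffer holder (illustration only).
typedef struct {
    float* data;
    size_t n;
} Scratch;

// Allocate on first use; on later calls reuse the same memory and just
// zero it with memset instead of calling malloc again (which would leak).
float* scratch_get_zeroed(Scratch* s, size_t n) {
    if (s->data == NULL) {
        s->data = (float*)malloc(n * sizeof(float));
        if (s->data == NULL) { return NULL; }
        s->n = n;
    }
    assert(s->n == n);  // this sketch assumes a fixed size across calls
    memset(s->data, 0, n * sizeof(float));  // all-zero bytes are 0.0f in IEEE 754
    return s->data;
}

// Release the buffer once, when it is no longer needed.
void scratch_free(Scratch* s) {
    free(s->data);
    s->data = NULL;
    s->n = 0;
}
```

Calling `scratch_get_zeroed` every iteration then costs one `memset` and no new allocations, and `scratch_free` is called once at teardown.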
I merged the previous PR, so this one should be ready. ACK on using `=` instead of `+=` in the backward pass. I didn't even realize originally that this would...
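For readers following along, here is a small sketch of the `=` vs `+=` distinction in a backward pass. It uses a made-up elementwise op (`out = 2 * inp`), not a kernel from this PR, and the comments reflect my assumption about when each form is safe.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical backward pass for out = 2 * inp.
// Accumulating form: needed when several ops contribute gradient to dinp.
// It assumes dinp was zeroed (or holds earlier contributions) before the call.
void scale_backward_accumulate(float* dinp, const float* dout, size_t n) {
    assert(dinp != NULL && dout != NULL);
    for (size_t i = 0; i < n; i++) {
        dinp[i] += 2.0f * dout[i];
    }
}

// Overwriting form: valid only if this is the sole contribution to dinp.
// It does not require dinp to be zeroed first, but it would silently drop
// any gradient already stored there, which is why it deserves a comment.
void scale_backward_overwrite(float* dinp, const float* dout, size_t n) {
    assert(dinp != NULL && dout != NULL);
    for (size_t i = 0; i < n; i++) {
        dinp[i] = 2.0f * dout[i];
    }
}
```

The overwriting version gives the right answer even when the buffer holds stale values, while the accumulating version only does so if the buffer starts from the correct state.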
Also one possible request - I think a lot of people will come to `dev/cuda` to learn CUDA. If you're able to comment some of the kernels I think it could...
So cool, I went down from 400ms/iter to 200ms/iter.