Andrej
I'm seeing slightly different results:
```
(y-y2).abs().max()
0.0078
```
which is a bit unsettling. Any idea where this is from?
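For what it's worth, discrepancies of this order usually come from floating-point non-associativity: a compiled kernel may fuse or reorder reductions, and float32 addition is not associative, so a different summation order gives slightly different rounding. A minimal numpy sketch of the underlying effect (not the actual torch.compile codepath):

```python
import numpy as np

# float32 addition is not associative: reordering a reduction changes
# the rounding, which is all a compiler needs to do to produce small
# numeric diffs on large tensors.
a = np.float32(1e8)
b = np.float32(1.0)

left = (a + b) - a   # b is absorbed: 1e8 + 1 rounds back to 1e8
right = (a - a) + b  # same math, different order, exact result

print(left, right)   # 0.0 1.0
```

Both expressions are mathematically equal to 1, but the evaluation order determines whether the small term survives the rounding.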
I also get a lot of really scary warnings from torch.compile ...
Heads up, I merged a slight modification in this commit: https://github.com/karpathy/nanoGPT/commit/ae06d0b15a9111cbe2ce66b0f1be9ae29c1ecbbe Let me know if you have any comments.
@drisspg it's much worse than that. Just running `train.py` prints:
```
compiling the model... (takes a ~minute)
[2023-01-30 23:47:24,269] torch._inductor.graph: [WARNING] Creating implicit fallback for:
  target: aten._scaled_dot_product_efficient_attention.default
  args[0]: TensorBox(
    PermuteView(data=View(...
```
This was merged now, so closing the issue.
I don't understand what's happening here. Where is the error coming from?
Something can't be right here. How is TensorFlow even involved?
You're right, it should be; I'll issue a fix. One thing to note is that this is less of a bug than it appears to be, because Adam is scale...
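A quick sketch of the scale point: Adam divides the (bias-corrected) first moment by the square root of the second, so multiplying every gradient by a constant cancels out of the update, exactly so when `eps` is zero. The step function below is a one-step illustration from zero optimizer state, with made-up values:

```python
import numpy as np

def adam_step(g, lr=1e-3, b1=0.9, b2=0.999, eps=0.0):
    # One Adam step starting from zero moment estimates,
    # with the standard bias correction applied.
    m = (1 - b1) * g
    v = (1 - b2) * g * g
    m_hat = m / (1 - b1)
    v_hat = v / (1 - b2)
    return lr * m_hat / (np.sqrt(v_hat) + eps)

g = np.array([0.5, -2.0, 3.0])
u1 = adam_step(g)
u2 = adam_step(1000.0 * g)
print(np.allclose(u1, u2))  # True: scaling the gradient cancels out
```

With a nonzero `eps` (and accumulated state over many steps) the cancellation is only approximate, but the intuition carries over: Adam's updates are largely insensitive to a constant rescaling of the loss.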
I only like some of these 😂