Horace He
Yes. GPU power limit is an unfortunate limitation of the particular hardware setup I'm using - it's not required.
I would perhaps suggest this video giving an overview of TorchInductor: https://www.youtube.com/watch?v=p13HpZv2S3Q Another thing you can check out is `TORCH_LOGS="output_code"`, which'll show you the actual triton kernels that are generated....
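For concreteness, here's a minimal sketch of what that looks like (this example isn't from the thread; it assumes a CUDA build of PyTorch 2.x with Triton available, and the function and shapes are made up):
```
import torch

# Same effect as running the script with TORCH_LOGS="output_code".
torch._logging.set_logs(output_code=True)

@torch.compile
def fused_gelu_mul(x, y):
    # A small pointwise chain that Inductor should fuse into a single Triton kernel.
    return torch.nn.functional.gelu(x) * y

x = torch.randn(1024, 1024, device="cuda")
y = torch.randn(1024, 1024, device="cuda")
fused_gelu_mul(x, y)  # first call triggers compilation and logs the generated kernel source
```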
What command are you running to get this error?
`F.scaled_dot_product_attention` automatically makes a decision about what backend to dispatch to. For example, it can choose to dispatch to the FlashAttention2 kernel. Or, for example, on platforms where FlashAttention2 is...
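A rough sketch of that dispatch behavior (shapes and dtypes here are illustrative, not from the thread): by default PyTorch picks the backend itself, and the `torch.backends.cuda.sdp_kernel` context manager can restrict which backends it's allowed to choose.
```
import torch
import torch.nn.functional as F

q = torch.randn(1, 32, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 32, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 32, 1024, 64, device="cuda", dtype=torch.float16)

# Default: PyTorch chooses among the FlashAttention2, memory-efficient, and math backends.
out = F.scaled_dot_product_attention(q, k, v)

# Restrict dispatch to FlashAttention only; this errors on setups where the
# FlashAttention kernel isn't available for these dtypes/shapes.
with torch.backends.cuda.sdp_kernel(
    enable_flash=True, enable_math=False, enable_mem_efficient=False
):
    out_flash = F.scaled_dot_product_attention(q, k, v)
```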
The big issue is the work partitioning structure. FlashAttention parallelizes across heads, BS, and `output_seq_len` (i.e. the query sequence length). In this case, BS and `output_seq_len` are both 1, so the only parallelism is...
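A back-of-the-envelope version of that argument (numbers are illustrative, not from the thread): FlashAttention launches roughly one thread block per (batch, head, query-tile) triple, so during single-sequence decoding the grid collapses to just `num_heads` blocks.
```
num_heads = 32           # e.g. a Llama-7B-sized model (assumption)
batch_size = 1           # single user
query_len_prefill = 2048
query_tile = 128         # query rows handled per thread block (illustrative)

prefill_blocks = batch_size * num_heads * (query_len_prefill // query_tile)
decode_blocks = batch_size * num_heads * 1   # output_seq_len == 1 during decoding

print(prefill_blocks)    # 512 blocks -> plenty of work to fill e.g. 108 SMs on an A100
print(decode_blocks)     # 32 blocks  -> most SMs sit idle without further splitting
```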
> The function decode_n_tokens, in which the torch.backends.cuda.sdp_kernel decorator is used, is not compiled. Does that mean the aforementioned behavior is not applied?

No, `decode_n_tokens` calls `decode_token`, which does have...
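To illustrate the structure being described (this is a minimal sketch, not the actual gpt-fast code): the `sdp_kernel` context manager sits in the uncompiled outer loop, while the per-token step it wraps is the function that gets compiled.
```
import torch
import torch.nn.functional as F

@torch.compile
def decode_token(q, k, v):
    # The compiled per-token step; SDPA dispatch happens in here.
    return F.scaled_dot_product_attention(q, k, v)

def decode_n_tokens(q, k, v, n):
    # Uncompiled driver loop, analogous to the function asked about:
    # the context manager is entered here, around calls into compiled code.
    outs = []
    with torch.backends.cuda.sdp_kernel(
        enable_flash=True, enable_math=False, enable_mem_efficient=False
    ):
        for _ in range(n):
            outs.append(decode_token(q, k, v))
    return outs
```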
What is a "normal implementation" of the model? To be clear, the metric reported here is also sometimes called "tokens per second per user" (i.e. the latency for a single...
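As a tiny worked example of that metric (numbers are made up): "tokens per second per user" is just the reciprocal of the per-token decode latency seen by one request.
```
per_token_latency_s = 0.010                     # 10 ms per generated token (assumption)
tokens_per_s_per_user = 1.0 / per_token_latency_s
print(tokens_per_s_per_user)                    # 100.0
```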
@yifuwang I think the right way to handle this is that we should compile once for all ranks, and then re-use the graph on all ranks.
Putting this inside `vim.otherModesKeyBindings` is a good temporary fix if this is a big issue for you:
```
{
  "before": ["u"],
  "after": [],
  "commands": [
    { "command": "undo" }
  ]
}
```
@petejkim You're correct! Thanks!