Horace He

242 comments by Horace He

Actually, I think I was the one who added this haha. For things like int8 quantization, you often don't want to materialize your entire model onto GPU before doing the...
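For context, a minimal sketch of that pattern (a single hypothetical linear layer stands in for the model; names and shapes are illustrative): build the module on the "meta" device so no real weight memory is allocated, quantize on CPU, and move only the int8 tensors to the GPU.

```python
import torch
import torch.nn as nn

# Build on the "meta" device: parameters have shapes/dtypes but no storage,
# so the full-precision model never occupies GPU (or even CPU) memory.
with torch.device("meta"):
    model = nn.Linear(4096, 4096, bias=False)

# Pretend these are the real fp16 weights loaded from a checkpoint on CPU.
weight = torch.randn(4096, 4096, dtype=torch.float16)

# Symmetric per-channel int8 quantization, done on CPU.
scales = weight.abs().amax(dim=1, keepdim=True).float() / 127.0
w_int8 = torch.clamp(torch.round(weight.float() / scales), -128, 127).to(torch.int8)

# Only the int8 weights (half the bytes of fp16) ever touch GPU memory.
w_int8 = w_int8.cuda()
scales = scales.cuda()
```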

@merrymercy ah that would explain my results haha. Thanks!

Yeah, int4 quantization doesn't work on AMD GPUs right now.

This is a good question! I think there are two components to this question:
1. The default FlashAttention kernel is not very performant for decoding (see the sketch below). See https://pytorch.org/blog/flash-decoding/ for more detail.
2. ...
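To make point 1 concrete, a minimal sketch of the decoding-shaped attention call (shapes are illustrative): a single query token attends over a long KV cache, so any parallelism has to come from splitting the KV sequence rather than the length-1 query dimension, which is what flash-decoding does.

```python
import torch
import torch.nn.functional as F

bs, n_heads, head_dim, kv_len = 1, 32, 128, 4096

# One new query token vs. a long cache of keys/values.
q = torch.randn(bs, n_heads, 1, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn(bs, n_heads, kv_len, head_dim, device="cuda", dtype=torch.float16)
v = torch.randn(bs, n_heads, kv_len, head_dim, device="cuda", dtype=torch.float16)

out = F.scaled_dot_product_attention(q, k, v)  # shape: (1, 32, 1, 128)
```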

You don't need to quantize it. The weight matrix is, say, 4096 x 4096; the bias is just another 4096 elements, so about 0.02% of the size.
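For concreteness, the arithmetic behind that percentage:

```python
# A 4096 x 4096 weight matrix vs. a 4096-element bias vector.
weight_elems = 4096 * 4096
bias_elems = 4096
print(f"{bias_elems / weight_elems:.4%}")  # 0.0244%, i.e. roughly 0.02%
```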

The main reason I didn't do this previously was worry that it would cause the code to hard-break on older versions. When was this new API added?
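A sketch of the usual guard against such hard breaks: probe for the API instead of assuming it exists. `scaled_dot_product_attention` (added in PyTorch 2.0) is used here purely as an example of a newer API, not the one referenced above.

```python
import torch
import torch.nn.functional as F

if hasattr(F, "scaled_dot_product_attention"):
    # Newer PyTorch: use the fused attention kernel directly.
    attn = F.scaled_dot_product_attention
else:
    def attn(q, k, v):
        # Fallback for older PyTorch versions: plain softmax attention.
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        return scores.softmax(dim=-1) @ v
```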

I added support for gemma-7b. The main non-trivial component here was that `head_dim * n_heads != dim`, so some parts of the model definition needed to be patched. I'm getting...
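A minimal sketch of what that patch looks like, using gemma-7b's published dimensions: `dim = 3072` but `n_heads * head_dim = 16 * 256 = 4096`, so the attention projections can no longer be written as `dim -> dim`.

```python
import torch
import torch.nn as nn

dim, n_heads, head_dim = 3072, 16, 256
assert n_heads * head_dim != dim  # 4096 != 3072

wq = nn.Linear(dim, n_heads * head_dim, bias=False)  # 3072 -> 4096
wo = nn.Linear(n_heads * head_dim, dim, bias=False)  # 4096 -> 3072

x = torch.randn(1, 8, dim)
q = wq(x).view(1, 8, n_heads, head_dim)  # split the 4096 into 16 heads of 256
```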

Yes, you need to set `coordinate_descent_tuning` to True.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch._inductor.config

torch._inductor.config.coordinate_descent_tuning = True

D = 8192

def bench(f, ...
```

I think we can actually relax `coordinate_descent_tuning`, although we still need the BS=1 restriction.

This PR (https://github.com/pytorch/pytorch/pull/120954) always turns on the decomposition.