Horace He

242 comments by Horace He

Actually, I think I was the one who added this haha. For things like int8 quantization, you often don't want to materialize your entire model onto GPU before doing the...
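For context, a minimal sketch of that pattern (a single hypothetical linear layer stands in for the model; names and shapes are illustrative): build the module on the "meta" device so no real weight memory is allocated, quantize on CPU, and move only the int8 tensors to the GPU.

```python
import torch
import torch.nn as nn

# Build on the "meta" device: parameters have shapes/dtypes but no storage,
# so the full-precision model never occupies GPU (or even CPU) memory.
with torch.device("meta"):
    model = nn.Linear(4096, 4096, bias=False)

# Pretend these are the real fp16 weights loaded from a checkpoint on CPU.
weight = torch.randn(4096, 4096, dtype=torch.float16)

# Symmetric per-channel int8 quantization, done on CPU.
scales = weight.abs().amax(dim=1, keepdim=True).float() / 127.0
w_int8 = torch.clamp(torch.round(weight.float() / scales), -128, 127).to(torch.int8)

# Only the int8 weights (half the bytes of fp16) ever touch GPU memory.
w_int8 = w_int8.cuda()
scales = scales.cuda()
```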

@merrymercy ah that would explain my results haha. Thanks!

Yeah, int4 quantization doesn't work on AMD GPUs right now.

This is a good question! I think there are two components to this question:
1. The default FlashAttention kernel is not very performant for decoding (see the sketch below). See https://pytorch.org/blog/flash-decoding/ for more detail.
2. ...
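To make point 1 concrete, a minimal sketch of the decoding-shaped attention call (shapes are illustrative): a single query token attends over a long KV cache, so any parallelism has to come from splitting the KV sequence rather than the length-1 query dimension, which is what flash-decoding does.

```python
import torch
import torch.nn.functional as F

bs, n_heads, head_dim, kv_len = 1, 32, 128, 4096

# One new query token vs. a long cache of keys/values.
q = torch.randn(bs, n_heads, 1, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn(bs, n_heads, kv_len, head_dim, device="cuda", dtype=torch.float16)
v = torch.randn(bs, n_heads, kv_len, head_dim, device="cuda", dtype=torch.float16)

out = F.scaled_dot_product_attention(q, k, v)  # shape: (1, 32, 1, 128)
```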

You don't need to quantize it. The weight matrix is, say, 4096 x 4096; the bias is just another 4096 elements, so about 0.02% of the size.
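For concreteness, the arithmetic behind that percentage:

```python
# A 4096 x 4096 weight matrix vs. a 4096-element bias vector.
weight_elems = 4096 * 4096
bias_elems = 4096
print(f"{bias_elems / weight_elems:.4%}")  # 0.0244%, i.e. roughly 0.02%
```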

The main reason I didn't do this previously was worry that it would cause the code to hard-break on older versions. When was this new API added?
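A sketch of the usual guard against such hard breaks: probe for the API instead of assuming it exists. `scaled_dot_product_attention` (added in PyTorch 2.0) is used here purely as an example of a newer API, not the one referenced above.

```python
import torch
import torch.nn.functional as F

if hasattr(F, "scaled_dot_product_attention"):
    # Newer PyTorch: use the fused attention kernel directly.
    attn = F.scaled_dot_product_attention
else:
    def attn(q, k, v):
        # Fallback for older PyTorch versions: plain softmax attention.
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        return scores.softmax(dim=-1) @ v
```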

I added support for gemma-7b. The main non-trivial component here was that `head_dim * n_heads != dim`, so some parts of the model definition needed to be patched. I'm getting...
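A minimal sketch of what that patch looks like, using gemma-7b's published dimensions: `dim = 3072` but `n_heads * head_dim = 16 * 256 = 4096`, so the attention projections can no longer be written as `dim -> dim`.

```python
import torch
import torch.nn as nn

dim, n_heads, head_dim = 3072, 16, 256
assert n_heads * head_dim != dim  # 4096 != 3072

wq = nn.Linear(dim, n_heads * head_dim, bias=False)  # 3072 -> 4096
wo = nn.Linear(n_heads * head_dim, dim, bias=False)  # 4096 -> 3072

x = torch.randn(1, 8, dim)
q = wq(x).view(1, 8, n_heads, head_dim)  # split the 4096 into 16 heads of 256
```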

Yes, you need to set `coordinate_descent_tuning` to True.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch._inductor.config

torch._inductor.config.coordinate_descent_tuning = True

D = 8192

def bench(f, ...
```

I think we can actually relax `coordinate_descent_tuning`, although we still need the BS=1 restriction.

This PR (https://github.com/pytorch/pytorch/pull/120954) always turns on the decomposition.