Jesse Cai
I also see slowdowns on my A100; I'm not sure of the exact cause. Maybe there were some changes to the int4 kernel in core? I also notice you're running without...
cc @cpuhrsch @HDCharles I think we could do this with flexattention? Flagging just so you are aware there's interest.
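The original request is elided here, but for illustration, a minimal sketch of the flexattention hook being suggested (shapes and the `score_mod` are hypothetical, not from the thread):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

# score_mod is called per attention score; this illustrative example
# applies a causal mask by sending future positions to -inf
def causal(score, b, h, q_idx, kv_idx):
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

q = torch.randn(1, 8, 256, 64, device="cuda")
k = torch.randn(1, 8, 256, 64, device="cuda")
v = torch.randn(1, 8, 256, 64, device="cuda")
out = flex_attention(q, k, v, score_mod=causal)
```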
cc @liangan1 @HDCharles What's the status of this PR? Does it need additional work before it can land?
Hey @mayank64ce `torch.nn.utils.prune.l1_unstructured` is no longer maintained, so I would recommend using the `WeightNormSparsifier` instead. The sparsifier also allows for more configs, like block size or intra-block sparsity. Functionally, however, they...
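For illustration, a minimal sketch of that flow, assuming the `torch.ao.pruning` variant of `WeightNormSparsifier` (the model and `tensor_fqn` are hypothetical):

```python
import torch
from torch.ao.pruning import WeightNormSparsifier

# hypothetical model: a single linear layer to sparsify
model = torch.nn.Sequential(torch.nn.Linear(128, 128))

# sparse_block_shape / zeros_per_block are the block and intra-block
# configs mentioned above; this particular config expresses a 2:4 pattern
sparsifier = WeightNormSparsifier(
    sparsity_level=1.0,
    sparse_block_shape=(1, 4),
    zeros_per_block=2,
)
sparsifier.prepare(model, config=[{"tensor_fqn": "0.weight"}])
sparsifier.step()         # compute masks from weight norms
sparsifier.squash_mask()  # fold the masks into the weights
```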
@agrawal-aka Yes, that's correct: with 2:4 sparsity at 50% we get a max 2x acceleration, but theoretically we can push this higher. The difficulty with unstructured sparsity is that 1)...
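For reference, a minimal sketch of the 2:4 (50%) case using PyTorch's semi-structured sparse support (shapes, dtype, and the mask pattern are illustrative):

```python
import torch
from torch.sparse import to_sparse_semi_structured

# fp16 linear layer on GPU; 2:4 sparsity keeps 2 nonzeros per group of 4
linear = torch.nn.Linear(128, 128).half().cuda()

# a simple 2:4 mask: zero out the first 2 of every 4 contiguous elements
mask = torch.tensor([0, 0, 1, 1], dtype=torch.bool, device="cuda").tile(128, 32)
linear.weight = torch.nn.Parameter(
    to_sparse_semi_structured(linear.weight.masked_fill(~mask, 0))
)

x = torch.randn(64, 128, dtype=torch.float16, device="cuda")
out = linear(x)  # dispatches to the sparse matmul kernels
```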
@agrawal-aka

> Could you clarify at what point in the forward pass the compression and subsequent decompression should occur?

From my understanding, activation compression would be of minimal use during...
cc @namgyu-youn Can you split this into two PRs, one for int8 and one for float8? In general I don't think we want to introduce weight-only sparsity configs for int8...
cc @namgyu-youn I talked to @bbeckca and I think your PR is closer, so let's use it instead. Can you remove the int8 changes then, and I will give this...
@namgyu-youn I think it'll be easier for me to just migrate this over. Mind if I take over the PR? #3182 is also quite far from landing.