Jesse Cai
I also see slowdowns on my A100; I'm not sure of the exact cause. Maybe there were some changes to the int4 kernel in core? I also notice you're running without...
cc @cpuhrsch @HDCharles I think we could do this with flexattention? Flagging just so you are aware there's interest.
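The original request is elided here, but for illustration, a minimal sketch of the flexattention hook being suggested (shapes and the `score_mod` are hypothetical, not from the thread):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

# score_mod is called per attention score; this illustrative example
# applies a causal mask by sending future positions to -inf
def causal(score, b, h, q_idx, kv_idx):
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

q = torch.randn(1, 8, 256, 64, device="cuda")
k = torch.randn(1, 8, 256, 64, device="cuda")
v = torch.randn(1, 8, 256, 64, device="cuda")
out = flex_attention(q, k, v, score_mod=causal)
```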
cc @liangan1 @HDCharles What's the status of this PR? Does it need additional work before it can land?
Hey @mayank64ce `torch.nn.utils.prune.l1_unstructured` is no longer maintained, so I would recommend using the `WeightNormSparsifier` instead. The sparsifier also allows for more configs, like block size or intra-block sparsity. Functionally, however, they...
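For illustration, a minimal sketch of that flow, assuming the `torch.ao.pruning` variant of `WeightNormSparsifier` (the model and `tensor_fqn` are hypothetical):

```python
import torch
from torch.ao.pruning import WeightNormSparsifier

# hypothetical model: a single linear layer to sparsify
model = torch.nn.Sequential(torch.nn.Linear(128, 128))

# sparse_block_shape / zeros_per_block are the block and intra-block
# configs mentioned above; this particular config expresses a 2:4 pattern
sparsifier = WeightNormSparsifier(
    sparsity_level=1.0,
    sparse_block_shape=(1, 4),
    zeros_per_block=2,
)
sparsifier.prepare(model, config=[{"tensor_fqn": "0.weight"}])
sparsifier.step()         # compute masks from weight norms
sparsifier.squash_mask()  # fold the masks into the weights
```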
@agrawal-aka Yes, that's correct: with 2:4 sparsity at 50% we get a max 2x acceleration, but theoretically we can push this higher. The difficulty with unstructured sparsity is that 1)...
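For reference, a minimal sketch of the 2:4 (50%) case using PyTorch's semi-structured sparse support (shapes, dtype, and the mask pattern are illustrative):

```python
import torch
from torch.sparse import to_sparse_semi_structured

# fp16 linear layer on GPU; 2:4 sparsity keeps 2 nonzeros per group of 4
linear = torch.nn.Linear(128, 128).half().cuda()

# a simple 2:4 mask: zero out the first 2 of every 4 contiguous elements
mask = torch.tensor([0, 0, 1, 1], dtype=torch.bool, device="cuda").tile(128, 32)
linear.weight = torch.nn.Parameter(
    to_sparse_semi_structured(linear.weight.masked_fill(~mask, 0))
)

x = torch.randn(64, 128, dtype=torch.float16, device="cuda")
out = linear(x)  # dispatches to the sparse matmul kernels
```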
@agrawal-aka

> Could you clarify at what point in the forward pass the compression and subsequent decompression should occur?

From my understanding, activation compression would be of minimal use during...
cc @namgyu-youn Can you split this into two PRs, one for int8 and one for float8? In general I don't think we want to introduce weight-only sparsity configs for int8...
cc @namgyu-youn I talked to @bbeckca and I think your PR is closer, so let's use it instead. Can you remove the int8 changes then, and I will give this...
@namgyu-youn I think it'll be easier for me to just migrate this over. Mind if I take over the PR? #3182 is also quite far from landing.