Accelerate activation sparsity with activation compression
We've come up with a training recipe for 2:4 activation sparsity, which is outlined in this paper: https://openreview.net/pdf?id=O5feVk7p6Y
The gist of this approach is that:
- We find high levels of activation sparsity (>85%) when training Squared-ReLU based FFNs instead of SwiGLU FFNs. These Squared-ReLU FFNs show minimal to no accuracy loss (see the sketch after this list).
- We accelerate the sparse activation x dense weight matmul with 2:4 sparsity. We can naively sparsify the activations for the forward pass, dropping values that do not fit the 2:4 constraint. For the backward pass, we need some special sauce to maintain accuracy.
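For reference, here is a minimal sketch of a Squared-ReLU FFN next to a SwiGLU FFN (module and dimension names are illustrative, not the exact ones from the paper); the ReLU zeroes out negative pre-activations and squaring keeps them at zero, which is where the high activation sparsity comes from:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SquaredReLUFFN(nn.Module):
    """FFN variant whose hidden activations are highly sparse."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_hidden, bias=False)
        self.w_out = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.relu(self.w_in(x)) ** 2  # sparse activation (>85% zeros reported above)
        return self.w_out(h)           # sparse activation x dense weight matmul

class SwiGLUFFN(nn.Module):
    """Standard SwiGLU FFN for comparison; its hidden activations are dense."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_out = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_out(F.silu(self.w_gate(x)) * self.w_up(x))
```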
However, @janeyx99 pointed out to me that instead of accelerating the model using 2:4 sparsity, we could exploit the high activation sparsity from the first point with activation compression instead. The idea here is that we can use something like nvcomp to compress the sparse Squared-ReLU activations.
We should run some tests to determine what compression ratio (and thus memory savings) we could achieve, as well as whether there is additional compression overhead to account for. A rough proxy for such a test is sketched below.
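As a first pass, one could check how compressible the Squared-ReLU activations are with a generic lossless codec on CPU (zlib below); this is only a stand-in for whatever nvcomp would achieve on GPU, and the activation tensor here is synthetic:

```python
import time
import zlib

import torch

# Synthetic "Squared-ReLU" activation with roughly 90% zeros, stored in fp16.
act = (torch.relu(torch.randn(4096, 11008) - 1.3) ** 2).to(torch.float16)
print(f"sparsity: {(act == 0).float().mean().item():.2%}")

raw = act.numpy().tobytes()
t0 = time.perf_counter()
compressed = zlib.compress(raw, level=1)
t1 = time.perf_counter()
zlib.decompress(compressed)
t2 = time.perf_counter()

print(f"compression ratio: {len(raw) / len(compressed):.2f}x")
print(f"compress: {t1 - t0:.3f}s  decompress: {t2 - t1:.3f}s")
```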
Hi @jcaip, this seems like an interesting take on activation sparsity. I would like to know: if the model activations are highly sparse (>85%), won't restricting them to 50% sparsity create a hard upper bound? I think an unstructured sparse kernel makes more sense in such scenarios, and it also makes a case for CPU inference.
@agrawal-aka Yes, that's correct: we max out at a 2x acceleration with 2:4 sparsity since it is capped at 50%, but theoretically we can push this higher. The difficulty with unstructured sparsity is that 1) it is hard to accelerate on GPU and 2) we need to efficiently create the metadata for the sparse matrix at runtime, since we don't have the activations beforehand. Doing so for a more general sparsity pattern is not something I've considered deeply, but it probably can be done (or at least it should be possible to figure out whether this approach is feasible). I've been thinking about combining this with maybe https://openreview.net/pdf?id=gWHQQagPbN
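To make the runtime-metadata point concrete, here is a naive sketch of the forward-pass 2:4 sparsification (keep the 2 largest-magnitude values in every group of 4); this is just the masking step, not the actual kernels, and packing the result into the hardware's 2:4 format (e.g. via torch.sparse's semi-structured support) is the runtime overhead being discussed:

```python
import torch

def sparsify_2_to_4(x: torch.Tensor) -> torch.Tensor:
    """Naive 2:4 sparsification along the last dim: in each group of 4,
    keep the 2 largest-magnitude entries and zero out the rest."""
    groups = x.reshape(-1, 4)
    top2 = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, top2, True)
    return (groups * mask).reshape(x.shape)

x = torch.randn(8, 16)
x_24 = sparsify_2_to_4(x)
# Every group of 4 now has at most 2 nonzeros.
assert (x_24.reshape(-1, 4) != 0).sum(-1).max() <= 2
```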
CPU inferencing is a good point, but is this something that people care about? If so, I'd love to hear any insights you have here. I've been very GPU focused and am not super familiar with the space.
Hi @jcaip,
Thanks for your response.
I’m interested in exploring how activation compression might be integrated into model inference. Could you clarify at what point in the forward pass the compression and subsequent decompression should occur? Additionally, are there any specific task items or preliminary PR ideas you’re considering for this feature?
On the topic of CPU inferencing, community work from ARM, AWS, Neural Magic, and Cerebras highlights a growing interest in efficiency improvements through quantization and sparsity. For example:
- ARM’s blog on LLM inference on the Neoverse V2 using int4 kernels
- AWS’s posts on optimized PyTorch 2.0 inference with Graviton processors, showing up to 50% cost savings
- AWS’s posts on SLM inferencing using CPUs
- Neural Magic’s demonstrations of significant speedups with fused sparse and quantized kernels
- Cerebras’s exploration of a 70% unstructured sparse LLaMA model achieving high accuracy with CPU inference via DeepSparse
These examples indicate a notable momentum around CPU-based inference, suggesting that further investigation into activation compression could prove valuable across both GPU and CPU contexts. Looking forward to your thoughts!
@agrawal-aka
> Could you clarify at what point in the forward pass the compression and subsequent decompression should occur?
From my understanding, activation compression would be of minimal use during inference, because you don't need to store the activations for the backward pass like you do during training. During training, instead of storing the full activation, you would compress it and store the compressed activation, and during your backward pass, you would decompress it.
I think the only time that this would help for inference is if your model activations don't fit inside your GPU memory, in which case you could load the compressed activations instead of the full ones when doing activation checkpointing. cc @janeyx99 who might know better here.
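A minimal sketch of where this would plug into training, using torch.autograd.graph.saved_tensors_hooks with a placeholder codec (torch's sparse COO format stands in here for a real compressor such as nvcomp):

```python
import torch
from torch.autograd.graph import saved_tensors_hooks

# Placeholder codec: stash saved tensors in sparse COO form.
# A real implementation would call a GPU compressor such as nvcomp here.
def compress(t: torch.Tensor):
    return t.to_sparse()

def decompress(packed: torch.Tensor) -> torch.Tensor:
    return packed.to_dense()

model = torch.nn.Sequential(
    torch.nn.Linear(64, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 64),
)
x = torch.randn(32, 64)

# Every tensor saved for backward inside this block goes through compress();
# it is decompressed on demand when the backward pass needs it.
with saved_tensors_hooks(compress, decompress):
    loss = model(x).pow(2).mean()
loss.backward()
```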
> Additionally, are there any specific task items or preliminary PR ideas you’re considering for this feature?
I think the first step is to see the overhead of these compression routines; I'm unfamiliar with them, so it would be good to know how much memory / loading time we would save. I'm not planning to work on this as I'm busy working on the kernels for 2:4 activation sparsity at the moment, but if you're interested I would gladly accept a PR.
Thanks for the links, will check them out. I think for edge / CPU contexts specifically, there may be more room for speedups as you are more likely to be memory bound than compute bound. cc @syed-shakib7 who might be interested in this as well.
Thanks @jcaip for the clarity about activation compression.
Also, from my understanding, sparse format creation for weight sparsity is a one-time overhead before inference, but as you mentioned, development is in progress for 2:4 activation sparsity. How is the format creation overhead being handled in that case, if we have to do it at runtime?
As you mentioned,
> From my understanding, activation compression would be of minimal use during inference
Currently, I am inclined towards working on inference, especially for CPU use cases. Do let me know if there are any task items or preliminary PR ideas you have in mind from a weight/activation sparsity inference point of view. Would love to collaborate.
> I've been thinking about combining this with maybe https://openreview.net/pdf?id=gWHQQagPbN
@jcaip Hi, I am interested in this (sparse attention). It seems that https://openreview.net/pdf?id=gWHQQagPbN focuses on weight sparsification, so BigBird's sparse attention or FAVOR+ could be options for sparse attention.
But how about focusing on KV-cache compression or attention algorithms like FlashAttention, since that is a well-established field? longctx_bench could be a choice for benchmarking KV-compression techniques.