Issue about Computational Efficiency
Dear Author,
Thank you for your fantastic work on Vision Mamba; I am now following it.
I have some questions about the computational-efficiency analysis in Section 3.5:
I notice that you refer to Algorithm 1 there, so does Equation (8) include the whole computational cost of a Vim block, including the causal convolution and the SSM?
I do not clearly understand how it corresponds to Algorithm 1. Could you give me a simple proof to help me understand?
I also noticed that in Mamba's original GitHub repository, Albert Gu provides a 9BLDN analysis of Mamba's selective scan operation. Have you seen this? Does it approximately align with your analysis?
Thank you again, and I look forward to your response! For your convenience, I will attach issue 110 and Albert Gu's answer.
1. What Equation (8) in the Vim Paper Represents
Equation (8):
Ω(SSM) = 3M(2D)N + M(2D)N²
Here:
M = sequence length
D = hidden dimension
N = state dimension (fixed, e.g. 16 by default)
This is intended to capture the entire computational complexity of a Vim block's SSM module (not just the convolution). The complexity has:
A linear term in M: 3M(2D)N (for the projections and the multiplications with the state).
A quadratic term in N: M(2D)N², because the internal recurrence requires combining state dimensions.
So yes: Eq. (8) includes the whole computational cost of the selective scan (the SSM inside Algorithm 1), including the preprocessing that produces B, C, and Δt and the scan step itself.
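To make the accounting concrete, here is a minimal sketch that evaluates Eq. (8) numerically. `vim_ssm_flops` is a hypothetical helper (it is not part of the Vim codebase), and the example values are assumed Vim-Tiny-like settings.

```python
def vim_ssm_flops(M: int, D: int, N: int = 16) -> int:
    """FLOP estimate for the SSM module of one Vim block, per Eq. (8).

    M: sequence length, D: hidden dimension, N: state dimension.
    The (2D) factor is the expanded inner dimension (expansion ratio E = 2).
    """
    linear_term = 3 * M * (2 * D) * N       # projections / state multiplications
    quadratic_term = M * (2 * D) * N ** 2   # selective-scan recurrence
    return linear_term + quadratic_term

# Assumed Vim-Tiny-like setting: 197 tokens, D = 192, default N = 16
print(vim_ssm_flops(M=197, D=192))  # ~23.0 MFLOPs
```

Because N is fixed (16 by default), both terms grow linearly with the sequence length M, which is the point of the efficiency claim.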
2. How This Corresponds to Algorithm 1
In Algorithm 1 (Vim block), the SSM pipeline has:
Linear projections of the input into B, C, and Δt (captured in the 3M(2D)N term).
The selective-scan recurrence (captured in the M(2D)N² term).
So Eq. (8) is basically a summarized FLOP estimate of everything after the block's input linear projections, matching what Algorithm 1 describes; see the sketch below.
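As a reading aid, here is a hedged PyTorch-style sketch of that pipeline for one scan direction. The shapes follow the paper's notation, but the names (`ssm_pipeline_sketch`, `W_B`, `W_C`, `W_dt`) are illustrative assumptions; the official implementation fuses these steps into an optimized kernel.

```python
import torch

def ssm_pipeline_sketch(x, A, W_B, W_C, W_dt):
    """Illustrative SSM steps of Algorithm 1, single scan direction.

    x: (M, 2D) expanded token sequence, A: (2D, N) state matrix,
    W_B, W_C: (2D, N) projection weights, W_dt: (2D, 2D).
    """
    M, ED = x.shape                               # ED = 2D
    # Step 1: input-dependent parameters -- the projections that the
    # paper's 3M(2D)N term accounts for.
    B = x @ W_B                                   # (M, N)
    C = x @ W_C                                   # (M, N)
    dt = torch.nn.functional.softplus(x @ W_dt)   # (M, 2D) timesteps

    # Step 2: discretize and run the recurrence -- the selective scan,
    # which Eq. (8) charges as the M(2D)N^2 term.
    h = torch.zeros_like(A)                       # (2D, N) hidden state
    ys = []
    for t in range(M):                            # sequential over tokens
        A_bar = torch.exp(dt[t].unsqueeze(-1) * A)    # (2D, N) discretized A
        h = A_bar * h + dt[t].unsqueeze(-1) * B[t]    # state update
        ys.append(h @ C[t])                       # (2D,) per-token readout
    return torch.stack(ys)                        # (M, 2D)
```

Written out this way, the correspondence is direct: the matrix multiplies in step 1 give the linear term, and the per-token recurrence in step 2 is what the paper's quadratic-in-N term accounts for.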
3. Relation to Albert Gu's FLOP Analysis (9BLDN)
Albert Gu's original FLOP formula for Mamba's selective scan is:
9BLDN
Here:
B = batch size
L = sequence length
D = hidden dimension
N = state dimension
That breakdown comes from carefully counting the associative operations in the scan (2 multiplies + 1 add ≈ 3 FLOPs, repeated ~2L times per head, leading to the factor of 9).
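Under the same assumed settings, the kernel-level estimate is easy to evaluate. Note that in Gu's formula, D refers to Mamba's expanded inner dimension (twice the model width in Vim); mapping it that way is my assumption for the comparison.

```python
def selective_scan_flops(B: int, L: int, D: int, N: int) -> int:
    """Kernel-level FLOP estimate for the selective scan: 9 * B * L * D * N.

    B: batch size, L: sequence length, D: inner (expanded) dimension,
    N: state dimension.
    """
    return 9 * B * L * D * N

# Assumed setting: batch 1, 197 tokens, inner dim 2 * 192 = 384, N = 16
print(selective_scan_flops(B=1, L=197, D=384, N=16))  # ~10.9 MFLOPs
```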
4. How They Align
Albert Gu's analysis (9BLDN) is a lower-level FLOP count focused just on the scan kernel (the selective recurrence).
Vim's Eq. (8) wraps this into a higher-level block cost by also including the projections (the B, C, Δt computation) and expressing it in big-O style, i.e. O(MDN + MDN²).
So: they are consistent. The Vim paper's Equation (8) is a simplified asymptotic expression that aligns with the detailed kernel FLOP analysis (9BLDN).
Think of it this way:
Gu: exact FLOP accounting for the recurrence.
Vim paper: block-wise asymptotic complexity (linear in sequence length M, quadratic only in the fixed state size N) that includes both the projections and the scan; a numeric comparison follows below.
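To see the alignment numerically, here is a small side-by-side comparison under the same assumed Vim-Tiny-like setting; the numbers are hypothetical and only meant to show that the two estimates scale identically in sequence length and differ by a modest constant factor.

```python
M, D, N = 197, 192, 16   # tokens, model dim, state dim (assumed values)

vim_eq8 = 3 * M * (2 * D) * N + M * (2 * D) * N ** 2   # Eq. (8): whole SSM module
gu_scan = 9 * 1 * M * (2 * D) * N                      # 9BLDN: scan kernel, batch 1

print(f"Vim Eq. (8):  {vim_eq8 / 1e6:.1f} MFLOPs")     # ~23.0
print(f"Gu 9BLDN:     {gu_scan / 1e6:.1f} MFLOPs")     # ~10.9
print(f"ratio:        {vim_eq8 / gu_scan:.2f}x")       # ~2.1
```

The ratio here is exactly (3 + N) / 9, a constant: the extra factor comes from Eq. (8) also charging the B, C, Δt projections that the kernel count leaves out.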