On a small model, the actual GPU memory usage of Mamba2 is much higher than that of Mamba1.
The Mamba2 model uses d_state=32, d_conv=4, expand=2, and head_dim=32 (with a padded "nn.Conv1d" in place of the causal conv, so the d_model / head_dim % 8 == 0 constraint does not apply). Mamba1 uses the same parameters, minus head_dim, which it does not have. Although Mamba2's inference speed has nearly doubled compared to Mamba1, its actual memory usage has increased from 4.82 GB to 7.55 GB (in my task). Is this caused by the baseline computational cost of Mamba2's semiseparable-matrix formulation, which would put it at a disadvantage at small model scales? I see in your paper that at larger scales Mamba2's actual memory usage is lower.
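For context, here is a minimal sketch of the kind of single-block comparison I am describing. The batch, length, and d_model values are hypothetical placeholders rather than my task's actual shapes, and it assumes the stock mamba_ssm Mamba/Mamba2 classes (with their Triton kernels available) rather than my nn.Conv1d-padded variant:

```python
import torch
from mamba_ssm import Mamba, Mamba2

# Hypothetical shapes for illustration; my real task uses different sizes.
batch, length, d_model = 8, 2048, 512

def peak_mem_gb(block):
    """Peak GPU memory (GB) for one forward pass through a single block."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    x = torch.randn(batch, length, d_model, device="cuda")
    with torch.no_grad():
        block(x)
    return torch.cuda.max_memory_allocated() / 1024**3

# Same hyperparameters as described above; Mamba1 has no head_dim argument.
m1 = Mamba(d_model=d_model, d_state=32, d_conv=4, expand=2).cuda()
m2 = Mamba2(d_model=d_model, d_state=32, d_conv=4, expand=2, headdim=32).cuda()

print(f"Mamba1 peak: {peak_mem_gb(m1):.2f} GB")
print(f"Mamba2 peak: {peak_mem_gb(m2):.2f} GB")
```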