mamba
Sequence parallelism in the mixer (Context Parallelism)
The general question is, does mamba-ssm currently support sequence parallelism in the mixer?
I noticed that Section 8.2 of the Mamba-2 paper proposes a way to split activations across multiple devices while mixing information among tokens. Does the current version of mamba-ssm support this kind of context-parallelism scheme?
By the way, if that can be confirmed, I would expect the suggested implementation to be incorporated into the fast scan algorithm. Since the scan is a parallel tree traversal, each node would be computed on a single device. In the leaf-to-root pass, communication is needed whenever two sibling nodes are computed on different devices, so that the hidden state can be transmitted; in the root-to-leaf pass, communication is triggered in the same way. I have attached a simple illustration of how CP could be implemented. As a result, CP_SIZE would also be determined by the number of children per node in the fast scan algorithm. (Just checking whether I understand this correctly, thanks.)
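To make the idea concrete, here is a minimal single-process sketch in plain Python of the scheme I have in mind. It is not mamba-ssm code and does not use any mamba-ssm API: it assumes a toy scalar recurrence `h_t = a_t * h_{t-1} + b_t`, simulates the devices with list slices, and all function names (`combine`, `chunk_summary`, `exclusive_tree_scan`, `context_parallel_scan`) are hypothetical. Each simulated rank reduces its chunk to one summary, a binary tree scan over those summaries yields the incoming state for every rank, and each rank then finishes its chunk locally. In a real multi-device setting, every `combine` call inside the tree scan whose two operands live on different ranks is where communication would occur.

```python
from typing import List, Tuple

# A summary (a, b) stands for the affine map h -> a * h + b.
Summary = Tuple[float, float]
IDENTITY: Summary = (1.0, 0.0)


def combine(first: Summary, second: Summary) -> Summary:
    """Compose two affine maps: apply `first`, then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a1 * a2, a2 * b1 + b2)


def sequential_scan(a: List[float], b: List[float], h0: float = 0.0) -> List[float]:
    """Reference recurrence h_t = a_t * h_{t-1} + b_t."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def chunk_summary(a: List[float], b: List[float]) -> Summary:
    """Reduce one rank's chunk to a single affine map (purely local work)."""
    s = IDENTITY
    for a_t, b_t in zip(a, b):
        s = combine(s, (a_t, b_t))
    return s


def exclusive_tree_scan(summaries: List[Summary]) -> List[Summary]:
    """Work-efficient exclusive scan over a binary tree.

    Assumes a power-of-two number of entries (one per simulated rank).
    """
    n = len(summaries)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two CP size"
    tree = list(summaries)
    # Leaf-to-root pass: each parent combines its two children.
    step = 1
    while step < n:
        for i in range(2 * step - 1, n, 2 * step):
            tree[i] = combine(tree[i - step], tree[i])
        step *= 2
    # Root-to-leaf pass: push the prefix of each subtree down to its children.
    tree[n - 1] = IDENTITY
    step = n // 2
    while step >= 1:
        for i in range(2 * step - 1, n, 2 * step):
            left_sum = tree[i - step]
            tree[i - step] = tree[i]              # left child inherits parent prefix
            tree[i] = combine(tree[i], left_sum)  # right child also sees left sibling
        step //= 2
    return tree


def context_parallel_scan(a: List[float], b: List[float], cp_size: int) -> List[float]:
    """Split the sequence into cp_size chunks and scan with passed states."""
    n = len(a)
    assert n % cp_size == 0, "sketch assumes equal-sized chunks"
    chunk = n // cp_size
    parts = [(a[r * chunk:(r + 1) * chunk], b[r * chunk:(r + 1) * chunk])
             for r in range(cp_size)]
    prefixes = exclusive_tree_scan([chunk_summary(pa, pb) for pa, pb in parts])
    out: List[float] = []
    for (pa, pb), (_, h_in) in zip(parts, prefixes):
        # With h_0 = 0, the incoming state is the offset of the prefix map.
        out.extend(sequential_scan(pa, pb, h0=h_in))
    return out
```

In this sketch the tree is binary, so the number of simulated ranks must be a power of two; a tree with a different branching factor would change that constraint, which is the relationship between CP_SIZE and the number of children that I am asking about.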