Tri Dao

Results: 432 comments by Tri Dao

@OliverHxh do you have an idea on how to do discretization in pytorch while remaining efficient?
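One way to keep discretization efficient in plain tensor ops is to vectorize the zero-order-hold (ZOH) update across the whole sequence instead of looping. A minimal NumPy sketch, assuming hypothetical shapes `delta: (L, d)`, `A: (d, n)`, `B: (L, n)` (and the simplified Euler-style step for B used in the Mamba paper):

```python
import numpy as np

def discretize(delta, A, B):
    """Vectorized discretization, no Python loop over the sequence.

    delta: (L, d) step sizes, A: (d, n) state matrix (diagonal per channel),
    B: (L, n) input matrix. Shapes are illustrative assumptions.
    """
    dA = np.exp(delta[..., None] * A)            # (L, d, n): A_bar = exp(delta * A)
    dB = delta[..., None] * B[:, None, :]        # (L, d, n): B_bar ~ delta * B (Euler)
    return dA, dB
```

Since every element of `dA`/`dB` is independent, this maps well onto batched GPU kernels; the sequential dependence only enters in the scan itself.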

> "We use the Pile dataset (L. Gao, Biderman, et al. 2020), and follow the training recipe described in Brown et al. (2020)." Is this for the transformer models, or...

In general, yes. Which flavor of sequence parallelism are you referring to? The one in Megatron-LM?

Nothing is built-in, but it'll be implemented in the future.

Do you want to send a PR for the conv1d? The selective_scan operation is also implemented in CUDA, but there's a reference implementation in PyTorch (probably quite slow).
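For intuition, the reference implementation amounts to a plain sequential loop over the sequence. A NumPy sketch (not the repo's actual code; shapes and names are illustrative) of why it is slow, with `u: (L, d)` inputs, `delta: (L, d)`, `A: (d, n)`, `B, C: (L, n)`:

```python
import numpy as np

def selective_scan_ref(u, delta, A, B, C):
    """Naive selective scan: h_t = A_bar_t * h_{t-1} + B_bar_t * u_t, y_t = C_t . h_t."""
    L, d = u.shape
    n = A.shape[1]
    dA = np.exp(delta[..., None] * A)                      # (L, d, n)
    dBu = delta[..., None] * B[:, None, :] * u[..., None]  # (L, d, n)
    h = np.zeros((d, n))
    ys = []
    for t in range(L):  # sequential over seq_len -- this loop is the bottleneck
        h = dA[t] * h + dBu[t]
        ys.append((h * C[t][None, :]).sum(-1))             # y_t = C_t . h_t per channel
    return np.stack(ys)                                    # (L, d)
```

The CUDA kernel fuses the discretization and the recurrence and avoids materializing the `(L, d, n)` intermediates, which is where most of the speedup comes from.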

Yes, there should be ways to deal with variable lengths. It's not implemented yet, however.

It's theoretically possible to process variable lengths / packed sequences, but the implementation will be a bit tricky. Parallelizing over the seq_len dimension reduces to how one would parallelize an associative scan...
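Concretely, the recurrence h_t = a_t * h_{t-1} + b_t is associative under the composition (a1, b1) then (a2, b2) -> (a2*a1, a2*b1 + b2), so it can be computed in O(log L) parallel steps instead of L sequential ones. A minimal NumPy sketch of a Hillis-Steele-style doubling scan (illustrative, not the CUDA kernel):

```python
import numpy as np

def linear_recurrence_scan(a, b):
    """Solve h_t = a_t * h_{t-1} + b_t (with h_{-1} = 0) via an associative scan.

    Each doubling step composes each element with the one `step` positions
    to its left; after ceil(log2(L)) steps, b[t] holds h_t.
    """
    a = a.astype(float).copy()
    b = b.astype(float).copy()
    L = len(a)
    step = 1
    while step < L:
        a_prev = np.concatenate([np.ones(step), a[:-step]])    # identity pad on the left
        b_prev = np.concatenate([np.zeros(step), b[:-step]])
        b = a * b_prev + b      # (a_prev, b_prev) composed with (a, b)
        a = a * a_prev
        step *= 2
    return b
```

For packed variable-length sequences, the extra trickiness is resetting the state at sequence boundaries, e.g. by forcing a_t = 0 at the first token of each sequence so no state leaks across documents.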

The model seems very small, but the GPU also only has 4GB of memory? Maybe try different layers (e.g. MLP) of similar sizes to see if those also OOM. If...

We don't have experience with TensorFlow, but we welcome contributions.