Tri Dao


That extrapolation was for a simple synthetic task (induction head). For language modeling it remains to be seen.

You can also finetune Mamba on long documents. Regarding "context extrapolation" without fine-tuning, the short answer is ... I don't know. The architecture is new and different from the Transformer, and there...

There's no restriction, e.g. you can just pass in a sequence of length 8k to finetune.
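
A minimal sketch of what that looks like with the `mamba_ssm` package (the checkpoint name, dtype, and loss setup below are illustrative choices, not part of the original comment):

```python
# Minimal sketch: fine-tuning on an 8k-token sequence is just a longer input_ids tensor.
# "state-spaces/mamba-130m" and the bf16 dtype are illustrative choices.
import torch
import torch.nn.functional as F
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

model = MambaLMHeadModel.from_pretrained("state-spaces/mamba-130m", device="cuda", dtype=torch.bfloat16)
input_ids = torch.randint(0, 50277, (1, 8192), device="cuda")  # stand-in for an 8k-token document

logits = model(input_ids).logits
# standard next-token prediction loss over the whole 8k sequence
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, logits.size(-1)).float(),
    input_ids[:, 1:].reshape(-1),
)
loss.backward()
```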

No, that's not supported right now.

Yes, we do exactly the same thing (which is now standard in several libraries): tokenize all documents, append an "eos" token to the end of each document, concatenate all of them,...
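
For concreteness, a sketch of that packing recipe, using the GPT-NeoX tokenizer as a stand-in; the final chunking into fixed-length examples is the usual last step, added here for completeness:

```python
# Sketch of the standard packing pipeline: tokenize, append eos per document,
# concatenate, then chunk into fixed-length training examples.
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
documents = ["first document ...", "second document ..."]  # toy stand-ins for a real corpus

ids = []
for doc in documents:
    ids.extend(tokenizer(doc)["input_ids"])
    ids.append(tokenizer.eos_token_id)   # eos marks the document boundary
ids = np.asarray(ids, dtype=np.int64)

seqlen = 2048
n = (len(ids) // seqlen) * seqlen        # drop the ragged tail
examples = ids[:n].reshape(-1, seqlen)   # each row is one training example
# (with these toy documents the stream is shorter than one chunk, so `examples` is empty here)
```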

We just follow GPT-3, e.g. for 7B you can use 64 layers (2 Mamba layers have the same number of params as MLP + attn) and d_model = 4096.
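
Rough arithmetic behind that sizing, ignoring embeddings, biases, and the small SSM/conv parameters (the per-block counts below assume the standard 4x MLP and Mamba's default expand=2):

```python
# Why 2 Mamba layers ~ 1 Transformer block (attn + MLP), roughly, at d_model = 4096.
d = 4096

attn = 4 * d * d                          # q, k, v, out projections
mlp = 8 * d * d                           # 4x expansion: up + down projections
transformer_block = attn + mlp            # ~12 d^2

# Mamba block with expand=2: in_proj d -> 2*(2d), out_proj 2d -> d
mamba_block = 4 * d * d + 2 * d * d       # ~6 d^2

print(transformer_block / mamba_block)    # ~2.0
print(64 * mamba_block / 1e9)             # ~6.4B params from 64 Mamba layers, i.e. ~7B with the rest
```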

Unfortunately we only have the fully trained weights.

The paper describes the hyperparameters we used. When increasing the sequence length we decrease the batch size (i.e. keeping the total number of tokens in the batch the same), and keep other...
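
A sketch of that scaling rule (the tokens-per-batch figure is illustrative, not the paper's exact setting):

```python
# Hold tokens-per-batch fixed, so the batch size shrinks as the sequence length grows.
tokens_per_batch = 256 * 2048             # illustrative, ~0.5M tokens per batch

for seqlen in (2048, 4096, 8192):
    batch_size = tokens_per_batch // seqlen
    print(f"seqlen={seqlen:5d}  batch_size={batch_size}")
# seqlen= 2048  batch_size=256
# seqlen= 4096  batch_size=128
# seqlen= 8192  batch_size=64
```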

> @tridao what would it take to support passing state between forward passes? I can see it's possible to do this via inference_params, where's the catch?

inference_params supports moving the...
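
For reference, a minimal sketch of that pattern with the `mamba_ssm` generation utilities (the checkpoint name and greedy loop are illustrative, and the exact `InferenceParams` fields may differ across versions):

```python
# Sketch: carry the recurrent state across forward passes via inference_params.
import torch
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel
from mamba_ssm.utils.generation import InferenceParams

model = MambaLMHeadModel.from_pretrained("state-spaces/mamba-130m", device="cuda", dtype=torch.float16)

prompt = torch.randint(0, 50277, (1, 32), device="cuda")        # stand-in for a tokenized prompt
inference_params = InferenceParams(max_seqlen=2048, max_batch_size=1)

# Prompt pass: per-layer conv/SSM states are cached inside inference_params.
logits = model(prompt, inference_params=inference_params).logits
inference_params.seqlen_offset += prompt.shape[1]

# Later passes: one token at a time, reusing (and updating in place) the cached state.
next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
for _ in range(16):
    logits = model(next_token, inference_params=inference_params).logits
    inference_params.seqlen_offset += 1
    next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
```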

You'd want to replace `ParallelTransformerLayer` in [Megatron-LM](https://github.com/NVIDIA/Megatron-LM/blob/2fd2882f4648c71993ded0090b2fbd41a1f71583/megatron/model/transformer.py#L849) with a Mamba layer. Should work if you don't use tensor parallel / pipeline parallel in Megatron-LM.
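
As a rough illustration of the shape of that swap (not the Megatron integration itself: Megatron layers take extra arguments and use a (seqlen, batch, hidden) layout that a real replacement would have to handle), a pre-norm residual block around the Mamba mixer looks like:

```python
# Rough illustration only: a pre-norm residual block around the Mamba mixer,
# the kind of unit you'd swap in where a Transformer layer used to be.
import torch
import torch.nn as nn
from mamba_ssm import Mamba

class MambaLayer(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mixer = Mamba(d_model=d_model)

    def forward(self, hidden_states, *unused_args, **unused_kwargs):
        # extra positional/keyword args (e.g. attention_mask) are accepted and ignored
        return hidden_states + self.mixer(self.norm(hidden_states))

layer = MambaLayer(d_model=1024).to("cuda")
x = torch.randn(2, 4096, 1024, device="cuda")   # (batch, seqlen, d_model)
y = layer(x)
assert y.shape == x.shape
```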