Tri Dao


That extrapolation was for a simple synthetic task (induction head). For language modeling it remains to be seen.

You can also finetune Mamba on long documents. Regarding "context extrapolation" without fine-tuning, the short answer is ... I don't know. The architecture is new and different from the Transformer, and there...

There's no restriction, e.g. you can just pass in a sequence of length 8k to finetune.
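
A minimal sketch of what that looks like with the `mamba_ssm` package (the checkpoint name, dtype, and loss setup below are illustrative choices, not part of the original comment):

```python
# Minimal sketch: fine-tuning on an 8k-token sequence is just a longer input_ids tensor.
# "state-spaces/mamba-130m" and the bf16 dtype are illustrative choices.
import torch
import torch.nn.functional as F
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

model = MambaLMHeadModel.from_pretrained("state-spaces/mamba-130m", device="cuda", dtype=torch.bfloat16)
input_ids = torch.randint(0, 50277, (1, 8192), device="cuda")  # stand-in for an 8k-token document

logits = model(input_ids).logits
# standard next-token prediction loss over the whole 8k sequence
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, logits.size(-1)).float(),
    input_ids[:, 1:].reshape(-1),
)
loss.backward()
```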

No, that's not supported right now.

Yes, we do exactly the same thing (which is now standard in several libraries): tokenize all documents, append an "eos" token to the end of each document, concatenate all of them,...
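
For concreteness, a sketch of that packing recipe, using the GPT-NeoX tokenizer as a stand-in; the final chunking into fixed-length examples is the usual last step, added here for completeness:

```python
# Sketch of the standard packing pipeline: tokenize, append eos per document,
# concatenate, then chunk into fixed-length training examples.
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
documents = ["first document ...", "second document ..."]  # toy stand-ins for a real corpus

ids = []
for doc in documents:
    ids.extend(tokenizer(doc)["input_ids"])
    ids.append(tokenizer.eos_token_id)   # eos marks the document boundary
ids = np.asarray(ids, dtype=np.int64)

seqlen = 2048
n = (len(ids) // seqlen) * seqlen        # drop the ragged tail
examples = ids[:n].reshape(-1, seqlen)   # each row is one training example
# (with these toy documents the stream is shorter than one chunk, so `examples` is empty here)
```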

We just follow GPT-3, e.g. for 7B you can use 64 layers (2 Mamba layers have the same number of params as MLP + attn) and d_model = 4096.
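
Rough arithmetic behind that sizing, ignoring embeddings, biases, and the small SSM/conv parameters (the per-block counts below assume the standard 4x MLP and Mamba's default expand=2):

```python
# Why 2 Mamba layers ~ 1 Transformer block (attn + MLP), roughly, at d_model = 4096.
d = 4096

attn = 4 * d * d                          # q, k, v, out projections
mlp = 8 * d * d                           # 4x expansion: up + down projections
transformer_block = attn + mlp            # ~12 d^2

# Mamba block with expand=2: in_proj d -> 2*(2d), out_proj 2d -> d
mamba_block = 4 * d * d + 2 * d * d       # ~6 d^2

print(transformer_block / mamba_block)    # ~2.0
print(64 * mamba_block / 1e9)             # ~6.4B params from 64 Mamba layers, i.e. ~7B with the rest
```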

Unfortunately we only have the fully trained weights.

The paper describes the hyperparameters we used. When increasing the sequence length we decrease the batch size (i.e. keeping the total number of tokens in the batch the same), and keep other...
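
A sketch of that scaling rule (the tokens-per-batch figure is illustrative, not the paper's exact setting):

```python
# Hold tokens-per-batch fixed, so the batch size shrinks as the sequence length grows.
tokens_per_batch = 256 * 2048             # illustrative, ~0.5M tokens per batch

for seqlen in (2048, 4096, 8192):
    batch_size = tokens_per_batch // seqlen
    print(f"seqlen={seqlen:5d}  batch_size={batch_size}")
# seqlen= 2048  batch_size=256
# seqlen= 4096  batch_size=128
# seqlen= 8192  batch_size=64
```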

> @tridao what would it take to support passing state between forward passes? I can see it's possible to do this via inference_params, where's the catch?

inference_params supports moving the...
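
For reference, a minimal sketch of that pattern with the `mamba_ssm` generation utilities (the checkpoint name and greedy loop are illustrative, and the exact `InferenceParams` fields may differ across versions):

```python
# Sketch: carry the recurrent state across forward passes via inference_params.
import torch
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel
from mamba_ssm.utils.generation import InferenceParams

model = MambaLMHeadModel.from_pretrained("state-spaces/mamba-130m", device="cuda", dtype=torch.float16)

prompt = torch.randint(0, 50277, (1, 32), device="cuda")        # stand-in for a tokenized prompt
inference_params = InferenceParams(max_seqlen=2048, max_batch_size=1)

# Prompt pass: per-layer conv/SSM states are cached inside inference_params.
logits = model(prompt, inference_params=inference_params).logits
inference_params.seqlen_offset += prompt.shape[1]

# Later passes: one token at a time, reusing (and updating in place) the cached state.
next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
for _ in range(16):
    logits = model(next_token, inference_params=inference_params).logits
    inference_params.seqlen_offset += 1
    next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
```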

You'd want to replace `ParallelTransformerLayer` in [Megatron-LM](https://github.com/NVIDIA/Megatron-LM/blob/2fd2882f4648c71993ded0090b2fbd41a1f71583/megatron/model/transformer.py#L849) with a Mamba layer. Should work if you don't use tensor parallel / pipeline parallel in Megatron-LM.
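
As a rough illustration of the shape of that swap (not the Megatron integration itself: Megatron layers take extra arguments and use a (seqlen, batch, hidden) layout that a real replacement would have to handle), a pre-norm residual block around the Mamba mixer looks like:

```python
# Rough illustration only: a pre-norm residual block around the Mamba mixer,
# the kind of unit you'd swap in where a Transformer layer used to be.
import torch
import torch.nn as nn
from mamba_ssm import Mamba

class MambaLayer(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mixer = Mamba(d_model=d_model)

    def forward(self, hidden_states, *unused_args, **unused_kwargs):
        # extra positional/keyword args (e.g. attention_mask) are accepted and ignored
        return hidden_states + self.mixer(self.norm(hidden_states))

layer = MambaLayer(d_model=1024).to("cuda")
x = torch.randn(2, 4096, 1024, device="cuda")   # (batch, seqlen, d_model)
y = layer(x)
assert y.shape == x.shape
```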