
About max token length

Open RevolGMPHL opened this issue 1 year ago • 28 comments

What is the max token length that this model can support? Can it support more than 10k?

RevolGMPHL avatar Dec 05 '23 08:12 RevolGMPHL

It was trained with seqlen=2k for an apples-to-apples comparison with Pythia. It seems to extrapolate to around 3k context length, but beyond that the quality is much worse.

tridao avatar Dec 05 '23 08:12 tridao

If I train on a dataset with longer sequences, will that improve the max token length? Does it have anything to do with the size of the model?

RevolGMPHL avatar Dec 05 '23 09:12 RevolGMPHL

Yes training on longer context (e.g. 4k or 8k) should help improve max token length. I think this is a general property of most sequence models (e.g. Transformers should be similar).

tridao avatar Dec 05 '23 09:12 tridao

How should we understand Table 2 in Mamba's paper, which shows great extrapolation ability? 🤔 As the paper shows, Mamba can be trained at seqlen = 10^3 and tested at seqlen = 10^6 with good performance. 🤔

EricLina avatar Dec 15 '23 07:12 EricLina

That extrapolation was for a simple synthetic task (induction head). For language modeling it remains to be seen.

tridao avatar Dec 17 '23 01:12 tridao

Language models based on the Transformer architecture can extrapolate beyond the training context by adjusting the position encoding, which may also require fine-tuning on longer documents. There are also techniques that mitigate the degradation of performance during context extrapolation by filtering the KV cache.

I would like to understand the model structure and design of the Mamba S6, and whether there are similar technical solutions suitable for context extrapolation. Thank you.

ftgreat avatar Dec 20 '23 07:12 ftgreat

You can also finetune Mamba on long documents. Regarding "context extrapolation" without fine-tuning, the short answer is ... I don't know. The architecture is new and different from Transformers, and there are still lots of interesting research questions.

tridao avatar Dec 20 '23 07:12 tridao

Thanks very much.

I am currently not familiar with the inner details of the Mamba SSM module. May I ask whether there are any parameters whose shapes are tied to a preset context length?

ftgreat avatar Dec 20 '23 07:12 ftgreat

There's no restriction, e.g. you can just pass in sequences of length 8k to finetune.
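For illustration, a minimal sketch (following the standalone Mamba block example in this repo's README; the sizes are arbitrary) showing that no parameter shape depends on the sequence length:

```python
import torch
from mamba_ssm import Mamba

# Same module, different sequence lengths: nothing in the parameter
# shapes depends on the context length seen at training time.
model = Mamba(d_model=256, d_state=16, d_conv=4, expand=2).to("cuda")

for seqlen in (2048, 8192):
    x = torch.randn(1, seqlen, 256, device="cuda")  # (batch, length, dim)
    y = model(x)
    assert y.shape == x.shape
```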

tridao avatar Dec 20 '23 08:12 tridao

@tridao Does Mamba support passing state between multiple forward passes (or blocks of tokens) during training?

sentialx avatar Dec 23 '23 01:12 sentialx

No, that's not supported right now.

tridao avatar Dec 23 '23 01:12 tridao

@tridao one more question about dataset processing when pretraining the mamba-2.8b models.

As the GPT-3 paper says, "During training we always train on sequences of the full n_ctx = 2048 token context window, packing multiple documents into a single sequence when documents are shorter than 2048, in order to increase computational efficiency."

Did the released Mamba models use the same packing trick for their datasets? Thanks.

ftgreat avatar Dec 25 '23 10:12 ftgreat

Yes, we do exactly the same thing (which is now standard in several libraries): tokenize all documents, append an "eos" token to the end of each document, concatenate all of them, then split into chunks of size 2048.
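For reference, a minimal sketch of that packing scheme (assumes a HuggingFace-style tokenizer; the GPT-NeoX tokenizer name here is illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
SEQLEN = 2048

def pack_documents(documents):
    # 1) tokenize each document, 2) append EOS, 3) concatenate everything,
    # 4) split the long token stream into fixed-size chunks of SEQLEN tokens
    stream = []
    for doc in documents:
        stream.extend(tokenizer(doc)["input_ids"])
        stream.append(tokenizer.eos_token_id)
    n_chunks = len(stream) // SEQLEN
    return [stream[i * SEQLEN:(i + 1) * SEQLEN] for i in range(n_chunks)]
```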

tridao avatar Dec 25 '23 18:12 tridao

@tridao one more question please.

How should the number of layers and the model dim be set for roughly 7B Mamba models, and are there design rules for these settings when scaling up the model size?

Thanks.

ftgreat avatar Dec 26 '23 11:12 ftgreat

We just follow GPT-3, e.g. for 7B you can use 64 layers (2 Mamba layers have the same number of params as MLP + attn) and d_model = 4096.
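As a rough sanity check of that sizing (an approximation only: conv1d, x_proj, dt_proj, norms, and biases are ignored, and the vocab size is illustrative):

```python
# Per the reply above, two Mamba layers match one attention + MLP pair
# (~12 * d_model^2 params), so one Mamba layer is roughly 6 * d_model^2
# (in_proj ~4 d^2 + out_proj ~2 d^2 with expand=2).
def approx_mamba_params(n_layer: int, d_model: int, vocab_size: int = 50_280) -> float:
    per_layer = 6 * d_model ** 2          # dominant projections only
    embedding = vocab_size * d_model      # tied input/output embedding (assumed)
    return n_layer * per_layer + embedding

print(f"{approx_mamba_params(64, 4096) / 1e9:.1f}B")   # ~6.6B, i.e. in the 7B class
```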

tridao avatar Dec 26 '23 17:12 tridao

We just follow GPT-3, e.g. for 7B you can use 64 layers (2 Mamba layers have the same number of params as MLP + attn) and d_model = 4096.

Thanks.

ftgreat avatar Dec 27 '23 02:12 ftgreat

@tridao could you release a mamba-1.4B intermediate checkpoint trained to around 100B tokens?

I have trained mamba-1.4B from scratch on a zh-en corpus. If a checkpoint at around 100B tokens is provided, I can compare metrics to validate my process.

Thanks

ftgreat avatar Dec 28 '23 01:12 ftgreat

Unfortunately we only have the fully trained weights.

tridao avatar Dec 28 '23 01:12 tridao

Unfortunately we only have the fully trained weights.

Thanks for your reply.

ftgreat avatar Dec 28 '23 01:12 ftgreat

@tridao When scaling up the max length for language-modeling pretraining from scratch, could you please give us some advice on how to set hyperparameters like lr, warmup, global batch size, etc.?

Thank you.

ftgreat avatar Dec 29 '23 09:12 ftgreat

The paper describes the hyperparameters we used. When increasing the sequence length we decrease the batch size (i.e. keeping the total number of tokens per batch the same) and keep the other hparams the same. I'm not sure that's optimal, but it's what I've been using.
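A worked example of that rule (the 0.5M tokens-per-step figure is illustrative, not necessarily the value used for the released models):

```python
# Keep tokens per optimizer step constant while growing the context length.
tokens_per_step = 2048 * 256            # e.g. 2k context x 256 sequences
for seqlen in (2048, 4096, 8192):
    print(seqlen, tokens_per_step // seqlen)   # 256, 128, 64 sequences per batch
```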

tridao avatar Dec 29 '23 09:12 tridao

@tridao what would it take to support passing state between forward passes? I can see it's possible to do this via inference_params, where's the catch?

sentialx avatar Jan 05 '24 12:01 sentialx

@tridao what would it take to support passing state between forward passes? I can see it's possible to do this via inference_params, where's the catch?

inference_params supports moving the state forward by 1 step (i.e. recurrence). If you want to pass states along with sequences longer than 1 token, you'd need to change the parallel scan (in selective_scan) to handle that.
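To make the limitation concrete, here is a conceptual sketch (plain PyTorch, not the fused selective_scan kernel, whose internals differ) of what carrying state across chunks means for a diagonal linear recurrence; the fused kernel currently assumes the initial state is zero on every call:

```python
import torch

def chunked_recurrence(A, B, x, h0=None):
    """h_t = A_t * h_{t-1} + B_t * x_t over (batch, length, dim) inputs,
    optionally resuming from the final state of a previous chunk."""
    h = torch.zeros_like(x[:, 0]) if h0 is None else h0
    outs = []
    for t in range(x.shape[1]):
        h = A[:, t] * h + B[:, t] * x[:, t]   # elementwise (diagonal) update
        outs.append(h)
    return torch.stack(outs, dim=1), h        # per-step outputs, final state

# Passing state between two "forward passes" would then look like:
#   y1, h = chunked_recurrence(A1, B1, x1)
#   y2, h = chunked_recurrence(A2, B2, x2, h0=h)
```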

tridao avatar Jan 05 '24 17:01 tridao

Mamba can be used as a drop-in replacement module in some frameworks.

Megatron-LM is designed only for Transformer blocks. How can we integrate Mamba into it? Could you give some advice? Thanks.

Sorry to bother you both. @tridao @albertfgu

ftgreat avatar Jan 12 '24 09:01 ftgreat

You'd want to replace the ParallelTransformerLayer in Megatron-LM with a Mamba layer. It should work if you don't use TensorParallel / Pipeline Parallel in Megatron-LM.
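A conceptual sketch of that substitution (untested against Megatron-LM; the class and argument names are illustrative), wrapping the Mamba block in a pre-norm residual layer that exposes a Transformer-layer-like interface:

```python
import torch.nn as nn
from mamba_ssm import Mamba

class MambaLayer(nn.Module):
    """Stand-in for a Transformer layer: hidden_states -> hidden_states."""
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mixer = Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2)

    def forward(self, hidden_states, attention_mask=None):
        # attention_mask is accepted only for interface compatibility;
        # Mamba is causal by construction and ignores it.
        return hidden_states + self.mixer(self.norm(hidden_states))
```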

tridao avatar Jan 12 '24 19:01 tridao

You'd want to replace the ParallelTransformerLayer in Megatron-LM with a Mamba layer. It should work if you don't use TensorParallel / Pipeline Parallel in Megatron-LM.

Thanks a lot. Without TensorParallel / Pipeline Parallel, there's no need to use Megatron-LM for model-size scaling.

ftgreat avatar Jan 13 '24 01:01 ftgreat

@tridao If causal_conv1d_fn is not available, how does the normal conv1d behave causally? Thanks

https://github.com/state-spaces/mamba/blob/main/mamba_ssm/modules/mamba_simple.py#L168

ftgreat avatar Jan 19 '24 09:01 ftgreat

As the code shows, it constructs nn.Conv1d with padding=3 (if the conv has width 4), does the convolution, then removes the last 3 elements.
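Concretely, a minimal sketch of that fallback path (the sizes are illustrative):

```python
import torch
import torch.nn as nn

d_model, d_conv, seqlen = 16, 4, 10
# Depthwise conv padded by d_conv - 1 = 3 on each side; trimming the last
# 3 outputs makes position t depend only on inputs <= t (i.e. causal).
conv1d = nn.Conv1d(d_model, d_model, kernel_size=d_conv,
                   padding=d_conv - 1, groups=d_model)

x = torch.randn(2, d_model, seqlen)     # (batch, channels, length)
y = conv1d(x)[..., :seqlen]             # drop the trailing d_conv - 1 steps
assert y.shape == x.shape
```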

tridao avatar Jan 19 '24 09:01 tridao