
Is Context Length dependent on training data's context?

Open RonanKMcGovern opened this issue 1 year ago • 6 comments

I notice that passkey retrieval works well up to around 3-4k tokens. After that, it doesn't.

That wasn't my intuition for SSMs, but I guess usable context length is still tied to the context length seen during training? It's just that, given a longer training context, SSMs will be much more efficient than transformers (linear rather than quadratic) at inference?
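For reference, a passkey-retrieval probe is usually built by burying a short passkey line inside filler text and asking the model to recall it. Below is a minimal sketch of that kind of prompt construction (the filler text, passkey format, and function name are illustrative assumptions, not the exact harness used here):

```python
import random

def make_passkey_prompt(n_filler: int, passkey: int) -> str:
    """Build a needle-in-a-haystack prompt: repeated filler sentences with a
    passkey sentence inserted at a random position, then a retrieval question.
    Increasing n_filler lengthens the prompt, which is how the 3-4k failure
    point can be probed."""
    filler = "The grass is green. The sky is blue. The sun is yellow."
    needle = f"The pass key is {passkey}. Remember it. {passkey} is the pass key."
    lines = [filler] * n_filler
    lines.insert(random.randint(0, n_filler), needle)
    lines.append("What is the pass key?")
    return " ".join(lines)

# Example: sweep n_filler to vary the prompt length and find where retrieval breaks.
prompt = make_passkey_prompt(n_filler=200, passkey=68427)
```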

RonanKMcGovern avatar Jan 29 '24 12:01 RonanKMcGovern

The models were trained with 2k context, so it's cool that passkey retrieval works up to 3-4k tokens. It would be cool to train Mamba with longer context and see how it performs on passkey retrieval. It's still an open question.

tridao avatar Jan 29 '24 19:01 tridao

I'll give it a go.

Length extrapolation in general seems hard. Possibly there needs to be a decay function, although one might think that, with the state being overwritten, decay could already be built in. I suppose that's not the case in practice, because during training the model never sees examples (even if trained on 100k tokens) where the correct behaviour is to ignore material at the start of the context. If anything, many training datasets encourage models to pay a lot of attention to the start of the sequence. If the model saw my lifetime of text alongside what I actually remember now, then perhaps back-propagation would capture this sense of decay.
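For what it's worth, the "built-in decay" intuition can be made concrete with a toy scalar version of the SSM recurrence (a simplified stand-in for Mamba's selective scan; everything below is illustrative):

```python
import numpy as np

def toy_scan(a, b, x):
    """Toy scalar SSM recurrence: h_t = a_t * h_{t-1} + b_t * x_t.
    In Mamba the analogue of a_t is exp(dt_t * A) with A negative, so each step
    multiplies the old state by a factor in (0, 1): an input-dependent decay.
    Whether that decay gets used sensibly at unseen lengths is up to training."""
    h, states = 0.0, []
    for a_t, b_t, x_t in zip(a, b, x):
        h = a_t * h + b_t * x_t
        states.append(h)
    return np.array(states)

# With a_t = 0.9 everywhere, the first input's contribution shrinks as 0.9**t.
states = toy_scan(a=[0.9] * 50, b=[1.0] * 50, x=[1.0] + [0.0] * 49)
```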

RonanKMcGovern avatar Jan 29 '24 20:01 RonanKMcGovern

I have tried to address this in my Mamba training by normalizing the loss across the training context so that it increases linearly with position: the first token contributes 1/context_length * cross_entropy_loss and the last token contributes context_length/context_length * cross_entropy_loss. It seems to help with learning state composition and with preventing overfitting to the initial tokens. I am also thinking of trying an exponential or quadratic curve over the sequence.
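A rough sketch of what that linear position weighting could look like in PyTorch (the function below is an illustration of the idea, not the actual training code):

```python
import torch
import torch.nn.functional as F

def position_weighted_ce(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Cross-entropy with a linear per-position weight ramp: token i (0-indexed)
    is weighted by (i + 1) / seq_len, so the first token contributes 1/seq_len
    of its loss and the last token contributes its full loss."""
    batch, seq_len, vocab = logits.shape
    per_token = F.cross_entropy(
        logits.reshape(-1, vocab), targets.reshape(-1), reduction="none"
    ).reshape(batch, seq_len)
    weights = torch.arange(1, seq_len + 1, device=logits.device) / seq_len
    return (per_token * weights).mean()
```

An exponential or quadratic curve would just swap the `weights` ramp for something like `(torch.arange(1, seq_len + 1) / seq_len) ** 2`.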

Corallus-Caninus avatar Aug 21 '24 17:08 Corallus-Caninus

interesting idea, not seen that before! Is it measurably better in any way?

albertfgu avatar Aug 21 '24 17:08 albertfgu

Interesting, what's your thinking on scaling down the loss at the start of the sequence? I would probably have started by scaling down the loss for the latest tokens...

RonanKMcGovern avatar Aug 22 '24 09:08 RonanKMcGovern