Why have a default max_seq_length of 256?
I noticed that both the data prep / tokenization script (https://github.com/Lightning-AI/lit-parrot/blob/main/scripts/prepare_alpaca.py#L26) and the fine-tuning scripts (https://github.com/Lightning-AI/lit-parrot/blob/main/finetune/adapter.py#L41, https://github.com/Lightning-AI/lit-parrot/blob/main/finetune/adapter_v2.py#L46) default to max_seq_length=256.
While this does speed up tokenization, it has the unfortunate side effect of truncating fine-tuning inputs, and it requires editing both scripts to use a model's full context length. I'm curious why this parameter was added and whether it might be possible to switch to a default of None or 4096?
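To make the truncation concrete, here is a minimal sketch of the failure mode (the function and tokenizer call are hypothetical illustrations, not the actual prepare_alpaca.py code):

```python
# Hypothetical sketch: how a small max_seq_length silently truncates examples.
def tokenize_example(tokenizer, text: str, max_seq_length: int = 256) -> list:
    token_ids = tokenizer.encode(text)
    # Everything past max_seq_length is dropped, so long instruction/response
    # pairs lose their tails before the model ever sees them in fine-tuning.
    return token_ids[:max_seq_length]
```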
That is a good observation. I agree that we should remove the max_seq_length value passed to the model forward and separate it from the max_length used to split the dataset.
I don't know why this was chosen originally. Perhaps a mixup.
Would you be interested in opening a PR?
Sure, let me take a look tomorrow.
One question: should the data prep stage just take the model checkpoint dir (or model name) as an option so it can look up the model's `block_size`? I just wasted a bunch of time tokenizing fine-tuning data with max_seq_length=4096, only to realize that falcon-7b-instruct has a block size of 2048. This makes me suspect that it's best to avoid setting this parameter manually and instead have it derived automatically when tokenizing for a specific LLM.
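As a rough illustration, the lookup could work along these lines (a hedged sketch, not an actual implementation; it assumes the checkpoint dir name matches a known model config and that lit-parrot's Config exposes block_size):

```python
# Hypothetical sketch: derive max_seq_length from the model config instead of
# asking the user to guess it.
from pathlib import Path
from typing import Optional

from lit_parrot.config import Config


def resolve_max_seq_length(checkpoint_dir: Path, requested: Optional[int] = None) -> int:
    block_size = Config.from_name(checkpoint_dir.name).block_size
    if requested is None:
        return block_size
    # Never tokenize beyond what the model can attend to
    # (e.g. 2048 for falcon-7b-instruct), even if more was requested.
    return min(requested, block_size)
```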
Yes, definitely! Great idea, thanks @iskandr
Started a PR (https://github.com/Lightning-AI/lit-parrot/pull/127/), but I still have to test that it doesn't change or break anything.