Make max_seq_length an optional argument in prepare_alpaca and fine tuning scripts
This PR attempts to fix https://github.com/Lightning-AI/lit-parrot/issues/122 by making max_seq_length an optional parameter in scripts/prepare_alpaca.py, finetune/adapter.py, and finetune/adapter_v2.py.
Currently, all of these scripts use max_seq_length=256, which I suspect truncates inputs to the models. There is also an annoying manual dependency: changing this value in one script means having to make the others match. Instead, this PR uses the block_size of the checkpointed model as the default value for max_seq_length, which can still be overridden manually from the command line.
I also changed a few places where functions in prepare_alpaca.py had contradictory or confusing default values for arguments; instead, the defaults are now exposed as top-level global values that can also be overridden via the CLI.
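For illustration, here is a minimal sketch of the intended behaviour, using a stand-in `Config` class and a hypothetical helper (the real config loading lives in the lit-parrot scripts and may differ):

```python
# Sketch only (not the actual lit-parrot code): "default max_seq_length to the
# model's block_size, but allow an explicit override", as described above.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Config:  # stand-in for the checkpoint config the scripts load
    block_size: int = 4096


def resolve_max_seq_length(config: Config, max_seq_length: Optional[int] = None) -> int:
    """Use the model's block_size unless the user passed --max_seq_length."""
    return config.block_size if max_seq_length is None else max_seq_length


print(resolve_max_seq_length(Config()))       # 4096: the model's published context size
print(resolve_max_seq_length(Config(), 256))  # 256: explicit override, e.g. the old default
```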
@carmocca -- this is still a WIP since I'm just starting to learn how to use both Lightning and the lit-parrot codebase, but curious to get your feedback.
On June 9, 2023, @iskandr commented on this pull request:
```diff
@@ -169,7 +183,11 @@ def validate(fabric: L.Fabric, model: torch.nn.Module, val_data: np.ndarray, tok
     prompt = generate_prompt(sample)
     encoded = tokenizer.encode(prompt, device=model.device)
     output = generate(
-        model, idx=encoded, max_returned_tokens=len(encoded) + 100, max_seq_length=max_seq_length, temperature=0.8
+        model,
+        idx=encoded,
+        max_returned_tokens=len(encoded) + 100,
+        max_seq_length=max_seq_length,
```

Related question: if the fine-tuning scripts don't actually need max_seq_length, then it seems fine to change the config loading from `config = Config.from_name(name=checkpoint_dir.name, block_size=max_seq_length)` to `config = Config.from_name(name=checkpoint_dir.name)`, but I want to make sure I'm not misunderstanding how things work.
OK, ran the scripts on small locally generated data. Everything seems good.
As long as @carmocca can confirm that the fine-tuning scripts never really needed a max_seq_length parameter, I think I might be done?
@carmocca For TPU training, we need max_seq_length=256 so we do not run into any XLA recompilations during finetuning (as can be seen in #110). If we use block_size (4096), you could easily either run out of memory or significantly slow down finetuning. Maybe we revert this PR to some degree? What are your thoughts?
@gkroiz
What do you think if we add --max_seq_length to the fine-tuning scripts?
I think max_seq_length=block_size is what most people will expect wrt the published context sizes of the models, so TPU training would have to explicitly override with --max_seq_length=256 for data prep and fine-tuning.
Adding it as an optional argument would work @iskandr.
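For reference, a hedged sketch of how such an optional flag could be exposed; the scripts' actual CLI wiring may differ, and argparse is used here purely for illustration:

```python
# Illustrative only: an optional --max_seq_length flag with a "fall back to
# block_size" default, matching the proposal above.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--max_seq_length",
    type=int,
    default=None,  # None -> fall back to the checkpoint's block_size
    help="Truncate/pad samples to this length; defaults to the model's block_size.",
)

args = parser.parse_args(["--max_seq_length", "256"])  # what TPU users would pass explicitly
print(args.max_seq_length)  # 256; omitting the flag keeps the block_size default
```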
@gkroiz Why does block_size run into recompilations but 256 doesn't?
256 would use less memory, but it could limit learning depending on your data's length.
@carmocca Maybe my last message was a little confusing. Using block_size as a replacement for max_seq_length would not result in recompilation, but it uses a lot more memory. Now, instead of each training sample having size 256, each has size 4096, a 16x increase. From my initial tests, this 16x increase in data size does not fit in the memory of a TPU v4-8.
Also, I noticed that even though we set max_seq_length=block_size=4096, the actual maximum sequence length after preparing the data with prepare_alpaca.py is 1034. Having the option to set max_seq_length to 1034 (or even smaller, although that can limit learning) would help minimize the large memory usage.
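As a rough way to find that number for your own data, here is a sketch assuming prepare_alpaca.py wrote a list of dicts containing an `input_ids` tensor (adjust the path and key to match your data directory):

```python
# Sketch: find the longest tokenized sample in the prepared dataset so that
# --max_seq_length can be set to that value instead of the full block_size.
import torch

train_set = torch.load("data/alpaca/train.pt")
longest = max(len(sample["input_ids"]) for sample in train_set)
print(f"Longest prepared sample: {longest} tokens")  # ~1034 for alpaca, per the numbers above
```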
Thanks for the explanation.
For pretraining, one can decrease the micro_batch_size. The data is packed together in a sample so 4 batches of 10 should be approximately equal to 1 batch of 40.
For fine-tuning, let's add the configurable argument, and explain in its docstring how to tweak it properly: base it on the max length of your dataset to save memory.
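Something along these lines, with a hypothetical signature (only the docstring guidance matters here):

```python
# Hypothetical signature; the point is the docstring guidance, not the exact API.
def setup(data_dir: str = "data/alpaca", max_seq_length: int = 256) -> None:
    """Fine-tune on the prepared dataset.

    Args:
        data_dir: Directory containing the prepared train/test splits.
        max_seq_length: Samples are truncated or padded to this length. To save
            memory, set it to the longest sample in your prepared dataset rather
            than the model's full block_size.
    """
```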
I opened https://github.com/Lightning-AI/lit-parrot/pull/143, which does the above automatically by saving a config.json file in the data directory with the optimal max_seq_length.
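Roughly the idea, as a sketch; the actual file contents and keys are defined in #143, so treat the names here as assumptions:

```python
# Sketch of "record the optimal max_seq_length at data-prep time, read it back
# at fine-tuning time" instead of hard-coding 256 or block_size.
import json
from pathlib import Path

data_dir = Path("data/alpaca")

# During data preparation: remember the longest tokenized sample.
(data_dir / "config.json").write_text(json.dumps({"max_seq_length": 1034}))

# During fine-tuning: pick it up automatically.
max_seq_length = json.loads((data_dir / "config.json").read_text())["max_seq_length"]
print(max_seq_length)  # 1034
```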
Great, glad you're taking care of it @carmocca.