
`pretrain` vs `finetune_full`

fdalvi opened this issue 1 year ago • 3 comments

Hello,

I was wondering what the motivation is behind `pretrain` vs `finetune_full`; conceptually the two are quite similar, but at the moment there are some key (seemingly artificial) differences:

  • `pretrain` and `finetune_full` have different sets of required arguments (e.g. `max_tokens` vs. `epochs`; illustrated in the sketch below this list)
  • `activation_checkpointing` seems to be enabled only for `finetune_full` (https://github.com/Lightning-AI/litgpt/blob/f80fefff6a5d127cc99c86be2d172415b49359d2/litgpt/finetune/full.py#L101 vs https://github.com/Lightning-AI/litgpt/blob/f80fefff6a5d127cc99c86be2d172415b49359d2/litgpt/pretrain.py#L134)
  • The two use different data loaders
  • The two load models slightly differently, leading to some bugs (e.g. #1430)
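
To make the first point concrete, here is a minimal sketch of the budget-style difference, assuming `litgpt.args.TrainArgs` exposes both an `epochs` and a `max_tokens` field (check `litgpt/args.py` for the authoritative field names and defaults):

```python
# Minimal sketch, not the exact CLI surface: the pretrain path is budgeted in
# tokens, while the full-finetune path is budgeted in epochs.
from litgpt.args import TrainArgs

# Pretraining on raw text: the stopping criterion is a token budget.
pretrain_args = TrainArgs(max_tokens=3_000_000, micro_batch_size=4)

# Full finetuning on instruction-response pairs: the stopping criterion is a
# number of passes over the (much smaller) dataset.
finetune_args = TrainArgs(epochs=3, micro_batch_size=4)
```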

There seem to be other small differences as well as I walk through the code, so I just wanted to understand the motivation and see if there is a "correct" time to use one over the other.

Thanks!

fdalvi avatar Jul 19 '24 09:07 fdalvi

Hi there, these are good questions. Off the top of my head, the major usage difference is the dataset: the finetune_* scripts are mainly designed for instruction finetuning. (I originally wanted to name them accordingly, but I remember that this was not a popular opinion, and it was also a bit late in development, when we already had these names.)

So, in other words, the data format is a bit different, and with that also the scale: in the finetuning scripts the data is small enough to fit into memory, while pretrain is designed to handle much larger datasets (here, raw data that doesn't come in the instruction-response format).

I think differences like max_tokens vs. epochs originally come from the fact that the finetune_* scripts work with discrete training examples (instruction-response pairs). In regular pretraining, where we have raw text, it's easier to work with a maximum number of tokens (which is also common in the literature).
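
For illustration, a sketch of the two data shapes described above; the field names follow the common Alpaca-style convention, and litgpt's own data modules may differ slightly:

```python
# Instruction finetuning: a small, in-memory list of structured examples,
# so iterating by epoch is natural.
finetune_example = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "LitGPT provides pretraining and finetuning recipes ...",
    "output": "LitGPT offers ready-made training recipes for LLMs.",
}

# Pretraining: a (potentially huge) stream of raw text that is tokenized and
# packed into fixed-length blocks, so budgeting by max_tokens is natural.
pretrain_example = "LitGPT provides pretraining and finetuning recipes ..."
```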

I hope this helps as a start! :)

rasbt avatar Jul 19 '24 19:07 rasbt

Hi @rasbt,

Thanks for the quick reply, that makes a lot of sense! Given that the primary difference is the data, would it then be better to have a shared codebase for all model-related things (e.g. model loading, the training loop, sharding strategies, etc.)?
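
As a purely hypothetical sketch of the kind of shared helper this could mean (this function does not exist in litgpt; only `FSDPStrategy` and `litgpt.model.Block` are real):

```python
# Hypothetical shared helper: both pretrain.py and finetune/full.py could build
# their Fabric strategy through one function instead of duplicating the setup.
from lightning.fabric.strategies import FSDPStrategy

from litgpt.model import Block


def make_strategy(devices: int, activation_checkpointing: bool = False):
    """Return a strategy usable by both the pretrain and finetune entry points."""
    if devices <= 1:
        return "auto"
    return FSDPStrategy(
        auto_wrap_policy={Block},
        # Currently only the finetune script enables this; a shared helper
        # would expose it as an option instead of hard-coding the difference.
        activation_checkpointing_policy={Block} if activation_checkpointing else None,
        state_dict_type="full",
    )
```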

Best,

fdalvi avatar Jul 21 '24 07:07 fdalvi

That's a fair point, but this repo follows the philosophy that some code duplication isn't bad if it helps readability. Too much refactoring and code sharing can add a lot of complexity when you want to read and modify the code; i.e., the code should remain simple enough that you can tweak certain things for custom research projects. Of course, there is never a clear line to draw ...

Anyways, thanks for sharing your feedback here!

rasbt avatar Jul 21 '24 12:07 rasbt

Closing to clean up the issues a bit. But please feel free to respond or reopen in case you have additional questions.

rasbt avatar Jul 25 '24 17:07 rasbt

I appreciate your perspective. If it's okay, I'll open a PR sometime soon that brings the two codebases closer together where applicable (FSDP settings, model loading, perhaps a few more things); we can discuss which of these changes are worth merging, of course!

fdalvi avatar Jul 28 '24 06:07 fdalvi