Domain-specific fine-tuning
Hi team,
I wanted to fine-tune the full parameters of the Gemma model. I noticed there is an example (https://github.com/Lightning-AI/litgpt/blob/main/litgpt/finetune/full.py). Can I use this example for domain-specific fine-tuning?
I have prepared the dataset: I converted each page of the PDFs into one row of a CSV file, with an end-of-sentence <eos> delimiter.
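For reference, a rough sketch of this kind of conversion (assuming pypdf; the folder and file names below are placeholders):

```python
# Sketch of the PDF-to-CSV conversion described above (assumes `pip install pypdf`;
# "pdfs/" and "pages.csv" are placeholder names).
import csv
from pathlib import Path
from pypdf import PdfReader

with open("pages.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    for pdf_path in sorted(Path("pdfs").glob("*.pdf")):
        for page in PdfReader(pdf_path).pages:
            text = page.extract_text() or ""
            if text.strip():
                # one PDF page per CSV row; the <eos> delimiter can be appended
                # here or added later at tokenization time
                writer.writerow([text])
```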
- Additionally, I wanted to check whether GaLore has been implemented in Lightning.
@carmocca
> I wanted to fine-tune the full parameters of the Gemma model. I noticed there is an example (https://github.com/Lightning-AI/litgpt/blob/main/litgpt/finetune/full.py). Can I use this example for domain-specific fine-tuning?
Yes, this model is compatible with finetuning. You could start with one of our config files and then just swap out the dataset. E.g.,
```bash
litgpt finetune \
  --config https://raw.githubusercontent.com/Lightning-AI/litgpt/main/config_hub/finetune/gemma-2b/full.yaml \
  --data ...
```
or
```bash
litgpt finetune \
  --config <your updated config file>
```
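If you go the config-file route, the data: section of that updated config could point at your prepared dataset. A rough sketch is below; it assumes litgpt's JSON data module and a placeholder path, so please double-check the exact fields against the litgpt data docs for your installed version.

```yaml
# Sketch of a possible data section for the updated config (placeholder path);
# verify the available fields against the litgpt data documentation.
data:
  class_path: litgpt.data.JSON
  init_args:
    json_path: data/my_domain_dataset.json
    val_split_fraction: 0.1
```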
> Additionally, I wanted to check whether GaLore has been implemented in Lightning.
No, I don't think there's been super high interest in that, but there is an open issue for it if you want to contribute: #1075. We'd definitely appreciate it! I think it would require some discussion of what the user interface should look like via the config file, etc., but it should be a relatively straightforward update if you just want to implement it in a local copy of litgpt for now.
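If you do experiment locally, here is a minimal sketch of how the optimizer swap might look, assuming the galore-torch package (pip install galore-torch); the group hyperparameters follow the GaLore README and are not part of litgpt's API:

```python
# Sketch only: swapping AdamW for GaLore's optimizer in a local litgpt copy.
# Assumes `pip install galore-torch`; hyperparameters follow the GaLore README.
import torch.nn as nn
from galore_torch import GaLoreAdamW


def configure_galore_optimizer(model: nn.Module, lr: float = 2e-5, weight_decay: float = 0.0):
    # Put the 2-D weights of linear layers into the low-rank (GaLore) group;
    # all remaining parameters are optimized as usual.
    galore_params = [m.weight for m in model.modules() if isinstance(m, nn.Linear)]
    galore_ids = {id(p) for p in galore_params}
    regular_params = [p for p in model.parameters() if id(p) not in galore_ids]
    param_groups = [
        {"params": regular_params},
        {"params": galore_params, "rank": 128, "update_proj_gap": 200, "scale": 0.25, "proj_type": "std"},
    ]
    return GaLoreAdamW(param_groups, lr=lr, weight_decay=weight_decay)
```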
Do I have to maintain any particular dataset structure or format for domain-based fine-tuning? In instruction-based fine-tuning I maintain a dataset format like system prompt + instruction + context + question.
I just wanted to check whether I need to maintain a dataset structure, because I have 480 PDF files and each PDF file consists of around 250 pages.
Do I need to use any delimiter? If yes, can you please provide a sample?
@rasbt @carmocca Regarding contributions: yes, I can do it. Once I complete this, I can start on that.
Sorry for the late response, but it's been a busy week. Regarding the domain-specific finetuning: it would be similar to continued pretraining without instructions in your case, correct? Regarding the delimiter, I'd say this is best handled by appending the eos token via the respective tokenizer when you read and tokenize the dataset, e.g., tokenizer.encode(text, bos=False, eos=True).
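A minimal sketch of that, not an official litgpt recipe (the CSV and checkpoint paths are placeholders):

```python
# Sketch: read the per-page CSV and tokenize each row with an eos delimiter.
# "pages.csv" and the checkpoint directory are placeholders.
import csv
from pathlib import Path

import torch
from litgpt.tokenizer import Tokenizer

tokenizer = Tokenizer(Path("checkpoints/google/gemma-2b"))
chunks = []
with open("pages.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        # eos=True appends the end-of-sequence token as the page delimiter
        chunks.append(tokenizer.encode(row[0], bos=False, eos=True))

# Concatenate into one token stream; this can then be split into fixed-length
# blocks for continued-pretraining-style finetuning.
ids = torch.cat(chunks)
```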
@awaelchli has a nice working example in the Continued Pretraining with TinyLlama 1.1B Studio that might come in handy as a template.
PS: Regarding GaLore, I started a PR here: #1192