
[question] how to finetune efficiently

Open nevermet opened this issue 2 years ago • 4 comments

Dear all,

Finetuning on my data takes very long (more than 24 hours). How can I shorten the time?

Can I know the estimated end time?

Can I stop the finetuning at some point and resume finetuning from the point that I stopped at?

Thanks in advance.

nevermet avatar Oct 28 '23 10:10 nevermet

Finetuning on my data takes very long (more than 24 hours). How can I shorten the time?

If you are not doing it already, you could try bf16-true precision, use a smaller model, use more GPUs (if available), or run fewer iterations. You might like the resource tables here :).
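For example, a shortened, lower-precision run might look roughly like the following. Treat the `--precision` and `--train.max_steps` flag names as assumptions; they vary across litgpt versions, so check `litgpt finetune --help` first:

```
litgpt finetune ... \
  --precision bf16-true \
  --train.max_steps 1000
```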

Can I know the estimated end time?

Personally, what I usually do is a short test run (e.g., 1000 iterations) and then extrapolate to the maximum number of iterations.
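That extrapolation is just linear scaling. As a back-of-the-envelope sketch (the numbers below are made up for illustration):

```python
def estimate_total_hours(test_iters, test_seconds, max_iters):
    """Extrapolate total wall-clock time from a short test run."""
    seconds_per_iter = test_seconds / test_iters
    return seconds_per_iter * max_iters / 3600

# e.g. if 1000 test iterations took 20 minutes and the full run is 50,000 iters:
print(f"{estimate_total_hours(1000, 20 * 60, 50_000):.1f} hours")  # -> 16.7 hours
```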

Can I stop the finetuning at some point and resume finetuning from the point that I stopped at?

Yes you can :). You just need to copy the .json config files from the checkpoint_dir folder into the folder that contains your finetuned checkpoint file (and rename that finetuned checkpoint file). Then you can use the finetuned target folder as the --checkpoint_dir input. If you use LoRA, you need to merge the weights first. There's an explanation in the tutorials here that might help.
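As a sketch of that copy/rename step (in Python for portability; the directory layout and the `lit_config.json`/`lit_model.pth` file names here are hypothetical stand-ins that may differ in your litgpt version):

```python
import shutil
from pathlib import Path

ckpt_dir = Path("checkpoints/base-model")  # original checkpoint_dir (stand-in path)
out_dir = Path("out/finetuned/final")      # folder with the finetuned weights

# Dummy files so this sketch is self-contained; in practice these
# already exist from the download and finetuning steps.
ckpt_dir.mkdir(parents=True, exist_ok=True)
out_dir.mkdir(parents=True, exist_ok=True)
(ckpt_dir / "lit_config.json").touch()
(out_dir / "lit_model_finetuned.pth").touch()

# 1) Copy the .json config files from checkpoint_dir next to the finetuned weights.
for json_file in ckpt_dir.glob("*.json"):
    shutil.copy(json_file, out_dir)

# 2) Rename the finetuned checkpoint to the name the loader expects.
(out_dir / "lit_model_finetuned.pth").rename(out_dir / "lit_model.pth")

# out_dir can now be passed as --checkpoint_dir.
```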

rasbt avatar Oct 28 '23 14:10 rasbt

Hi, I was wondering if I should also copy the lit-converted model into the folder that contains my finetuned checkpoint file before resuming finetuning? It seems like if I don't, it throws an error saying that no model exists.

altria-zewei-wang avatar Apr 24 '24 16:04 altria-zewei-wang

Could you share the commands you ran? That would make it a bit easier to discuss.

But in general, I think you could do the following without moving any files:

Finetune model:

litgpt finetune ... --out_dir out/my_model

Finetune finetuned model:

litgpt finetune ... --checkpoint_dir out/my_model/final

rasbt avatar Apr 25 '24 17:04 rasbt

Actually, what is described above is not really resuming: training restarts from scratch from a saved checkpoint, which means there is a new warmup phase, the step count restarts from zero, and so on. True resuming restarts from the last saved state, with everything (optimizer state, learning-rate schedule, step count) restored to exactly where it was.

lancioni avatar Jun 02 '24 21:06 lancioni