[question] how to finetune efficiently
Dear all,
Finetuning on my data takes very long (more than 24 hours). How can I shorten the time?
Can I know the estimated end time?
Can I stop the finetuning at some point and resume finetuning from the point that I stopped at?
Thanks in advance.
Finetuning on my data takes very long (more than 24 hours). How can I shorten the time?
If you are not doing it already, you could try bf16-true precision, use a smaller model, use more GPUs (if available), or run fewer iterations. You might like the resource tables here :).
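For instance, a sketch of how those options look on the command line (the `...` stands for your model and data arguments, and flag availability can vary between litgpt versions, so check `litgpt finetune --help`):

```shell
# A sketch, not a definitive invocation: `...` stands for your model and data
# arguments. bf16-true roughly halves memory versus 32-bit and is usually
# faster; --devices spreads training over multiple GPUs if you have them.
litgpt finetune ... --precision bf16-true --devices 4
```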
Can I know the estimated end time?
Personally, what I usually do is a test run on 1000 iters and then extrapolate to the max iterations.
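The extrapolation itself is just a proportion; as a sketch (the timing numbers below are made up for illustration):

```python
# Estimate total runtime by extrapolating from a short test run,
# assuming time scales linearly with the number of iterations.
def estimate_total_hours(test_iters, test_minutes, max_iters):
    minutes_per_iter = test_minutes / test_iters
    return minutes_per_iter * max_iters / 60

# Example: a 1000-iteration test run took 25 minutes; the full run is 50,000 iterations.
print(estimate_total_hours(1000, 25, 50_000))  # -> ~20.8 hours
```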
Can I stop the finetuning at some point and resume finetuning from the point that I stopped at?
Yes, you can :). For that, you just need to copy the .json config files from the checkpoint_dir folder to the folder that contains your finetuned checkpoint file (and rename that finetuned checkpoint file accordingly). Then you can use that finetuned target folder as the --checkpoint_dir input. If you use LoRA, you need to merge the weights first; there's an explanation in the tutorials here that might help.
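Concretely, the steps might look like this. This is a sketch: the paths are hypothetical, and you should check `litgpt merge_lora --help` for the exact arguments in your litgpt version.

```shell
# Sketch with hypothetical paths -- adjust to your own directories.

# 1. If you finetuned with LoRA, merge the adapter into the base weights first.
litgpt merge_lora --checkpoint_dir out/my_model/final

# 2. Copy the .json config files from the original checkpoint_dir next to the
#    finetuned checkpoint, which should be named lit_model.pth.
cp checkpoints/base-model/*.json out/my_model/final/

# 3. Point the next run at that folder.
litgpt finetune ... --checkpoint_dir out/my_model/final
```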
Hi, I was wondering if I should also copy the lit-converted model into the folder that contains my finetuned checkpoint file when resuming finetuning? It seems that if I don't, it throws an error saying that no model exists.
Could you share the commands you ran? That would make it a bit easier to discuss.
But in general, I think you could do the following without moving files around:

Finetune the model:

```shell
litgpt finetune ... --out_dir out/my_model
```

Then finetune the finetuned model:

```shell
litgpt finetune ... --checkpoint_dir out/my_model/final
```
Actually, what is described above is not really resuming: training restarts from scratch from a saved checkpoint, which means there is a new warmup phase, the step count restarts from zero, and so on. True resuming means restarting from the last saved state, with the step count, learning-rate schedule, and optimizer state all restored exactly as they were.
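To make the distinction concrete, here is a minimal, framework-free sketch of what true resuming restores beyond the weights. In practice this would be the optimizer and LR-scheduler state dicts (e.g. in PyTorch); everything below is illustrative.

```python
import json

def save_state(path, step, lr, optimizer_momentum):
    """Persist everything needed to continue exactly where training left off."""
    with open(path, "w") as f:
        json.dump({"step": step, "lr": lr, "momentum": optimizer_momentum}, f)

def load_state(path):
    with open(path) as f:
        return json.load(f)

# Restarting "from scratch from a checkpoint" keeps only the weights: the step
# counter and learning-rate schedule begin again (new warmup). True resuming
# also restores the step count, current LR, and optimizer statistics:
save_state("ckpt.json", step=4200, lr=1e-4, optimizer_momentum=[0.9, 0.1])
state = load_state("ckpt.json")
print(state["step"])  # -> 4200: training continues from iteration 4200
```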