LLaMA-LoRA-Tuner
Refresh instead of timing out
My current training run takes 35 hours, so it will time out unless we refresh or increase the timeout substantially.

I'm thinking of not relying on Gradio's loading mechanism for the training process; I don't think it's suitable for tasks that last minutes or hours. It can't show progress on multiple devices, and there's no way to hook back into a training run once the page is closed or disconnected - you have to fall back to the terminal to monitor the progress or abort it.
Instead, we can run the training in a background subprocess and let the UI poll its status, which would let us view and control progress from multiple devices. We'd have to craft a loading UI and block other features, such as inference, during fine-tuning, though.
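The polling idea above could be sketched roughly like this. Everything here is hypothetical (the status file path, the JSON schema, the stand-in trainer script): the real trainer would be the fine-tuning script, and `poll_status` would be called by the Gradio UI on a timer, so any device that loads the page sees the same progress.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical status file shared between the trainer subprocess and the UI.
status_path = Path(tempfile.mkdtemp()) / "train_status.json"

# Stand-in for the real fine-tuning script: it periodically writes its
# progress so any UI session (on any device) can poll it.
trainer_code = f"""
import json, pathlib, time
status = pathlib.Path({str(status_path)!r})
for step in range(1, 4):
    status.write_text(json.dumps({{"step": step, "total": 3, "state": "training"}}))
    time.sleep(0.05)
status.write_text(json.dumps({{"step": 3, "total": 3, "state": "done"}}))
"""

# Launch the trainer in the background; the parent (UI) process is free.
proc = subprocess.Popen([sys.executable, "-c", trainer_code])

def poll_status() -> dict:
    """What the UI would call on a timer to render progress."""
    if not status_path.exists():
        return {"state": "starting"}
    return json.loads(status_path.read_text())

proc.wait()  # in the real UI we would poll instead of blocking
print(poll_status())
```

Since the status lives in a file (or it could be a small database) rather than in the Gradio session, closing the page or disconnecting doesn't lose the run, and reconnecting from another device just resumes polling.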
Another thing I want to do is add CLI support, so I can run long fine-tuning jobs on SkyPilot's managed spot instances, or terminate the machine automatically after fine-tuning ends to save cost.
Nice, let me know how I can help!
Update: this has now been merged into main.
I just implemented it on the dev-2 branch. It's now possible to track training progress from multiple devices, even phones. Please feel free to give it a try and see if there are any issues.
I'll merge it back to main after testing on Colab (no free resources available right now).
The current known issue is that some steps, such as loading the base model or mapping the training dataset, can't be aborted immediately by clicking the abort button in the UI - you have to wait for the current step to finish before the run is actually aborted.
