LLaMA-LoRA-Tuner
Refresh instead of timing out
My current training run takes 35 hours, so it will time out unless we refresh or increase the timeout substantially.

I'm thinking of not relying on Gradio's loading mechanism for the training process; I don't think it's suitable for tasks that last minutes or hours. It can't show progress on multiple devices, and there's no way to hook back into a training run once the page is closed or disconnected - you have to fall back to the terminal to monitor the progress or abort it.
Instead, we can run the training in a background subprocess and let the UI poll its status, which would let us view and control progress from multiple devices. We'd have to craft a loading UI and block other features, such as inference, during fine-tuning, though.
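The polling idea above could be sketched roughly like this. Everything here is hypothetical (the status file path, the JSON schema, the stand-in trainer script): the real trainer would be the fine-tuning script, and `poll_status` would be called by the Gradio UI on a timer, so any device that loads the page sees the same progress.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical status file shared between the trainer subprocess and the UI.
status_path = Path(tempfile.mkdtemp()) / "train_status.json"

# Stand-in for the real fine-tuning script: it periodically writes its
# progress so any UI session (on any device) can poll it.
trainer_code = f"""
import json, pathlib, time
status = pathlib.Path({str(status_path)!r})
for step in range(1, 4):
    status.write_text(json.dumps({{"step": step, "total": 3, "state": "training"}}))
    time.sleep(0.05)
status.write_text(json.dumps({{"step": 3, "total": 3, "state": "done"}}))
"""

# Launch the trainer in the background; the parent (UI) process is free.
proc = subprocess.Popen([sys.executable, "-c", trainer_code])

def poll_status() -> dict:
    """What the UI would call on a timer to render progress."""
    if not status_path.exists():
        return {"state": "starting"}
    return json.loads(status_path.read_text())

proc.wait()  # in the real UI we would poll instead of blocking
print(poll_status())
```

Since the status lives in a file (or it could be a small database) rather than in the Gradio session, closing the page or disconnecting doesn't lose the run, and reconnecting from another device just resumes polling.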
Another thing I want to do is add CLI support, so I can run long fine-tuning jobs on SkyPilot's managed spot instances, or terminate the machine automatically after fine-tuning ends to save cost.
Nice, let me know how I can help!
Update: this has now been merged into main.
I just implemented it on the dev-2 branch. It's now possible to track training progress from multiple devices, even phones. Please feel free to give it a try and see if there are any issues.
I'll merge it back to main after testing on Colab (no free resources available right now).
The current known issue is that some steps, such as loading the base model or mapping the training dataset, can't be aborted immediately by clicking the abort button in the UI - you have to wait for the current step to finish before the run is actually aborted.
