Training takes > 40 minutes to start

Open mcDandy opened this issue 3 months ago • 8 comments

This is for bugs only

Did you already ask in the discord?

Yes

You verified that this is a bug and not a feature request or question by asking in the discord?

Yes

Describe the bug

Yesterday I was able to train. I tried Flux since I can easily train it with OneTrainer on 12GB. There is nothing like block swapping. I tried low-VRAM mode with no results. It got stuck on generating images. I tried stopping it; it did not stop. I tried deleting it; nothing. I force-stopped the training process by closing the console, relaunched the GUI, and deleted the job.

One sleep (PC in sleep mode) later, I wanted to get a feel for the program by using a smaller network that I am sure fits in VRAM in its entirety. I created a job with default settings, the SD1.5 model, a few random images, and the default caption... Error in the first second. The training console was completely empty.

The only thing showing that anything even happened is "[UI] Job 88238482-9bf3-473c-8363-ebdf68136116 exited with code 0 after 0.024 seconds." and "Error launching job:" in the info on the jobs view. There are no error messages anywhere.

mcDandy avatar Aug 27 '25 13:08 mcDandy

It took something like an hour, but after stopping the frontend, pulling the latest commit, starting the frontend, and forgetting about the backend running, the job started. Still a bug: there is no indication why it does not start after more than 40 minutes.

It is not because of anything I did or did not do yesterday. It just sometimes does not start.

mcDandy avatar Aug 27 '25 14:08 mcDandy

I've run into something similar. I can train a job with everything working fine. I shut down my machine, start it up the next day (with no updates at all), launch the application, clone the same job that ran yesterday, and try to run it again, and the job never starts.

Normally the shell window on the job overview page prints "starting 1 job" or something like that within a few seconds of starting the job. But any time this slow startup happens, that message doesn't get displayed. Neither the UI nor the console shows any errors or messages about any problems.

Nothing changed in the OS (Linux), the application source hasn't been updated, the job is the same, the dataset is the same... but it won't start. And the lack of any messages makes it impossible to debug. Is there a debug mode we can start the application in that will spam the console with what's going on?

q5sys avatar Oct 25 '25 17:10 q5sys

Having the same issue. Have you found a solution??

yovsac avatar Oct 30 '25 15:10 yovsac

For me, the issue was that it silently failed to find the venv. I had created a venv for it at ~/ai/envs/ai-toolkit, and it was looking for a venv at ~/ai/src/ai-toolkit/.venv or ~/ai/src/ai-toolkit/venv.

After symlinking the venv into the src clone directory and restarting the job, it worked. Not sure if anyone else is running into this, but the worker was crashing silently immediately after starting training and the UI gave no feedback.
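
In case it helps anyone check for the same thing, here is a minimal Python sketch of that workaround. It only looks for the two directory names mentioned above (.venv and venv) inside the clone and, if neither exists, symlinks an external venv in as .venv. Whether ai-toolkit checks exactly these two locations, and nothing else, is an assumption based on what I saw; the function names find_venv and link_external_venv are made up for this example and are not part of ai-toolkit.

```python
from pathlib import Path
from typing import Optional

# Directory names taken from the comment above. That the worker looks for
# exactly these two, in this order, is an assumption.
CANDIDATE_NAMES = (".venv", "venv")


def find_venv(repo_dir: Path) -> Optional[Path]:
    """Return the first candidate venv directory found in repo_dir, else None."""
    for name in CANDIDATE_NAMES:
        candidate = repo_dir / name
        # is_dir() follows symlinks, so a symlinked venv counts as present.
        if candidate.is_dir():
            return candidate
    return None


def link_external_venv(repo_dir: Path, external_venv: Path) -> Path:
    """Symlink external_venv into repo_dir as .venv unless a venv already exists."""
    existing = find_venv(repo_dir)
    if existing is not None:
        return existing
    if not external_venv.is_dir():
        raise FileNotFoundError(f"no venv directory at {external_venv}")
    link = repo_dir / ".venv"
    link.symlink_to(external_venv.resolve(), target_is_directory=True)
    return link
```

With my layout, the call would be link_external_venv(Path("~/ai/src/ai-toolkit").expanduser(), Path("~/ai/envs/ai-toolkit").expanduser()). If find_venv returns None for your clone, a missing venv is at least a plausible cause of the silent exit.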

ExoticArts avatar Nov 09 '25 09:11 ExoticArts

I think the real bug is that there are no obvious logs to help figure out why training takes forever to start.

akorchemniy avatar Nov 30 '25 14:11 akorchemniy

I tried multiple instances on vast.ai and training never seems to start. It's stuck on 'Starting job...'

Denarius40 avatar Dec 02 '25 21:12 Denarius40