Training speed significantly decreases after completing a stage in AI toolkit trainer

Open yuanxiao9889 opened this issue 2 weeks ago • 4 comments

Body: The AI toolkit trainer runs very fast at first. However, once one stage finishes training and the next stage starts, training becomes extremely slow. I have to restart the trainer to restore the speed. Labels: performance-issue, training-speed, stage-switch, restart-required

yuanxiao9889 avatar Dec 05 '25 14:12 yuanxiao9889

#504

meknidirta avatar Dec 05 '25 15:12 meknidirta

same

elen07zz avatar Dec 07 '25 22:12 elen07zz

Check your VRAM. There is some kind of bug that leaves something in shared memory after a checkpoint is saved. After the save, it overloads your VRAM (even with 10-20GB of VRAM free) and keeps using shared memory. That is what causes the slowdown. It should release that memory properly.
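If you want to check this on your own machine, here is a minimal sketch, assuming the trainer runs on PyTorch with a CUDA device. The helper names (`cuda_memory_snapshot`, `release_cached_memory`, `describe`) are made up for illustration and are not part of AI toolkit. Note that these counters only show PyTorch's view of the GPU; the "shared GPU memory" figure itself is reported by the OS (for example, Task Manager on Windows), so treat this as a rough diagnostic rather than a fix.

```python
import gc

try:
    import torch
except ImportError:  # lets the helpers load on machines without PyTorch
    torch = None


def cuda_memory_snapshot(device=0):
    """Return PyTorch's memory counters (in bytes) for one CUDA device.

    Returns None when PyTorch or CUDA is not available.
    """
    if torch is None or not torch.cuda.is_available():
        return None
    free, total = torch.cuda.mem_get_info(device)
    return {
        "allocated": torch.cuda.memory_allocated(device),  # held by live tensors
        "reserved": torch.cuda.memory_reserved(device),    # held by the caching allocator
        "free": free,                                       # free according to the driver
        "total": total,
    }


def release_cached_memory():
    """Drop unreachable Python objects, then return cached CUDA blocks to the driver."""
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()


def describe(snapshot):
    """Format a snapshot as a human-readable line in GiB."""
    if snapshot is None:
        return "CUDA not available"
    gib = 1024 ** 3
    return ", ".join(f"{name}={value / gib:.2f} GiB" for name, value in snapshot.items())


if __name__ == "__main__":
    # Hypothetical usage: take one reading before a checkpoint save and one after.
    print("before:", describe(cuda_memory_snapshot()))
    release_cached_memory()
    print("after release:", describe(cuda_memory_snapshot()))
```

If `reserved` stays much higher after a checkpoint save than it was before, and `free` stays near zero, that would be consistent with the behaviour described above. It would not prove where the leak is.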

SarahPeterson2854 avatar Dec 08 '25 11:12 SarahPeterson2854

May I ask how to turn off the shared memory function?

blakejohnaldo avatar Dec 09 '25 14:12 blakejohnaldo