
[Bug]: Memory leak & OOM crash when continuing full-layer LoRA with attention-only setting

Open madrooky opened this issue 7 months ago • 5 comments

What happened?

It's a bit my own fault, but it's also an edge case that could be handled more gracefully.

I'm experimenting with a base LoRA that is supposed to serve as a starting point for other things. This LoRA is trained on all layers. If I then try to use this LoRA as a base, set the layers to attention only, and start training, system memory fills up indefinitely until the screens turn black and eventually come back once VSC, in which I run the terminal, shuts down^^ I have 32 GB of system memory, a 24 GB fixed pagefile on the system SSD, and another dynamic pagefile on a second SSD. Usually about 77 GB is reserved in total. When I make this mistake, the system commits more and more memory, easily exceeding 100 GB, until it hits a barrier and the whole thing crashes as described above.

Otherwise, with the same layer setting, training works just fine: the system commits only about 40 GB and uses the 16 GB of VRAM I have available, according to the settings that determine VRAM usage.

What did you expect would happen?

It would be nice if OT were able to detect the incompatible base LoRA and exit with an error that offers some information, instead of going into a memory frenzy and giving me the impression my system is breaking down... :D
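A guard like the one requested could be sketched as a pre-flight check on the checkpoint's tensor names before any weights are loaded. This is a minimal sketch only: the function names and the idea that the layer filter is available as a list of key substrings are assumptions for illustration, not OneTrainer's actual API.

```python
# Hypothetical pre-flight check, sketched against a plain list of tensor
# names. `layer_filter` as a list of substrings and both function names
# are illustrative assumptions, not OneTrainer's actual API.

def find_unmatched_lora_keys(state_dict_keys, layer_filter):
    """Return checkpoint tensor names that fall outside the layer filter."""
    return [
        key for key in state_dict_keys
        if not any(pattern in key for pattern in layer_filter)
    ]

def check_base_lora_compatibility(state_dict_keys, layer_filter):
    """Fail fast with a readable error instead of training on a mismatch."""
    unmatched = find_unmatched_lora_keys(state_dict_keys, layer_filter)
    if unmatched:
        raise ValueError(
            f"Base LoRA contains {len(unmatched)} tensor(s) outside the "
            f"configured layer filter {layer_filter}, e.g. {unmatched[0]!r}. "
            "Use a matching layer setting or a compatible base LoRA."
        )
```

In practice a check like this would run right after the config is parsed and the base LoRA's key list is read, so the run aborts with one clear message rather than committing memory indefinitely.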

Relevant log output


Generate and upload debug_report.log

No response

madrooky avatar Jul 14 '25 17:07 madrooky

Without a config.json file (ctrl+f and replace your username), a debug_report.log, and the name/link to the exact supposedly broken LoRA, Nero would be guessing. Please make your bug report useful to him so he can actually solve this problem instead of making him do a bunch of guesswork.

If you don't know how to provide the first two, let me know.

O-J1 avatar Jul 14 '25 17:07 O-J1

Well, there is the setting. I cannot provide a debug log, as it doesn't exist. I am not sure I want to abort my currently running training to produce a valid log; sorry, I did not copy the console log while it was available, and I don't have debug mode enabled. The LoRAs I am using are not uploaded anywhere and won't be in their current state. They are not broken, by the way; that must have been a misunderstanding. Training a full-layer LoRA for 1 step and then trying to continue it with attention only should yield the same result, as the content doesn't matter but the data structure will be the same, which should suffice for reproduction.

2025-07-14_18-42-06.json

madrooky avatar Jul 14 '25 17:07 madrooky

Do it after the run has finished/tomorrow. Double-click export_debug.bat in the OneTrainer root folder.

O-J1 avatar Jul 14 '25 17:07 O-J1

Following up on this

O-J1 avatar Jul 17 '25 13:07 O-J1

debug_report.log

madrooky avatar Jul 22 '25 20:07 madrooky

Modified to a feature request; not a bug in my eyes. The user selected settings larger than the resources they had available, so memory overflowed into the pagefile and caused intense system lag: the CPU's IMC gets overloaded, and the run is bottlenecked by NAND, which is orders of magnitude slower than RAM, let alone VRAM.

O-J1 avatar Dec 01 '25 16:12 O-J1