[Bug]: Memory leak & OOM crash when continuing full-layer LoRA with attention-only setting
What happened?
It's partly my own fault, but it's also an edge case that could be handled more gracefully.
I'm experimenting with a base LoRA that is supposed to serve as a starting point for other things. This LoRA is trained on all layers. However, if I use this LoRA as the base, set the layers to attention-only, and start training, system memory fills up indefinitely until the screens go black; they eventually come back once VS Code (in which I run the terminal) shuts down^^ I have 32 GB of system memory, a 24 GB fixed pagefile on the system SSD, and another dynamic pagefile on a second SSD. Usually about 77 GB is reserved in total. When I make this mistake, the system commits more and more memory, easily exceeding 100 GB, until it hits a barrier and the whole thing crashes as described above.
Otherwise, with the same layer setting, training works just fine: the system commits only about 40 GB and uses the 16 GB of VRAM I have available, according to the settings that determine VRAM usage.
What did you expect would happen?
It would be nice if OT could detect the incompatible base LoRA and exit with an error that offers some information, instead of going into a memory frenzy and giving me the impression my system is breaking down... :D
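The requested behavior could be implemented as a pre-flight check: compare the module names in the saved LoRA's state dict against the modules the current layer preset would train, and fail with a clear error before loading. The sketch below is purely illustrative; the function name, key layout, and the `"attn"` substring filter are assumptions, not OneTrainer's actual API.

```python
# Hypothetical pre-flight check (illustrative, not OneTrainer's real code):
# detect tensors in a base LoRA that fall outside the configured layer preset.

def find_unexpected_lora_modules(state_dict_keys, allowed_substrings):
    """Return keys targeting modules outside the configured layer preset.

    state_dict_keys: iterable of tensor names from the saved LoRA file.
    allowed_substrings: substrings identifying trainable modules, e.g.
        ("attn",) for an attention-only preset (naming is an assumption).
    """
    return sorted(
        key for key in state_dict_keys
        if not any(sub in key for sub in allowed_substrings)
    )

# Example key names are made up to mimic a diffusion-model LoRA layout.
keys = [
    "unet.down_blocks.0.attn.to_q.lora_down.weight",
    "unet.down_blocks.0.ff.net.0.lora_down.weight",  # feed-forward: full-layer only
]
unexpected = find_unexpected_lora_modules(keys, ("attn",))
if unexpected:
    # A real implementation would raise a descriptive error here
    # instead of proceeding and exhausting memory.
    print("Base LoRA has modules outside the current preset:", unexpected)
```

The key point is that the check only needs tensor names, not tensor data, so it can run before any weights are materialized in RAM.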
Relevant log output
Generate and upload debug_report.log
No response
Without a config.json file (ctrl+F to replace your username), a debug_report.log, and the name/link to the exact supposedly broken LoRA, Nero would be guessing. Please make your bug report useful to him so he can actually solve this problem instead of doing a bunch of guesswork.
If you don't know how to provide the first two, let me know.
Well, there is the setting. I cannot provide a debug log, as it doesn't exist. I'm not sure I want to abort my currently running training to produce a valid log; sorry, I didn't copy the console log while it was available, and I don't have debug mode enabled. The LoRAs I'm using aren't uploaded anywhere and won't be in their current state. They aren't broken, by the way; that must have been a misunderstanding. Training a full-layer LoRA for one step and then trying to continue it with attention-only should yield the same result: the content doesn't matter, but the data structure will be the same and should suffice for reproduction.
Do it after the run has finished/tomorrow. Double-click export_debug.bat in the OneTrainer root folder.
Following up on this
Modified to a feature request; not a bug in my eyes. The user selected settings larger than the resources they had available, so memory overflowed into the pagefile and caused intense system lag: you are overloading the IMC on your CPU and are bottlenecked by NAND, which is orders of magnitude slower than RAM, let alone VRAM.