[Bug]: Insane memory (system RAM) consumption when storing backups

Open gilga2024 opened this issue 9 months ago • 2 comments

What happened?

Not really a bug in the sense of "not working", but insane resource consumption.

When doing SDXL fine-tune training with EMA enabled and EMA running on the CPU, the process takes about 18-19 GB of system RAM (CPU RAM, not VRAM). Independent of the selected optimizer and other settings, this goes up by an insane amount when storing backups during training. I imposed a memory limit of 28 GB on the process (to avoid core dumps due to OOM on system RAM), and the process still crashes with an out-of-memory error. Hence, saving the backup takes at least about 10 GB of additional system RAM. Given that the whole model, when saved, is less than 8 GB, this sounds insane.

When EMA is completely disabled, system memory consumption with the very same settings is 15-17 GB and goes up "only" by about 6-7 GB when storing the backup (I saw a maximum of 24 GB of system RAM consumed). My guess is that all data written during the backup process is first copied into RAM and only then written to disk. I am also guessing that this amount at least doubles when EMA is enabled.

The key point is: Given that all data needed should already be in RAM/VRAM at that point in time, it should be possible to just "stream" it to disk with nearly no additional system memory being consumed during the backup/save process.
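To illustrate what I mean by "streaming", here is a minimal sketch. It is not OneTrainer's actual save code: plain `bytearray` objects stand in for model weights, and `save_buffered`, `save_streamed`, and `peak_python_alloc` are hypothetical names made up for this example. The first function builds a full serialized copy in RAM before writing (what I am guessing happens today); the second writes each weight to disk in small chunks through a `memoryview`, so no second copy is created.

```python
import io
import tracemalloc


def save_buffered(tensors, path):
    """Serialize everything into one in-memory buffer, then write it out.

    Peak extra RAM is at least the total size of all tensors.
    """
    buf = io.BytesIO()
    for data in tensors.values():
        buf.write(data)  # copies the tensor bytes into the buffer
    with open(path, "wb") as f:
        f.write(buf.getvalue())


def save_streamed(tensors, path, chunk_size=256 * 1024):
    """Write each tensor to disk chunk by chunk.

    memoryview slices do not copy the underlying bytes, so the extra
    RAM needed stays far below the total size of the tensors.
    """
    with open(path, "wb") as f:
        for data in tensors.values():
            view = memoryview(data)
            for start in range(0, len(view), chunk_size):
                f.write(view[start:start + chunk_size])


def peak_python_alloc(save_fn, tensors, path):
    """Return the peak number of bytes allocated by Python during save_fn."""
    tracemalloc.start()
    try:
        save_fn(tensors, path)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak
```

Measuring both with `peak_python_alloc` on the same fake "model" should show the buffered variant allocating at least the full model size on top of what is already in RAM, while the streamed variant stays well below that. A real implementation would additionally have to deal with the file format (for example safetensors headers) and with moving weights from VRAM to RAM one tensor at a time.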

Why is this important? Well, this does not affect people with more than 32 GB of system RAM. But since we go a long way to make the training process consume as few resources as possible (mostly VRAM on the GPU), it seems strange that for some people training will fail after it has actually completed, at the moment a backup is stored or the final model is written to disk. Essentially, this makes SDXL training with EMA enabled impossible for people with only "small" amounts of system RAM (32 GB is not enough), independent of the amount of available VRAM, and it probably also causes a lot of trouble for anyone who has "only" 16 GB of memory.

What did you expect would happen?

RAM consumption should only increase slightly when writing backups/saving the final model or intermediate states.

Relevant log output

No response

Output of pip freeze

No response

gilga2024, May 10 '24 16:05