
[Bug]: Double the RAM usage after upgrading to Torch 2.8

Open Ulexer opened this issue 1 month ago • 13 comments

What happened?

After updating to the latest commit I got an OOM while trying to train a LoRA for Qwen Image at bf16 with 1.0 CPU offload. During caching the model loaded as usual, with about 80 GB of RAM used. Then, once the actual training steps started, RAM usage climbed until OneTrainer crashed. I tried manually reverting to the commit that worked before but got an OOM again. After downgrading Torch to 2.7.1 the issue disappeared.
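For reference, a minimal sketch of how the RAM growth can be logged for comparison between Torch versions (this is not part of OneTrainer; it assumes the psutil package is installed and that OneTrainer's PID is passed on the command line):

    # ram_log.py - sample a process's resident memory once per second (illustrative sketch).
    import sys
    import time

    import psutil

    def log_rss(pid: int, interval_s: float = 1.0) -> None:
        proc = psutil.Process(pid)
        while True:
            try:
                rss_gb = proc.memory_info().rss / 1e9                 # process resident set size
                avail_gb = psutil.virtual_memory().available / 1e9    # free system RAM
            except psutil.NoSuchProcess:
                break  # the monitored process has exited (or was killed by the OS)
            print(f"{time.strftime('%H:%M:%S')}  rss={rss_gb:.1f} GB  available={avail_gb:.1f} GB", flush=True)
            time.sleep(interval_s)

    if __name__ == "__main__":
        log_rss(int(sys.argv[1]))

Logging both runs this way makes it easier to see whether the extra memory appears during model setup, caching, or the first training steps.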

What did you expect would happen?

Not getting an OOM with the same config that worked before.

Relevant log output

Starting UI...
C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\default.py:30: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
Fetching 16 files: 100%|█████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 1173.66it/s]
TensorFlow installation not found - running with reduced feature set.
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.19.0 at http://localhost:6006/ (Press CTRL+C to quit)
The config attributes {'pooled_projection_dim': 768} were passed to QwenImageTransformer2DModel, but are not expected and will be ignored. Please verify your config.json configuration file.
Selected layers: 720
Deselected layers: 126
Note: Enable Debug mode to see the full list of layer names
Exception in thread Reloader:
Traceback (most recent call last):
  File "C:\Users\ulexe\AppData\Local\Programs\Python\Python311\Lib\threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "C:\Users\ulexe\AppData\Local\Programs\Python\Python311\Lib\threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\data_ingester.py", line 108, in _reload
    self._multiplexer.Reload()
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\plugin_event_multiplexer.py", line 263, in Reload
    Worker()
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\plugin_event_multiplexer.py", line 241, in Worker
    accumulator.Reload()
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\plugin_event_accumulator.py", line 202, in Reload
    for event in self._generator.Load():
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\directory_watcher.py", line 88, in Load
    for event in self._LoadInternal():
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\directory_watcher.py", line 118, in _LoadInternal
    for event in self._loader.Load():
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\event_file_loader.py", line 270, in Load
    for event in super().Load():
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\event_file_loader.py", line 244, in Load
    for record in super().Load():
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\event_file_loader.py", line 178, in Load
    yield next(self._iterator)
          ^^^^^^^^^^^^^^^^^^^^
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\event_file_loader.py", line 109, in __next__
    self._reader.GetNext()
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\compat\tensorflow_stub\pywrap_tensorflow.py", line 207, in GetNext
    header_str = self._read(8)
                 ^^^^^^^^^^^^^
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\compat\tensorflow_stub\pywrap_tensorflow.py", line 273, in _read
    new_data = self.file_handle.read(n)
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\compat\tensorflow_stub\io\gfile.py", line 736, in read
    (self.buff, self.continuation_token) = self.fs.read(
                                           ^^^^^^^^^^^^^
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\compat\tensorflow_stub\io\gfile.py", line 141, in read
    data = f.read(size)
           ^^^^^^^^^^^^
MemoryError

Generate and upload debug_report.log

No response

Ulexer · Nov 14 '25 20:11

Cannot reproduce. 34 GB RAM during caching, 38 GB RAM during training, using the Qwen 16 GB full fine-tuning preset but with offloading fraction 1.0 as you described. Please try to reproduce using the preset. If it doesn't happen then, please post your config.

dxqb · Nov 15 '25 00:11

I was getting this issue with LoRA training, not fine-tuning. That said, the same thing happens with the Qwen 16 GB full fine-tuning default preset with offload set to 1.0.

Torch 2.7.1 - training works fine: [screenshot]

Torch 2.8 - OneTrainer crashes during "running model setup": [screenshot]

Only the Torch and Torchvision versions were different; both runs used the same commit and config.
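To double-check that only those two packages differ, the active versions can be printed from inside the venv (a trivial sketch):

    # Print the versions that are actually active in OneTrainer's venv.
    import torch
    import torchvision

    print("torch:", torch.__version__)
    print("torchvision:", torchvision.__version__)
    print("CUDA available:", torch.cuda.is_available())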

Ulexer · Nov 15 '25 10:11

Cannot reproduce. 34 GB RAM during caching, 38 GB RAM during training, using the Qwen 16 GB full fine-tuning preset but with offloading fraction 1.0 as you described. Please try to reproduce using the preset. If it doesn't happen then, please post your config.

dxqb · Nov 15 '25 10:11

The above images are from the default preset.

Ulexer · Nov 15 '25 10:11

> The above images are from the default preset.

You've mentioned you are using bf16 and an offloading fraction of 1.0; neither of those is the preset default. Can you just post your config, please?

dxqb · Nov 15 '25 10:11

Sure: config.json

The above testing was done with "#qwen Finetune 16GB", which by default uses bfloat16 for the prior and float8 for the text encoder, with offloading changed to 1.0. The OOM happens with both presets.

Ulexer · Nov 15 '25 10:11

Please try with the actual unmodified preset instead, so that we have a control. It only needs to be for a single epoch.

O-J1 · Nov 15 '25 11:11

> Please try with the actual unmodified preset instead, so that we have a control. It only needs to be for a single epoch.

I did. I got an OOM with "#qwen Finetune 16GB", unmodified except for changing CPU offload from 0.75 to 1.0. With Torch 2.8 it crashed before caching, during "running model setup". The same preset worked fine with Torch 2.7.1; I let it run for 5 steps before stopping.

Ulexer · Nov 15 '25 11:11

> Sure: config.json
>
> The above testing was done with "#qwen Finetune 16GB", which by default uses bfloat16 for the prior and float8 for the text encoder, with offloading changed to 1.0. The OOM happens with both presets.

With your config: about 67 GB system RAM+swap during caching and about 81 GB system RAM+swap during training (total OS values, not only OneTrainer). At your settings this is not surprising: with both the transformer and the TE at bf16, the raw model size is about 56 GB.
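As a rough sanity check of that figure (a back-of-the-envelope sketch; the parameter counts are assumptions of roughly 20B for the Qwen Image transformer and 8B for the text encoder):

    # Back-of-the-envelope raw weight sizes; the parameter counts are rough assumptions.
    PARAMS = {"transformer": 20e9, "text_encoder": 8e9}
    BYTES_PER_PARAM = {"bfloat16": 2, "float8": 1}

    def raw_weight_size_gb(dtype: str) -> float:
        """All weights in decimal GB, ignoring activations, optimizer state and buffers."""
        return sum(PARAMS.values()) * BYTES_PER_PARAM[dtype] / 1e9

    print(f"bf16 weights:   ~{raw_weight_size_gb('bfloat16'):.0f} GB")  # ~56 GB
    print(f"float8 weights: ~{raw_weight_size_gb('float8'):.0f} GB")    # ~28 GB

Everything else (latent caches, activations, temporary copies while loading) comes on top of the raw weights, which is consistent with the 67-81 GB totals observed above.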

If you get a "crash", please elaborate on what that means and post an error log; your RAM wasn't even full. If you do see a difference between Torch 2.7.1 and Torch 2.8.0, can you please try to reproduce it at more reasonable settings? I have 64 GB RAM and prefer not to run multiple tests at these settings. If you can show that the preset uses more RAM on Torch 2.8, I'll look into it.

dxqb · Nov 15 '25 11:11

I realized it may not be clear from the picture, but where the graph is flat my PC freezes for a while, and then Windows kills OneTrainer, presumably because it requested more memory than I have in RAM plus swap.

[screenshot]

The error log was posted in the very first message; it's the same for all crashes.

I will try LoRA training at float8 then.

Ulexer · Nov 15 '25 12:11

Torch 2.8 model loading: [screenshot]

Torch 2.7.1 model loading: [screenshot]

Torch 2.8 training: [screenshot]

Torch 2.7.1 training: [screenshot]

LoRA training with the transformer and text encoder at float8: about 17 GB more RAM used during training on Torch 2.8.

Ulexer · Nov 15 '25 12:11

Cannot reproduce. 39 GB during training, never above 40 GB during loading; those are total OS values again. No discernible difference between Torch 2.8 and Torch 2.7.1.

If this is an issue introduced by Torch 2.8, it must be Windows-only or in some other way specific to your system.

dxqb · Nov 15 '25 13:11

Using the default profile: no issues on either Windows or Linux. There is no VRAM usage increase with Torch 2.8, just a small bump in training speed.

djp3k05 · Nov 15 '25 14:11