[Bug]: Cancelling a training session that uses CPU offloading reliably breaks further training until the app is closed and reopened
What happened?
- Pressed stop
- Waited for it to stop
- Adjusted the batch size (I also tried this another time while changing the model data type instead)
- Pressed start
- Training could not proceed due to a CUDA error, which I suspect is caused by something not being cleaned up properly.
This has a 100% repro rate for me. I tried four separate times whilst testing claims in the Discord that XYZ was not working.
What did you expect would happen?
That I would be able to continue training with new settings after waiting for training to stop
Relevant log output
```
TensorBoard 2.18.0 at http://localhost:6006/ (Press CTRL+C to quit)
Traceback (most recent call last):
File "C:\repos\OneTrainer\modules\ui\TrainUI.py", line 560, in __training_thread_function
trainer.start()
File "C:\repos\OneTrainer\modules\trainer\GenericTrainer.py", line 135, in start
self.model_setup.setup_train_device(self.model, self.config)
File "C:\repos\OneTrainer\modules\modelSetup\StableDiffusion3LoRASetup.py", line 245, in setup_train_device
model.transformer_to(self.train_device)
File "C:\repos\OneTrainer\modules\model\StableDiffusion3Model.py", line 164, in transformer_to
self.transformer_offload_conductor.to(device)
File "C:\repos\OneTrainer\modules\util\LayerOffloadConductor.py", line 508, in to
offload_quantized(module, self.__temp_device, allocator=allocator.allocate_like)
File "C:\repos\OneTrainer\modules\util\quantization_util.py", line 184, in offload_quantized
tensor = allocator(module.weight)
File "C:\repos\OneTrainer\modules\util\LayerOffloadConductor.py", line 84, in allocate_like
self.__layer_allocator.ensure_allocation(cache_tensor_index)
File "C:\repos\OneTrainer\modules\util\LayerOffloadConductor.py", line 201, in ensure_allocation
pin_tensor_(self.cache_tensors[cache_tensor_index])
File "C:\repos\OneTrainer\modules\util\torch_util.py", line 190, in pin_tensor_
raise RuntimeError(f"CUDA Error while trying to pin memory. error: {err.value}, ptr: {x.data_ptr()}, size: {x.numel() * x.element_size()}")
RuntimeError: CUDA Error while trying to pin memory. error: 712, ptr: 2575826944256, size: 1653732967
```
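For reference, CUDA error 712 is cudaErrorHostMemoryAlreadyRegistered: part or all of the host memory range passed to cudaHostRegister is already registered. That matches the "something not cleaned up properly" theory, since it suggests the cancelled run left its offload cache tensors pinned. The snippet below is a minimal sketch, not OneTrainer code; it assumes torch.cuda.cudart() exposes the cudaHostRegister/cudaHostUnregister bindings (which pin_tensor_ appears to use) and only shows that registering the same host buffer twice without unregistering produces this exact error code:

```
import torch

# Plain pageable CPU tensor; a CUDA-capable GPU is required for the cudart calls.
t = torch.empty(1024, dtype=torch.uint8)
cudart = torch.cuda.cudart()
ptr, size = t.data_ptr(), t.numel() * t.element_size()

err = cudart.cudaHostRegister(ptr, size, 0)
print("first register:", err)   # expected: success

err = cudart.cudaHostRegister(ptr, size, 0)
print("second register:", err)  # expected: error 712, memory already registered

# Without an explicit unregister, the pages stay registered for the lifetime of
# the process, which would explain why closing and reopening the app "fixes" it.
cudart.cudaHostUnregister(ptr)
```

If the cancel path skips the corresponding unpin/unregister step (or keeps the old cache tensors alive), the next run would hit exactly this when it tries to pin the same memory again.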
Output of pip freeze
```
absl-py==2.1.0
accelerate==1.0.1
aiohappyeyeballs==2.4.3
aiohttp==3.11.2
aiosignal==1.3.1
annotated-types==0.7.0
antlr4-python3-runtime==4.9.3
async-timeout==5.0.1
attrs==24.2.0
bitsandbytes==0.44.1
certifi==2024.8.30
charset-normalizer==3.4.0
cloudpickle==3.1.0
colorama==0.4.6
coloredlogs==15.0.1
contourpy==1.3.1
customtkinter==5.2.2
cycler==0.12.1
dadaptation==3.2
darkdetect==0.8.0
-e git+https://github.com/huggingface/diffusers.git@e45c25d03aeb0a967d8aaa0f6a79f280f6838e1f#egg=diffusers
filelock==3.16.1
flatbuffers==24.3.25
fonttools==4.55.0
frozenlist==1.5.0
fsspec==2024.10.0
ftfy==6.3.1
grpcio==1.68.0
huggingface-hub==0.26.2
humanfriendly==10.0
idna==3.10
importlib_metadata==8.5.0
invisible-watermark==0.2.0
Jinja2==3.1.4
kiwisolver==1.4.7
lightning-utilities==0.11.8
lion-pytorch==0.2.2
Markdown==3.7
markdown-it-py==3.0.0
MarkupSafe==3.0.2
matplotlib==3.9.2
mdurl==0.1.2
-e git+https://github.com/Nerogar/mgds.git@f9edb99bea18da54440c4600894027706b5172ce#egg=mgds
mpmath==1.3.0
multidict==6.1.0
networkx==3.4.2
numpy==1.26.4
nvidia-ml-py==12.560.30
omegaconf==2.3.0
onnxruntime-gpu==1.19.2
open_clip_torch==2.28.0
opencv-python==4.10.0.84
packaging==24.2
pillow==11.0.0
platformdirs==4.3.6
pooch==1.8.2
prodigyopt==1.0
propcache==0.2.0
protobuf==5.28.3
psutil==6.1.0
pydantic==2.9.2
pydantic_core==2.23.4
Pygments==2.18.0
pyparsing==3.2.0
pyreadline3==3.5.4
python-dateutil==2.9.0.post0
pytorch-lightning==2.4.0
pytorch_optimizer==3.1.2
PyWavelets==1.7.0
PyYAML==6.0.2
regex==2024.11.6
requests==2.32.3
rich==13.9.4
safetensors==0.4.5
scalene==1.5.45
schedulefree==1.2.7
sentencepiece==0.2.0
six==1.16.0
sympy==1.13.1
tensorboard==2.18.0
tensorboard-data-server==0.7.2
timm==1.0.11
tokenizers==0.20.3
torch==2.5.1+cu124
torchmetrics==1.6.0
torchvision==0.20.1+cu124
tqdm==4.66.6
transformers==4.46.0
typing_extensions==4.12.2
urllib3==2.2.3
wcwidth==0.2.13
Werkzeug==3.1.3
xformers==0.0.28.post3
yarl==1.17.1
zipp==3.21.0
```
Suggested solution from Discord: "I was able to resolve #574 by fixing the library paths and adding this to my venv activate: `export LD_LIBRARY_PATH="$ONETRAINER_PATH/venv/lib/python3.12/site-packages/nvidia/cudnn/lib:$LD_LIBRARY_PATH"`"
@Arcitec Thoughts on this ^? Should we add it?
Please see here: https://github.com/Nerogar/OneTrainer/issues/820
I'm not merging the issues because I'm not sure this is related. I narrowed mine down to a cause, but I didn't need to start a second time; it happens on the first run.
You also have quite a large tensor pin in your log output above, though (~1.5 GB), so it could be related.
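If it helps narrow things down: a quick check for whether a pin of that size can succeed at all (the log shows 1,653,732,967 bytes, ~1.5 GB) is to pin a fresh buffer through PyTorch's own API right after cancelling a run. This is only a diagnostic sketch, not a fix; if it succeeds while OneTrainer's re-pin still fails with error 712, that points at a stale registration left over from the cancelled session rather than exhausted pinnable memory:

```
import torch

# Allocate a fresh host buffer roughly the size of the failed pin from the log
# (1,653,732,967 bytes) and ask PyTorch to page-lock it.
size_bytes = 1_653_732_967
buf = torch.empty(size_bytes, dtype=torch.uint8)
pinned = buf.pin_memory()  # raises a RuntimeError if page-locking fails
print("pinned OK:", pinned.is_pinned())
```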