[Bug]: Cancelling a training session that uses CPU offloading reliably breaks further training until the app is closed and reopened
What happened?
- Pressed stop
- Waited for it to stop
- Adjusted the batch size (I also tried this another time while changing the model data type instead)
- Pressed start
- Training could not proceed due to a CUDA error, which I suspect is caused by something not being cleaned up properly.
This has a 100% repro rate for me. I tried four separate times whilst testing claims in the Discord that XYZ was not working.
What did you expect would happen?
That I would be able to continue training with new settings after waiting for training to stop
Relevant log output
```
TensorBoard 2.18.0 at http://localhost:6006/ (Press CTRL+C to quit)
Traceback (most recent call last):
File "C:\repos\OneTrainer\modules\ui\TrainUI.py", line 560, in __training_thread_function
trainer.start()
File "C:\repos\OneTrainer\modules\trainer\GenericTrainer.py", line 135, in start
self.model_setup.setup_train_device(self.model, self.config)
File "C:\repos\OneTrainer\modules\modelSetup\StableDiffusion3LoRASetup.py", line 245, in setup_train_device
model.transformer_to(self.train_device)
File "C:\repos\OneTrainer\modules\model\StableDiffusion3Model.py", line 164, in transformer_to
self.transformer_offload_conductor.to(device)
File "C:\repos\OneTrainer\modules\util\LayerOffloadConductor.py", line 508, in to
offload_quantized(module, self.__temp_device, allocator=allocator.allocate_like)
File "C:\repos\OneTrainer\modules\util\quantization_util.py", line 184, in offload_quantized
tensor = allocator(module.weight)
File "C:\repos\OneTrainer\modules\util\LayerOffloadConductor.py", line 84, in allocate_like
self.__layer_allocator.ensure_allocation(cache_tensor_index)
File "C:\repos\OneTrainer\modules\util\LayerOffloadConductor.py", line 201, in ensure_allocation
pin_tensor_(self.cache_tensors[cache_tensor_index])
File "C:\repos\OneTrainer\modules\util\torch_util.py", line 190, in pin_tensor_
raise RuntimeError(f"CUDA Error while trying to pin memory. error: {err.value}, ptr: {x.data_ptr()}, size: {x.numel() * x.element_size()}")
RuntimeError: CUDA Error while trying to pin memory. error: 712, ptr: 2575826944256, size: 1653732967
```
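For reference, CUDA error 712 is cudaErrorHostMemoryAlreadyRegistered: part or all of the host memory range passed to cudaHostRegister is already registered. That matches the "something not cleaned up properly" theory, since it suggests the cancelled run left its offload cache tensors pinned. The snippet below is a minimal sketch, not OneTrainer code; it assumes torch.cuda.cudart() exposes the cudaHostRegister/cudaHostUnregister bindings (which pin_tensor_ appears to use) and only shows that registering the same host buffer twice without unregistering produces this exact error code:

```
import torch

# Plain pageable CPU tensor; a CUDA-capable GPU is required for the cudart calls.
t = torch.empty(1024, dtype=torch.uint8)
cudart = torch.cuda.cudart()
ptr, size = t.data_ptr(), t.numel() * t.element_size()

err = cudart.cudaHostRegister(ptr, size, 0)
print("first register:", err)   # expected: success

err = cudart.cudaHostRegister(ptr, size, 0)
print("second register:", err)  # expected: error 712, memory already registered

# Without an explicit unregister, the pages stay registered for the lifetime of
# the process, which would explain why closing and reopening the app "fixes" it.
cudart.cudaHostUnregister(ptr)
```

If the cancel path skips the corresponding unpin/unregister step (or keeps the old cache tensors alive), the next run would hit exactly this when it tries to pin the same memory again.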
Output of pip freeze
```
absl-py==2.1.0
accelerate==1.0.1
aiohappyeyeballs==2.4.3
aiohttp==3.11.2
aiosignal==1.3.1
annotated-types==0.7.0
antlr4-python3-runtime==4.9.3
async-timeout==5.0.1
attrs==24.2.0
bitsandbytes==0.44.1
certifi==2024.8.30
charset-normalizer==3.4.0
cloudpickle==3.1.0
colorama==0.4.6
coloredlogs==15.0.1
contourpy==1.3.1
customtkinter==5.2.2
cycler==0.12.1
dadaptation==3.2
darkdetect==0.8.0
-e git+https://github.com/huggingface/diffusers.git@e45c25d03aeb0a967d8aaa0f6a79f280f6838e1f#egg=diffusers
filelock==3.16.1
flatbuffers==24.3.25
fonttools==4.55.0
frozenlist==1.5.0
fsspec==2024.10.0
ftfy==6.3.1
grpcio==1.68.0
huggingface-hub==0.26.2
humanfriendly==10.0
idna==3.10
importlib_metadata==8.5.0
invisible-watermark==0.2.0
Jinja2==3.1.4
kiwisolver==1.4.7
lightning-utilities==0.11.8
lion-pytorch==0.2.2
Markdown==3.7
markdown-it-py==3.0.0
MarkupSafe==3.0.2
matplotlib==3.9.2
mdurl==0.1.2
-e git+https://github.com/Nerogar/mgds.git@f9edb99bea18da54440c4600894027706b5172ce#egg=mgds
mpmath==1.3.0
multidict==6.1.0
networkx==3.4.2
numpy==1.26.4
nvidia-ml-py==12.560.30
omegaconf==2.3.0
onnxruntime-gpu==1.19.2
open_clip_torch==2.28.0
opencv-python==4.10.0.84
packaging==24.2
pillow==11.0.0
platformdirs==4.3.6
pooch==1.8.2
prodigyopt==1.0
propcache==0.2.0
protobuf==5.28.3
psutil==6.1.0
pydantic==2.9.2
pydantic_core==2.23.4
Pygments==2.18.0
pyparsing==3.2.0
pyreadline3==3.5.4
python-dateutil==2.9.0.post0
pytorch-lightning==2.4.0
pytorch_optimizer==3.1.2
PyWavelets==1.7.0
PyYAML==6.0.2
regex==2024.11.6
requests==2.32.3
rich==13.9.4
safetensors==0.4.5
scalene==1.5.45
schedulefree==1.2.7
sentencepiece==0.2.0
six==1.16.0
sympy==1.13.1
tensorboard==2.18.0
tensorboard-data-server==0.7.2
timm==1.0.11
tokenizers==0.20.3
torch==2.5.1+cu124
torchmetrics==1.6.0
torchvision==0.20.1+cu124
tqdm==4.66.6
transformers==4.46.0
typing_extensions==4.12.2
urllib3==2.2.3
wcwidth==0.2.13
Werkzeug==3.1.3
xformers==0.0.28.post3
yarl==1.17.1
zipp==3.21.0
```
Suggested solution from Discord: "I was able to resolve #574 by fixing the library paths and adding this to my venv activate: `export LD_LIBRARY_PATH="$ONETRAINER_PATH/venv/lib/python3.12/site-packages/nvidia/cudnn/lib:$LD_LIBRARY_PATH"`"
@Arcitec Thoughts on this ^? Should we add it?
Please see here: https://github.com/Nerogar/OneTrainer/issues/820
I'm not merging the issues because I'm not sure this is related. I narrowed mine down to a cause, but I didn't need to start a second time; it happens on the first run.
You also have quite a large tensor pin in your log output above, though (~1.5 GB), so it could be related.
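If it helps narrow things down: a quick check for whether a pin of that size can succeed at all (the log shows 1,653,732,967 bytes, ~1.5 GB) is to pin a fresh buffer through PyTorch's own API right after cancelling a run. This is only a diagnostic sketch, not a fix; if it succeeds while OneTrainer's re-pin still fails with error 712, that points at a stale registration left over from the cancelled session rather than exhausted pinnable memory:

```
import torch

# Allocate a fresh host buffer roughly the size of the failed pin from the log
# (1,653,732,967 bytes) and ask PyTorch to page-lock it.
size_bytes = 1_653_732_967
buf = torch.empty(size_bytes, dtype=torch.uint8)
pinned = buf.pin_memory()  # raises a RuntimeError if page-locking fails
print("pinned OK:", pinned.is_pinned())
```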