ai-toolkit
Possible NPE with 2 latest commits
Howdy!
I trained a LoRA this morning and completed all 3000 steps without problems.
I then saw the 2 new patches come in, so I did a git pull and re-ran the same training config.
After 1867 steps the run failed with the following:
my-lora: 62%| 1867/3000 [1:15:56<59:02, 3.13s/it, lr: 1.0e-04 loss: 2.105e-01]
Error running job: <class 'weakref.ReferenceType'> returned NULL without setting an exception
========================================
Result:
- 0 completed jobs
- 1 failure
========================================
Traceback (most recent call last):
  File "D:\AI-Programs\ai-toolkit\run.py", line 90, in <module>
    main()
  File "D:\AI-Programs\ai-toolkit\run.py", line 86, in main
    raise e
  File "D:\AI-Programs\ai-toolkit\run.py", line 78, in main
    job.run()
  File "D:\AI-Programs\ai-toolkit\jobs\ExtensionJob.py", line 22, in run
    process.run()
  File "D:\AI-Programs\ai-toolkit\jobs\process\BaseSDTrainProcess.py", line 1701, in run
    loss_dict = self.hook_train_loop(batch)
  File "D:\AI-Programs\ai-toolkit\extensions_built_in\sd_trainer\SDTrainer.py", line 1520, in hook_train_loop
    self.scaler.scale(loss).backward()
  File "D:\AI-Programs\ai-toolkit\venv\lib\site-packages\torch\_tensor.py", line 521, in backward
    torch.autograd.backward(
  File "D:\AI-Programs\ai-toolkit\venv\lib\site-packages\torch\autograd\__init__.py", line 289, in backward
    _engine_run_backward(
  File "D:\AI-Programs\ai-toolkit\venv\lib\site-packages\torch\autograd\graph.py", line 768, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "D:\AI-Programs\ai-toolkit\venv\lib\site-packages\torch\utils\checkpoint.py", line 1120, in unpack_hook
    frame.check_recomputed_tensors_match(gid)
  File "D:\AI-Programs\ai-toolkit\venv\lib\site-packages\torch\utils\checkpoint.py", line 880, in check_recomputed_tensors_match
    _internal_assert(holder.handles[gid] in self.recomputed[gid])
  File "C:\Users\their\AppData\Local\Programs\Python\Python310\lib\weakref.py", line 457, in __contains__
    wr = ref(key)
SystemError: <class 'weakref.ReferenceType'> returned NULL without setting an exception
I'm going to try again shortly, but wanted to open an issue just in case there was a regression.
I ran the same script again and the error did not reproduce.
Odd, I have never seen this before. I am going to close for now since it did not reproduce. Please reopen if it happens again and attach any additional details you see.
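If it does come back, version details are probably the most useful thing to attach alongside the traceback. Below is a minimal sketch for gathering them. The helper name `collect_env_details` and the default package list are hypothetical, picked for illustration; they are not part of ai-toolkit.

```python
import platform
import sys
from importlib import metadata


def collect_env_details(packages=("torch",)):
    """Return a dict of basic environment facts worth pasting into a bug report.

    `packages` is an illustrative default; add whatever else the run depends on.
    """
    details = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            # Reads the installed distribution's version without importing it.
            details[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            details[name] = "not installed"
    return details


if __name__ == "__main__":
    for key, value in collect_env_details().items():
        print(f"{key}: {value}")
```

Run it inside the same venv that ai-toolkit uses so the reported versions match the failing run. PyTorch also ships `python -m torch.utils.collect_env`, which prints a fuller report including CUDA and driver details.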