Possible NPE with 2 latest commits

Open TheIronDev opened this issue 1 year ago • 1 comments

Howdy!

I trained a LoRA this morning and it completed all 3000 steps successfully.

I saw the 2 new patches come in, so I did a git pull and re-ran the same training config.

After 1867 steps I encountered the following:

my-lora:  62%| 1867/3000 [1:15:56<59:02,  3.13s/it, lr: 1.0e-04 loss: 2.105e-01]

Error running job: <class 'weakref.ReferenceType'> returned NULL without setting an exception

========================================
Result:
 - 0 completed jobs
 - 1 failure
========================================
Traceback (most recent call last):
  File "D:\AI-Programs\ai-toolkit\run.py", line 90, in <module>
    main()
  File "D:\AI-Programs\ai-toolkit\run.py", line 86, in main
    raise e
  File "D:\AI-Programs\ai-toolkit\run.py", line 78, in main
    job.run()
  File "D:\AI-Programs\ai-toolkit\jobs\ExtensionJob.py", line 22, in run
    process.run()
  File "D:\AI-Programs\ai-toolkit\jobs\process\BaseSDTrainProcess.py", line 1701, in run
    loss_dict = self.hook_train_loop(batch)
  File "D:\AI-Programs\ai-toolkit\extensions_built_in\sd_trainer\SDTrainer.py", line 1520, in hook_train_loop
    self.scaler.scale(loss).backward()
  File "D:\AI-Programs\ai-toolkit\venv\lib\site-packages\torch\_tensor.py", line 521, in backward
    torch.autograd.backward(
  File "D:\AI-Programs\ai-toolkit\venv\lib\site-packages\torch\autograd\__init__.py", line 289, in backward
    _engine_run_backward(
  File "D:\AI-Programs\ai-toolkit\venv\lib\site-packages\torch\autograd\graph.py", line 768, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "D:\AI-Programs\ai-toolkit\venv\lib\site-packages\torch\utils\checkpoint.py", line 1120, in unpack_hook
    frame.check_recomputed_tensors_match(gid)
  File "D:\AI-Programs\ai-toolkit\venv\lib\site-packages\torch\utils\checkpoint.py", line 880, in check_recomputed_tensors_match
    _internal_assert(holder.handles[gid] in self.recomputed[gid])
  File "C:\Users\their\AppData\Local\Programs\Python\Python310\lib\weakref.py", line 457, in __contains__
    wr = ref(key)
SystemError: <class 'weakref.ReferenceType'> returned NULL without setting an exception

I'm going to try again shortly, but wanted to open an issue just in case there was a regression.

TheIronDev avatar Aug 14 '24 21:08 TheIronDev

I ran the same script again and the error did not reproduce.

TheIronDev avatar Aug 15 '24 04:08 TheIronDev

Odd, I have never seen this before. I am going to close for now since it did not reproduce. Please reopen if it happens again and attach any additional details you see.

jaretburkett avatar Aug 19 '24 03:08 jaretburkett
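
For anyone who hits this again and wants to attach additional details as requested above, here is a minimal sketch of one way to capture the environment alongside the traceback. It is not part of ai-toolkit: `package_version`, `collect_report`, and `run_with_report` are hypothetical helper names, and the choice of fields to record (Python version, platform, installed `torch` version) is an assumption about what would be useful for a failure like the `SystemError` raised in `weakref.py` above.

```python
import platform
import sys
import traceback
from importlib import metadata


def package_version(name: str) -> str:
    """Return the installed version of a package without importing it."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def collect_report(exc: BaseException) -> str:
    """Build a plain-text report for an exception (hypothetical helper)."""
    lines = [
        f"python: {sys.version.split()[0]}",
        f"platform: {platform.platform()}",
        f"torch: {package_version('torch')}",
        f"error: {type(exc).__name__}: {exc}",
        "traceback:",
        "".join(traceback.format_exception(type(exc), exc, exc.__traceback__)),
    ]
    return "\n".join(lines)


def run_with_report(job):
    """Run a callable; on failure print the report, then re-raise unchanged."""
    try:
        job()
    except Exception as exc:
        print(collect_report(exc))
        raise
```

Wrapping the training entry point in `run_with_report` (for example, a function that calls the same code `run.py` runs) would print the report before the original error propagates, so the output can be pasted into the issue as-is.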