local variable 'loss_dict' referenced before assignment

Open · PGCRT opened this issue 2 months ago · 9 comments

I've been trying to train WAN 2.2 I2V for hours with different settings (5090, 512px, rank 16, low VRAM mode) and can't get past this error. I'm using the RunPod template, and the GPU is barely hitting 50% of VRAM usage when the error pops up.

Caching text_embeddings for /app/ai-toolkit/datasets/jih

  • Saving text embeddings to disk
Caching text embeddings to disk: 100%|##########| 554/554 [00:00<00:00, 12745.43it/s]
***** UNLOADING TEXT ENCODER *****
Embeddings cached to disk. We dont need the text encoder anymore

Skipping first sample due to config setting
SS: 0%| | 0/5000 [00:00<?, ?it/s]
################################################

OOM during training step, skipping batch 1/3

################################################
Error running job: local variable 'loss_dict' referenced before assignment

Result:

  • 0 completed jobs
  • 1 failure
========================================
Traceback (most recent call last):
  File "/app/ai-toolkit/run.py", line 120, in <module>
    main()
  File "/app/ai-toolkit/run.py", line 108, in main
    raise e
  File "/app/ai-toolkit/run.py", line 96, in main
    job.run()
  File "/app/ai-toolkit/jobs/ExtensionJob.py", line 22, in run
    process.run()
  File "/app/ai-toolkit/jobs/process/BaseSDTrainProcess.py", line 2208, in run
    for key, value in loss_dict.items():
UnboundLocalError: local variable 'loss_dict' referenced before assignment
SS: 0%| | 0/5000 [00:18<?, ?it/s]

PGCRT · Oct 16 '25
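For context on why an out-of-memory condition ends up as an UnboundLocalError: the traceback points at a post-step logging loop (`for key, value in loss_dict.items()` at BaseSDTrainProcess.py line 2208) that still runs after a batch is skipped on OOM, so `loss_dict` was never assigned. Below is a minimal guarded sketch of that pattern, assuming a hypothetical `run_step` helper; it is not the actual ai-toolkit code.

```python
import torch

def train_loop(steps, run_step):
    # Guarded sketch of the pattern behind the traceback, not the real
    # BaseSDTrainProcess.run(). `run_step` is a hypothetical callable that
    # performs one training step and returns a dict of loss values.
    for step in range(steps):
        loss_dict = None  # keep the name bound even when the step is skipped
        try:
            loss_dict = run_step(step)  # assigned only if the step succeeds
        except torch.cuda.OutOfMemoryError:
            print("OOM during training step, skipping batch")
            torch.cuda.empty_cache()  # free cached blocks before the next step

        if loss_dict is None:
            # In an unguarded version (no initialization above and no
            # `continue` here), the logging loop below runs with loss_dict
            # never assigned and raises the UnboundLocalError from this report.
            continue

        # Same shape as the line the traceback blames (BaseSDTrainProcess.py:2208).
        for key, value in loss_dict.items():
            print(f"{key}: {value:.4f}")
```

Either way, the underlying problem is still the OOM itself; the guard only changes how it fails.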

Probably caused by OOM; try reverting to commit c6edd71a5bb36f3dffcc8b56ee07cacaee14ab56. See #457.

DefinitlyEvil · Oct 16 '25

Same problem reported for Qwen Image Edit when choosing a large batch size; probably OOM as well, but it still ends in this same error.

YacratesWyh · Oct 21 '25

Yeah, same issue with RunPod and a Wan2.2 I2V LoRA (on a 5090 that seemed to work earlier with a similar setup).

nakedfighter3d · Oct 22 '25

Yeah, same issue with RunPod and a Wan2.2 I2V LoRA

Try again with the minimum dimensions (256). Even on a 5090, this is an out-of-memory error. I thought 512 would work, but no.

PGCRT · Oct 22 '25
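A note on why the resolution drop matters so much: activation and latent memory grow with the pixel count per frame, so going from 512px to 256px cuts that part of the footprint roughly four-fold (attention can scale even faster with token count, so treat this as a lower bound). A quick back-of-the-envelope check:

```python
# Rough lower bound on how training resolution scales per-frame activation/latent memory.
# Assumes memory grows at least linearly with pixels per frame (an assumption,
# not a measurement of ai-toolkit itself).
for res in (256, 384, 512, 768):
    ratio = (res * res) / (256 * 256)
    print(f"{res}px -> about {ratio:.2f}x the per-frame memory of 256px")
```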

Yeah, it was just a simple OOM. More VRAM solved it, as expected.

nakedfighter3d · Oct 23 '25

Yeah, it was just a simple OOM. More VRAM solved it, as expected.

How much VRAM were you using? I have 96GB of VRAM and it's not working with the latest commit.

DefinitlyEvil · Oct 24 '25

Yeah, it was just a simple OOM. More VRAM solved it, as expected.

How much VRAM were you using? I have 96GB of VRAM and it's not working with the latest commit.

I switched to 96GB and tweaked some settings to be on the safe side. Training at 768x768.

nakedfighter3d · Oct 24 '25

Yeah, it was just a simple OOM. More VRAM solved it, as expected.

How much VRAM were you using? I have 96GB of VRAM and it's not working with the latest commit.

I switched to 96GB and tweaked some settings to be on the safe side. Training at 768x768.

The latest codebase works for you now? Cool! May I also ask how much system RAM you have? Maybe that was my bottleneck. Thanks.

DefinitlyEvil · Oct 25 '25

Same issue here: WAN I2V, a 50-step training run with two files at 512x512 on a 5090, OOM on 3 consecutive batches. T2V works fine.

################################################

OOM during training step, skipping batch 1/3

################################################
test: 0%| | 0/50 [01:39<?, ?it/s]
################################################

OOM during training step, skipping batch 2/3

################################################
test: 2%|2 | 1/50 [03:21<1:23:01, 101.66s/it]
################################################

OOM during training step, skipping batch 3/3

################################################
test: 4%|4 | 2/50 [04:28<1:05:10, 81.47s/it]
RuntimeError: OOM during training step 3 times in a row, aborting training
test: 4%|4 | 2/50 [13:07<5:14:58, 393.72s/it]

99-bolt · Nov 04 '25
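For completeness, this last log also shows the skip-then-abort behavior working as intended: the trainer tolerates a limited number of consecutive OOM steps and then raises. A minimal sketch of that pattern, assuming a limit of 3 as in the messages above; the helper names (`run_with_oom_tolerance`, `train_one_batch`) are hypothetical, not ai-toolkit's API.

```python
import torch

MAX_CONSECUTIVE_OOM = 3  # matches the "skipping batch N/3" messages in the log above

def run_with_oom_tolerance(batches, train_one_batch):
    # Skip a batch on CUDA OOM, but give up after 3 OOMs in a row.
    consecutive_oom = 0
    for batch in batches:
        try:
            train_one_batch(batch)
            consecutive_oom = 0  # a successful step resets the counter
        except torch.cuda.OutOfMemoryError:
            consecutive_oom += 1
            print(f"OOM during training step, skipping batch "
                  f"{consecutive_oom}/{MAX_CONSECUTIVE_OOM}")
            torch.cuda.empty_cache()  # release cached blocks before retrying
            if consecutive_oom >= MAX_CONSECUTIVE_OOM:
                raise RuntimeError(
                    "OOM during training step 3 times in a row, aborting training"
                )
```

Skipped batches do no training, so if every step OOMs the real fix is still a lower resolution, a smaller batch size, or more VRAM, as the earlier comments found.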