local variable 'loss_dict' referenced before assignment
I've been trying to train WAN 2.2 I2V for hours with different settings, using a 5090, 512px, rank 16, and low VRAM mode, and it's impossible to get past this error. I'm using the RunPod template, and the GPU is barely hitting 50% of VRAM usage when the error pops up.
Caching text_embeddings for /app/ai-toolkit/datasets/jih
- Saving text embeddings to disk
Caching text embeddings to disk: 100%|##########| 554/554 [00:00<00:00, 12745.43it/s]
***** UNLOADING TEXT ENCODER *****
Embeddings cached to disk. We dont need the text encoder anymore
Skipping first sample due to config setting
SS:   0%|          | 0/5000 [00:00<?, ?it/s]
################################################
OOM during training step, skipping batch 1/3
################################################
Error running job: local variable 'loss_dict' referenced before assignment
Result:
- 0 completed jobs
- 1 failure
========================================
Traceback (most recent call last):
File "/app/ai-toolkit/run.py", line 120, in
main() File "/app/ai-toolkit/run.py", line 108, in main raise e File "/app/ai-toolkit/run.py", line 96, in main job.run() File "/app/ai-toolkit/jobs/ExtensionJob.py", line 22, in run process.run() File "/app/ai-toolkit/jobs/process/BaseSDTrainProcess.py", line 2208, in run for key, value in loss_dict.items(): UnboundLocalError: local variable 'loss_dict' referenced before assignment Traceback (most recent call last): File "/app/ai-toolkit/run.py", line 120, in main() File "/app/ai-toolkit/run.py", line 108, in main raise e File "/app/ai-toolkit/run.py", line 96, in main job.run() File "/app/ai-toolkit/jobs/ExtensionJob.py", line 22, in run process.run() File "/app/ai-toolkit/jobs/process/BaseSDTrainProcess.py", line 2208, in run for key, value in loss_dict.items(): UnboundLocalError: local variable 'loss_dict' referenced before assignment SS: 0%| | 0/5000 [00:18<?, ?it/s]
Probably caused by OOM; try reverting to commit c6edd71a5bb36f3dffcc8b56ee07cacaee14ab56. See #457.
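For context on why an OOM surfaces as an UnboundLocalError: if every accumulation batch in a step hits OOM and gets skipped, `loss_dict` is never assigned, but the code after the loop still iterates over it. Below is a minimal sketch of that failure pattern and the obvious guard; it is not the actual BaseSDTrainProcess code, and `train_single_accumulation` is a hypothetical stand-in:

```python
import torch

def train_single_accumulation(batch):
    """Hypothetical stand-in for the real forward/backward pass."""
    raise torch.cuda.OutOfMemoryError("simulated OOM")  # pretend every batch OOMs

def run_step(batches):
    loss_dict = {}  # guard: bind the name up front so it exists even if every batch is skipped
    for i, batch in enumerate(batches):
        try:
            loss_dict = train_single_accumulation(batch)
        except torch.cuda.OutOfMemoryError:
            print(f"OOM during training step, skipping batch {i + 1}/{len(batches)}")
            torch.cuda.empty_cache()
            continue
    # Without the initialization above, this loop raises
    # UnboundLocalError whenever every batch in the step OOMed.
    for key, value in loss_dict.items():
        print(key, value)

run_step([None, None, None])
```

Either way, the UnboundLocalError only masks the real failure, which is the OOM itself.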
Same problem reported for Qwen Image Edit when choosing a large batch size. Probably OOM as well, but reporting it here anyway.
Yeah, same issue with RunPod and a WAN 2.2 I2V LoRA (and a 5090 that seemed to work earlier with a similar setup).
Try again with the minimum dimensions (256); even if you are on a 5090, this is an out-of-memory error. I thought 512 would work, but it doesn't.
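If you want to check whether a lower resolution actually fits before committing to a long run, PyTorch's allocator stats are more useful than the utilization figure in nvidia-smi, since the OOM usually comes from a transient peak inside a single step rather than steady-state usage. A small helper using only standard torch.cuda calls (nothing ai-toolkit-specific):

```python
import torch

def report_vram(tag: str) -> None:
    """Print current, reserved, and peak allocated VRAM on the default CUDA device."""
    if not torch.cuda.is_available():
        print(f"[{tag}] CUDA not available")
        return
    gib = 1024 ** 3
    print(
        f"[{tag}] allocated={torch.cuda.memory_allocated() / gib:.2f} GiB "
        f"reserved={torch.cuda.memory_reserved() / gib:.2f} GiB "
        f"peak={torch.cuda.max_memory_allocated() / gib:.2f} GiB"
    )

# Example usage around a single training step:
#   torch.cuda.reset_peak_memory_stats()
#   report_vram("before step")
#   ... run one step ...
#   report_vram("after step")   # "peak" is what actually has to fit in VRAM
```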
Yeah, it was just a simple OOM. More VRAM solved it, as expected.
How much VRAM were you using? I have 96GB of VRAM and it's not working with the latest commit.
I have switched to 96GB and tweaked some settings to be on the safe side. Training at 768x768.
The latest codebase works for you now? Cool! May I also ask how much system RAM you have? Maybe that was my bottleneck. Thanks.
Same issue here: WAN I2V, a 50-step training run with two files at 512x512 on a 5090, OOM on 3 consecutive batches. T2V works fine.
################################################
OOM during training step, skipping batch 1/3
################################################
test:   0%|          | 0/50 [01:39<?, ?it/s]
################################################
OOM during training step, skipping batch 2/3
################################################
test:   2%|2         | 1/50 [03:21<1:23:01, 101.66s/it]
################################################
OOM during training step, skipping batch 3/3
################################################
test:   4%|4         | 2/50 [04:28<1:05:10, 81.47s/it]
RuntimeError: OOM during training step 3 times in a row, aborting training
test:   4%|4         | 2/50 [13:07<5:14:58, 393.72s/it]
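For what it's worth, the final RuntimeError is just a guard that gives up after several consecutive OOM batches; the underlying problem is still that a single I2V step at 512x512 doesn't fit. A rough sketch of that kind of guard as a generic pattern, not the actual ai-toolkit implementation (`run_single_step` is a hypothetical stand-in):

```python
import torch

MAX_CONSECUTIVE_OOM = 3  # assumed limit, matching the "3 times in a row" message

def training_loop(batches, run_single_step):
    """Skip OOM batches, but abort once too many fail in a row."""
    oom_streak = 0
    for batch in batches:
        try:
            run_single_step(batch)
            oom_streak = 0  # any successful step resets the counter
        except torch.cuda.OutOfMemoryError:
            oom_streak += 1
            torch.cuda.empty_cache()  # release cached blocks before trying the next batch
            print(f"OOM during training step, skipping batch {oom_streak}/{MAX_CONSECUTIVE_OOM}")
            if oom_streak >= MAX_CONSECUTIVE_OOM:
                raise RuntimeError(
                    f"OOM during training step {MAX_CONSECUTIVE_OOM} times in a row, aborting training"
                )
```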