The size of tensor a (36) must match the size of tensor b (16) at non-singleton dimension 1
Hi, I'm trying to train a low-noise Wan LoRA locally on a 5090 using images
I'm getting an error on sampling
RuntimeError: The size of tensor a (36) must match the size of tensor b (16) at non-singleton dimension 1
Anyone know what to do?
I've had the same issue for several days. I have redownloaded the models from a different PC. Windows 11 / Arch Linux; 3090 and 5090! The result is the same: tensor mismatch. Please help, I really don't understand what is going on. I'm using the same settings as in the video tutorial.
bypass image sampling for now and Lora will train fine
I got the same error with default(ish) settings. Bypassing image sampling got past the error as well. (Also on a 5090.)
Nice! Did a pull this morning and now it's generating sample images again! Thanks for the quick turnaround! (Training up some Wan 2.2 Loras now using it - trying to work out which models to train against - Wan 2.2 14B or Wan 2.2 I2V 14B, especially now the FUN controlnet models are out)
Did you try training an I2V character LoRA from an image dataset? My I2V LoRA is ignored in ComfyUI... I'll try with T2V today.
This is my second attempt at using Loras, first time with Wan 2.2, my workflow could be broken, but...
- AI Toolkit "Wan 2.2 14B" with 14B T2V generated an image using my Lora (yay!)
- AI Toolkit "Wan 2.2 14B" with 14B I2V failed to follow the Lora
- AI Toolkit "Wan 2.2 14B" with 14B FUN Control failed to follow the Lora (but not sure if I need to train Lora against FUN Control?)
- AI Toolkit "Wan 2.2 14B (I2V)" failed with all models I tried
So the only combination I have succeeded with is T2V using the AI Toolkit Wan 2.2 14B model - so far nothing else has worked for me.
(PS: unrelated to this bug, but why am I doing the above? I was planning on using I2V to generate videos and was hoping Loras would improve consistency. I was also trying to see if I can position characters in a scene using Wan 2.2, so I can use the same Loras to create the initial image to feed into I2V, e.g. using dwpose to put characters in front of a background image I provide. I want consistent backgrounds between shots as well as consistent characters in the shots - I could use Flux, but was trying to do it all in one model to avoid messing around. Going to hunt around for other people's workflows I can download and try in case I have settings wrong.)
UPDATE: Setting the high weight to 3 and low weight to 1.5, I did see that FUN ControlNet listened to my Lora trained on "Wan 2.2 14B", kind of. The result was not useful, but it was clearly from my trained Lora. Trying the ComfyUI supplied workflow, again it was doing something but the quality was terrible. Not sure if due to reference image or Lora. Tried with and without Lightx Loras - no difference.
Same as OP, but using Runpod to train high and low with an RTX 6000. On advice here, X'ed out the samples. After the first checkpoint, I paused to add a single sample with attached image (did not attach images before). After a few moments, training has resumed. First video as expected looks bad at step 500, but at least it is training! Yay!
What’s happening (in plain English)
For Wan I2V, the latents have:
- 16 channels of “normal” latents
- extra conditioning channels (first-frame stuff), for a total of 36 channels
In wan22_pipeline.py, they split your 36-channel latents into:
- latents → the first 16 channels
- conditioning → the remaining channels
Then they call the model with [latents, conditioning] concatenated (36 channels) ✅
But they call the scheduler step with:
- sample = latents (16 channels)
- model_output = noise_pred (36 channels)
That hits x0_pred = sample - sigma_t * model_output in the scheduler → boom:
The size of tensor a (36) must match the size of tensor b (16)
So: model output and latents don’t match, and PyTorch complains. This matches the open/closed issues on the repo for Wan 2.2 I2V.
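To see the mismatch in isolation, here is a standalone sketch with dummy tensors (hypothetical shapes, and a plain x0-style update standing in for the scheduler's internal math), not the actual pipeline code:
import torch

# Standalone reproduction of the mismatch described above (hypothetical shapes).
sigma_t = 0.5
noise_pred = torch.randn(1, 36, 1, 8, 8)  # model output with conditioning channels still attached
latents = torch.randn(1, 16, 1, 8, 8)     # the sample the scheduler actually tracks

x0_pred = latents - sigma_t * noise_pred
# -> RuntimeError: the 16- and 36-channel tensors cannot be broadcast
#    at non-singleton dimension 1, i.e. the error reported in this issue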
Open:
ai-toolkit\extensions_built_in\diffusion_models\wan22\wan22_pipeline.py
Find the denoising loop where this line exists (near the end of __call__):
latents = self.scheduler.step(noise_pred, t, latents, return_dict=False)[0]
Change it so that when conditioning is used (i.e. I2V / 36-channel case), you only feed the latent part (first 16 channels) into the scheduler:
# after noise_pred is fully computed (and CFG applied), before scheduler.step:
if conditioning is not None:
    # keep only the part corresponding to the 16 latent channels
    noise_pred = noise_pred[:, :latents.shape[1], ...]
latents = self.scheduler.step(noise_pred, t, latents, return_dict=False)[0]
That way:
- noise_pred's shape matches latents (16 channels)
- the extra conditioning channels are only used inside the transformer, not in the scheduler math
The runtime error should disappear.
Save the file and rerun your job.
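If you want to sanity-check the idea outside the pipeline first, here is a self-contained sketch with dummy tensors (hypothetical shapes, and a plain Euler-style update in place of the real scheduler.step):
import torch

# Dummy stand-ins for the pipeline's variables (hypothetical shapes)
latents = torch.randn(1, 16, 1, 8, 8)       # what the scheduler should be stepping
conditioning = torch.randn(1, 20, 1, 8, 8)  # extra I2V conditioning channels
noise_pred = torch.randn(1, 36, 1, 8, 8)    # model output as described above
sigma_t = 0.5

if conditioning is not None:
    # keep only the channels that correspond to the tracked latents
    noise_pred = noise_pred[:, :latents.shape[1], ...]

# stand-in for scheduler.step(): a plain Euler-style update
latents = latents - sigma_t * noise_pred
print(latents.shape)  # torch.Size([1, 16, 1, 8, 8]) -- shapes now agree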
Hi @rrademacher, I tried these changes, but I get the same error.
Same here, I've tried those changes and still get the same problem. It seems the I2V model needs conditioning for both the image and the text, 16 channels each, 32 total, but the training code for some reason does not pick up this configuration properly.
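One way to check what channel count the loaded model actually expects (a hedged sketch; it assumes a diffusers-style transformer exposing config.in_channels, which may not be how ai-toolkit's wan22 extension stores it, and the variable names are placeholders):
import torch

def check_channel_match(transformer, latent_model_input: torch.Tensor) -> None:
    # `transformer` is whatever model object the trainer loaded; `.config.in_channels`
    # is a diffusers convention and may be named differently in ai-toolkit's wan22 code.
    expected = getattr(getattr(transformer, "config", None), "in_channels", None)
    provided = latent_model_input.shape[1]
    print(f"model expects {expected} input channels, latents provide {provided}")

# usage inside the sampling/training code, with the real objects:
#   check_channel_match(transformer, latent_model_input)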
Hey, I just got this as well on an A6000 Pro trying Wan 2.2. I used the default template on RunPod that you created. Any idea why this is happening?
I'll try skipping the samples.. I'm only training images.
Seems to run if you disable sampling. But that's less than ideal.