The size of tensor a (36) must match the size of tensor b (16) at non-singleton dimension 1
Hi, I'm trying to train a low-noise Wan LoRA locally on a 5090 using images
I'm getting an error on sampling
RuntimeError: The size of tensor a (36) must match the size of tensor b (16) at non-singleton dimension 1
Anyone know what to do?
I've had the same issue for several days. I have redownloaded the models from a different PC. Windows 11 / Arch Linux; 3090 and 5090! The result is the same: tensor mismatch. Please help, I really don't understand what is going on. I'm using the same settings as in the video tutorial.
bypass image sampling for now and Lora will train fine
I got the same error with default(ish) settings. Bypassing image sampling got past the error as well. (Also on a 5090.)
Nice! Did a pull this morning and now it's generating sample images again! Thanks for the quick turnaround! (Training up some Wan 2.2 Loras now using it - trying to work out which models to train against - Wan 2.2 14B or Wan 2.2 I2V 14B, especially now the FUN controlnet models are out)
Did you try training an I2V character LoRA from an image dataset? My I2V LoRA is ignored in ComfyUI... I'll try with T2V today.
This is my second attempt at using Loras, first time with Wan 2.2, my workflow could be broken, but...
- AI Toolkit "Wan 2.2 14B" with 14B T2V generated an image using my Lora (yay!)
- AI Toolkit "Wan 2.2 14B" with 14B I2V failed to follow the Lora
- AI Toolkit "Wan 2.2 14B" with 14B FUN Control failed to follow the Lora (but not sure if I need to train Lora against FUN Control?)
- AI Toolkit "Wan 2.2 14B (I2V)" failed with all models I tried
So the only combination I have succeeded with is T2V using the AI Toolkit Wan 2.2 14B model - so far nothing else has worked for me.
(PS: unrelated to this bug, but why am I doing the above? I was planning on using I2V to generate videos and was hoping Loras would improve consistency. I was also trying to see if I can position characters in a scene using Wan 2.2, so I can use the same Loras to create the initial image to feed into I2V, e.g. using dwpose to put characters in front of a background image I provide. I want consistent backgrounds between shots as well as consistent characters in the shots - I could use Flux, but was trying to do it all in one model to avoid messing around. Going to hunt around for other people's workflows I can download and try in case I have settings wrong.)
UPDATE: Setting the high weight to 3 and low weight to 1.5, I did see that FUN ControlNet listened to my Lora trained on "Wan 2.2 14B", kind of. The result was not useful, but it was clearly from my trained Lora. Trying the ComfyUI supplied workflow, again it was doing something but the quality was terrible. Not sure if due to reference image or Lora. Tried with and without Lightx Loras - no difference.
Same as OP, but using Runpod to train high and low with an RTX 6000. On advice here, X'ed out the samples. After the first checkpoint, I paused to add a single sample with attached image (did not attach images before). After a few moments, training has resumed. First video as expected looks bad at step 500, but at least it is training! Yay!
What’s happening (in plain English)
For Wan I2V, the latents have:
- 16 channels of “normal” latents
- extra conditioning channels (first-frame stuff), for a total of 36 channels
In wan22_pipeline.py, they split your 36-channel latents into:
- latents → the first 16 channels
- conditioning → the remaining channels
Then they call the model with [latents, conditioning] concatenated (36 channels) ✅
But they call the scheduler step with:
- sample = latents (16 channels)
- model_output = noise_pred (36 channels)
That hits x0_pred = sample - sigma_t * model_output in the scheduler → boom:
The size of tensor a (36) must match the size of tensor b (16)
So: model output and latents don’t match, and PyTorch complains. This matches the open/closed issues on the repo for Wan 2.2 I2V.
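To see the mismatch in isolation, here is a standalone sketch with dummy tensors (hypothetical shapes, and a plain x0-style update standing in for the scheduler's internal math), not the actual pipeline code:
import torch

# Standalone reproduction of the mismatch described above (hypothetical shapes).
sigma_t = 0.5
noise_pred = torch.randn(1, 36, 1, 8, 8)  # model output with conditioning channels still attached
latents = torch.randn(1, 16, 1, 8, 8)     # the sample the scheduler actually tracks

x0_pred = latents - sigma_t * noise_pred
# -> RuntimeError: the 16- and 36-channel tensors cannot be broadcast
#    at non-singleton dimension 1, i.e. the error reported in this issue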
Open:
ai-toolkit\extensions_built_in\diffusion_models\wan22\wan22_pipeline.py
Find the denoising loop where this line exists (near the end of __call__):
latents = self.scheduler.step(noise_pred, t, latents, return_dict=False)[0]
Change it so that when conditioning is used (i.e. I2V / 36-channel case), you only feed the latent part (first 16 channels) into the scheduler:
# after noise_pred is fully computed (and CFG applied), before scheduler.step:
if conditioning is not None:
    # keep only the part corresponding to the 16 latent channels
    noise_pred = noise_pred[:, :latents.shape[1], ...]
latents = self.scheduler.step(noise_pred, t, latents, return_dict=False)[0]
That way:
- noise_pred's shape matches latents (16 channels)
- the extra conditioning channels are only used inside the transformer, not in the scheduler math
The runtime error should disappear.
Save the file and rerun your job.
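If you want to sanity-check the idea outside the pipeline first, here is a self-contained sketch with dummy tensors (hypothetical shapes, and a plain Euler-style update in place of the real scheduler.step):
import torch

# Dummy stand-ins for the pipeline's variables (hypothetical shapes)
latents = torch.randn(1, 16, 1, 8, 8)       # what the scheduler should be stepping
conditioning = torch.randn(1, 20, 1, 8, 8)  # extra I2V conditioning channels
noise_pred = torch.randn(1, 36, 1, 8, 8)    # model output as described above
sigma_t = 0.5

if conditioning is not None:
    # keep only the channels that correspond to the tracked latents
    noise_pred = noise_pred[:, :latents.shape[1], ...]

# stand-in for scheduler.step(): a plain Euler-style update
latents = latents - sigma_t * noise_pred
print(latents.shape)  # torch.Size([1, 16, 1, 8, 8]) -- shapes now agree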
Hi @rrademacher, I tried these changes, but I get the same error.
Same here, I've tried those changes and still get the same problem. It seems the I2V model needs conditioning for both the image and the text, 16 channels each, 32 total, but the training code for some reason does not pick up this configuration properly.
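One way to check what channel count the loaded model actually expects (a hedged sketch; it assumes a diffusers-style transformer exposing config.in_channels, which may not be how ai-toolkit's wan22 extension stores it, and the variable names are placeholders):
import torch

def check_channel_match(transformer, latent_model_input: torch.Tensor) -> None:
    # `transformer` is whatever model object the trainer loaded; `.config.in_channels`
    # is a diffusers convention and may be named differently in ai-toolkit's wan22 code.
    expected = getattr(getattr(transformer, "config", None), "in_channels", None)
    provided = latent_model_input.shape[1]
    print(f"model expects {expected} input channels, latents provide {provided}")

# usage inside the sampling/training code, with the real objects:
#   check_channel_match(transformer, latent_model_input)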
Hey, I just got this as well on an A6000 Pro trying Wan 2.2. I used the default template on RunPod that you created. Any idea why this is happening?
I'll try skipping the samples.. I'm only training images.
Seems to run if you disable sampling. But that's less than ideal.