sd-webui-text2video
[Bug]: Tensor size mismatch when trying to generate video of different size
Is there an existing issue for this?
- [X] I have searched the existing issues and checked the recent builds/commits of both this extension and the webui
Are you using the latest version of the extension?
- [X] I have the modelscope text2video extension updated to the latest version and I still have the issue.
What happened?
I tried generating a video at 384x216 (a 16:9 aspect ratio) with my custom-trained, converted model. However, I get the following error:
DDIM sampling: 0%| | 0/50 [00:00<?, ?it/s]
Traceback (most recent call last): | 0/50 [00:00<?, ?it/s]
File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/t2v_helpers/render.py", line 27, in run
vids_pack = process_modelscope(args_dict)
File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/process_modelscope.py", line 209, in process_modelscope
samples, _ = pipe.infer(args.prompt, args.n_prompt, args.steps, args.frames, args.seed + batch if args.seed != -1 else -1, args.cfg_scale,
File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/t2v_pipeline.py", line 258, in infer
x0 = self.diffusion.ddim_sample_loop(
File "/home/ubuntu/text2vid/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/t2v_model.py", line 1485, in ddim_sample_loop
xt = self.ddim_sample(xt, t, model, model_kwargs, clamp,
File "/home/ubuntu/text2vid/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/t2v_model.py", line 1334, in ddim_sample
_, _, _, x0 = self.p_mean_variance(xt, t, model, model_kwargs, clamp,
File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/t2v_model.py", line 1275, in p_mean_variance
y_out = model(xt, self._scale_timesteps(t), **model_kwargs[0])
File "/home/ubuntu/text2vid/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/t2v_model.py", line 380, in forward
x = torch.cat([x, xs.pop()], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 8 but got size 7 for tensor number 1 in the list.
Exception occurred: Sizes of tensors must match except in dimension 1. Expected size 8 but got size 7 for tensor number 1 in the list.
This occurs even when using the original model.
Steps to reproduce the problem
- Go to the UI
- Try generating a video with width = 384 and height = 216
What should have happened?
It should generate a video with the requested dimensions.
WebUI and Deforum extension Commit IDs
webui commit id - baf6946e06249c5af9851c60171692c44ef633e0
txt2vid commit id - a44078d1cc6a75f619037a63f3e26a483965b826
Torch version
2.0.1+cu118
What GPU were you using for launching?
NVIDIA A10G - 24GB
On which platform are you launching the webui backend with the extension?
Cloud server (Linux)
Settings
Console logs
################################################################
Install script for stable-diffusion + Web UI
Tested on Debian 11 (Bullseye)
################################################################
################################################################
Running on ubuntu user
################################################################
################################################################
Repo already cloned, using it as install directory
################################################################
################################################################
python venv already activate: /home/ubuntu/text2vid/stable-diffusion-webui/venv
################################################################
################################################################
Launching launch.py...
################################################################
Using TCMalloc: libtcmalloc.so.4
Python 3.10.9 (main, Mar 1 2023, 18:23:06) [GCC 11.2.0]
Version: v1.3.2
Commit hash: baf6946e06249c5af9851c60171692c44ef633e0
Installing requirements
Launching Web UI with arguments: --listen
No module 'xformers'. Proceeding without it.
Loading weights [6ce0161689] from /home/ubuntu/text2vid/stable-diffusion-webui/models/Stable-diffusion/v1-5-pruned-emaonly.safetensors
Creating model from config: /home/ubuntu/text2vid/stable-diffusion-webui/configs/v1-inference.yaml
LatentDiffusion: Running in eps-prediction mode
Running on local URL: http://0.0.0.0:7860
To create a public link, set `share=True` in `launch()`.
Startup time: 4.4s (import torch: 0.9s, import gradio: 0.9s, import ldm: 0.4s, other imports: 0.8s, load scripts: 0.5s, create ui: 0.6s, gradio launch: 0.1s).
DiffusionWrapper has 859.52 M params.
Applying optimization: Doggettx... done.
Textual inversion embeddings loaded(0):
Model loaded in 1.7s (load weights from disk: 0.2s, create model: 0.9s, apply weights to model: 0.2s, apply half(): 0.1s, move model to device: 0.2s).
text2video — The model selected is: ModelScope
text2video extension for auto1111 webui
Git commit: a44078d1
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
0%| | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': 'Blonde woman walking in a forest, dense foliage, pink leaves', 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 40, 'seed': 3586594887, 'scale': 17, 'width': 384, 'height': 216, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 0}
latents torch.Size([1, 4, 40, 27, 48]) tensor(-0.0010, device='cuda:0') tensor(0.9960, device='cuda:0')
DDIM sampling: 0%| | 0/31 [00:00<?, ?it/s]
Traceback (most recent call last): | 0/31 [00:00<?, ?it/s]
File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/t2v_helpers/render.py", line 27, in run
vids_pack = process_modelscope(args_dict)
File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/process_modelscope.py", line 209, in process_modelscope
samples, _ = pipe.infer(args.prompt, args.n_prompt, args.steps, args.frames, args.seed + batch if args.seed != -1 else -1, args.cfg_scale,
File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/t2v_pipeline.py", line 258, in infer
x0 = self.diffusion.ddim_sample_loop(
File "/home/ubuntu/text2vid/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/t2v_model.py", line 1485, in ddim_sample_loop
xt = self.ddim_sample(xt, t, model, model_kwargs, clamp,
File "/home/ubuntu/text2vid/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/t2v_model.py", line 1334, in ddim_sample
_, _, _, x0 = self.p_mean_variance(xt, t, model, model_kwargs, clamp,
File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/t2v_model.py", line 1275, in p_mean_variance
y_out = model(xt, self._scale_timesteps(t), **model_kwargs[0])
File "/home/ubuntu/text2vid/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/t2v_model.py", line 380, in forward
x = torch.cat([x, xs.pop()], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 8 but got size 7 for tensor number 1 in the list.
Exception occurred: Sizes of tensors must match except in dimension 1. Expected size 8 but got size 7 for tensor number 1 in the list.
Additional information
No response
I don't think this is a bug; this is how SD worked before. The problem is that your height produces an odd latent size, in this instance 27 (216 / 8), which isn't divisible by 4, so the UNet's skip connections stop lining up after downsampling. Best to use the slider to pick a resolution close to what you need and either crop or squeeze the result afterwards. I'm not sure what was changed in SD to support odd sizes, or when exactly that change was made.
e.g. try to make a 720-wide video (a minimal repro is sketched after the log below):
Working in txt2vid mode
0%| | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': '', 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 24, 'seed': 2563507479, 'scale': 17, 'width': 720, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 0}
latents torch.Size([1, 4, 24, 32, 90]) tensor(-0.0008, device='cuda:0') tensor(0.9997, device='cuda:0')
DDIM sampling: 0%| | 0/31 [00:00<?, ?it/s]
Traceback (most recent call last): | 0/31 [00:00<?, ?it/s]
File "D:\NasD\stable-diffusion-webui/extensions/sd-webui-modelscope-text2video/scripts\t2v_helpers\render.py", line 24, in run
vids_pack = process_modelscope(args_dict)
File "D:\NasD\stable-diffusion-webui/extensions/sd-webui-modelscope-text2video/scripts\modelscope\process_modelscope.py", line 205, in process_modelscope
samples, _ = pipe.infer(args.prompt, args.n_prompt, args.steps, args.frames, args.seed + batch if args.seed != -1 else -1, args.cfg_scale,
File "D:\NasD\stable-diffusion-webui/extensions/sd-webui-modelscope-text2video/scripts\modelscope\t2v_pipeline.py", line 253, in infer
x0 = self.diffusion.ddim_sample_loop(
File "D:\NasD\stable-diffusion-webui\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "D:\NasD\stable-diffusion-webui\extensions\sd-webui-modelscope-text2video\scripts\modelscope\t2v_model.py", line 1475, in ddim_sample_loop
xt = self.ddim_sample(xt, t, model, model_kwargs, clamp,
File "D:\NasD\stable-diffusion-webui\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "D:\NasD\stable-diffusion-webui\extensions\sd-webui-modelscope-text2video\scripts\modelscope\t2v_model.py", line 1324, in ddim_sample
_, _, _, x0 = self.p_mean_variance(xt, t, model, model_kwargs, clamp,
File "D:\NasD\stable-diffusion-webui\extensions\sd-webui-modelscope-text2video\scripts\modelscope\t2v_model.py", line 1265, in p_mean_variance
y_out = model(xt, self._scale_timesteps(t), **model_kwargs[0])
File "D:\NasD\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\NasD\stable-diffusion-webui\extensions\sd-webui-modelscope-text2video\scripts\modelscope\t2v_model.py", line 380, in forward
x = torch.cat([x, xs.pop()], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 24 but got size 23 for tensor number 1 in the list.
Exception occurred: Sizes of tensors must match except in dimension 1. Expected size 24 but got size 23 for tensor number 1 in the list.
The latent width is now 90, which breaks partway down the UNet (one halving gives 45, an odd number), and so on for other sizes.
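To make the failure mode concrete, here is a minimal sketch of the skip-connection mismatch. It assumes stride-2 convolutions for downsampling and nearest-neighbor 2x upsampling, as in a standard UNet; this is an illustration, not the actual code from t2v_model.py:

```python
import torch
import torch.nn as nn

# A 216 px edge becomes a 27-latent edge (216 / 8) -- an odd number.
x = torch.randn(1, 4, 27, 27)

down = nn.Conv2d(4, 4, kernel_size=3, stride=2, padding=1)
up = nn.Upsample(scale_factor=2, mode="nearest")

skips = []
for _ in range(3):      # three downsampling stages (unet_dim_mult [1, 2, 4, 4])
    skips.append(x)
    x = down(x)         # spatial edge: 27 -> 14 -> 7 -> 4

x = up(x)               # 4 -> 8 on the way back up, but the saved skip is 7
x = torch.cat([x, skips.pop()], dim=1)
# RuntimeError: Sizes of tensors must match except in dimension 1.
# Expected size 8 but got size 7 for tensor number 1 in the list.
```

An edge of 27 downsamples to 14, 7, then 4 (the stride-2 convolution rounds up), so the first upsample produces 8 where the stored skip tensor has 7: exactly the sizes reported in the error above.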
Hey, looks like you were right. It does work in regular SD though, so I'll check out what changed there and try to implement it in the extension as well.
tl;dr for anyone facing this issue: Make sure your resolutions are divisible by 32
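If you want to guard against this programmatically, here is a small sketch (the helper name is mine, not part of the extension) that snaps a requested dimension to the nearest multiple of 32 before generating:

```python
def snap_to_multiple(px: int, base: int = 32) -> int:
    """Round a pixel dimension to the nearest multiple of `base` (minimum `base`)."""
    return max(base, int(px / base + 0.5) * base)

print(snap_to_multiple(384))  # 384 -- already a multiple of 32, unchanged
print(snap_to_multiple(216))  # 224 -- so request 384x224, then crop or squeeze to 16:9
```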