Wan LoRA or VACE LoRA?
I want to use VACE for video stylization: add dynamic elements to the video while largely preserving the structural features of the original. If I use depth or edge information for control, the model needs to generate elements like floating flowers or falling leaves in areas without control signals. I tried this and found it unstable: sometimes it generates such elements, and sometimes it doesn't. Given this, should I train a VACE LoRA or a LoRA for Wan itself?
I don't think you need to train a LoRA for what you're trying to achieve. How to do it depends on the scenario, though.
- If you just want those flowers and leaves as visual effects, and don't need them to blend into the scene, use traditional editing software and add them as separate layers.
- If you have an existing video that you want to edit, or you don't mind first generating a version without the effects, you can use the inpainting feature of VACE and just pass the whole video in. Do note that in this case you'll need to prompt it with an accurate description of the scene plus the falling flowers and leaves (write it as one scene description, not an editing command like you would give an image editing model), or else the scene will change to reflect the inaccurate prompt.
- If you want to do it in one pass and don't care too much about the details of the control, i.e. the control is just there to guide the structure of the scene, not the specific details of each element in it, you can do it this way: first run VACE with the prompt and control for maybe half or one fourth of the targeted steps (for 20 steps that would be 5 to 10), then switch to Wan only and do the rest of the steps with the prompt alone (see the sketch after this list). You might need to tweak how many steps you run with VACE to get a stable enough scene composition while still allowing Wan to add new elements into the scene. And if you choose this route, Do Not Use Any Fast LoRAs, as those LoRAs are too rigid for this kind of thing. Alternatively, you can try creating a control signal for the leaves and flowers and combining it with your existing one (just stack the two control signals on top of each other), and VACE will be able to do it in one go.
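If it helps, here is a minimal sketch of that staged loop. The `model` and `scheduler` objects and their call signatures are stand-ins for whatever your pipeline exposes (the `scheduler.step(...).prev_sample` convention is borrowed from diffusers-style schedulers), not the actual Wan/VACE API:

```python
def staged_denoise(latents, control, prompt_embeds, model, scheduler,
                   num_steps=20, control_steps=5):
    """Run the first few steps with the VACE control active, then drop
    it and let Wan finish on the prompt alone.

    `model` and `scheduler` are hypothetical placeholders for whatever
    your pipeline exposes; the call signatures are assumptions, not the
    actual Wan/VACE API.
    """
    scheduler.set_timesteps(num_steps)
    for i, t in enumerate(scheduler.timesteps):
        # Early steps: control locks in the scene composition.
        # Later steps: prompt only, so Wan is free to add new elements.
        active_control = control if i < control_steps else None
        noise_pred = model(latents, t, prompt_embeds, control=active_control)
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```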
So basically VACE works like the old ControlNets: a separate network that injects conditioning signals into the main model, Wan in this case. Since you gave it a control signal, it will try to steer Wan to follow that signal as closely as possible. If your control signal (i.e. your depth map or edge map) contains no information in some region, which for most people means the region is simply empty, then asking the model to generate things where there is no control signal will be unstable, and is probably not a recommended usage.
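In code terms, the injection boils down to something roughly like this. This is a conceptual sketch only; the exact injection point and tensor shapes vary by implementation, and `vace_hint` is a hypothetical name for the control branch's per-block output:

```python
import torch

def inject_control(hidden: torch.Tensor, vace_hint: torch.Tensor,
                   context_scale: float = 1.0) -> torch.Tensor:
    # The control branch emits a hint tensor per block; the main model
    # adds it to its hidden states, scaled by a strength factor.
    # At 0 you get pure Wan; at 1 the control applies at full strength.
    return hidden + context_scale * vace_hint
```

This is also why generation is unstable in empty regions: there the hint carries no structure for the main model to follow.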
Thanks for your reply. If, during the model's forward pass, I set `context_scale` to 0 or a relatively small value, that is equivalent to letting the Wan model exert more of its own generation capability. Is this understanding correct? Following your suggestion, I used a control strength of 1 in the earlier steps and reduced it, or set it directly to 0, in the later steps. However, with this setup the generated result has an issue: the contrast of the first few frames of the video is very strange and looks abnormal. Could you explain what might be causing this?
- Since I mainly use ComfyUI, I am not entirely sure what `context_scale` does in this repo. But from a quick glance at the code, I believe it does control the strength of the control signals, so setting it to 1 at first and lowering it to 0 later should do what you're trying to achieve.
- I'm not sure whether you're extending a video or generating one from control signals, but what you've encountered (weird contrast or blurry output on the first few frames) has been reported before. I'm not aware of any solutions or particular tweaks that remedy it; I mostly hit this on video extension, and there's currently no standard fix (I simply discard those frames, since they usually already exist in the clip being extended). You might need to play with the parameters a bit to see if any of them fix your problem.
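If you do experiment, one knob worth trying (purely a suggestion, not something I've verified against this issue) is ramping the strength down gradually instead of cutting it off abruptly. A minimal sketch, assuming your pipeline lets you set `context_scale` per step:

```python
def context_scale_schedule(step, num_steps, hold=0.25):
    """Hypothetical helper: hold full strength for the first `hold`
    fraction of steps, then ramp linearly down to 0 instead of
    switching off abruptly.
    """
    cutoff = int(num_steps * hold)
    if step < cutoff:
        return 1.0
    remaining = num_steps - cutoff
    return max(0.0, 1.0 - (step - cutoff + 1) / remaining)
```

Whether a smoother ramp actually helps with the contrast on the first frames is something you'd have to test.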