New Image Edit Model based on Wan
Nvidia published https://huggingface.co/nvidia/ChronoEdit-14B-Diffusers, an image edit model like Qwen Edit. Basically, it's Wan 2.1 I2V with either 2 or 23 frames, where the last frame is the edited image. I think no modification is needed, but maybe an example workflow, as long as the model has correct layer names.
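The idea described above can be sketched in a few lines: run the I2V model for a short clip and keep only the final frame as the edit result. This is a minimal sketch with NumPy; `run_i2v` is a hypothetical stand-in for the actual pipeline call, not a real API.

```python
import numpy as np

# Stand-in for the real Wan 2.1 I2V inference call (hypothetical, for
# illustration only): returns a (num_frames, H, W, 3) clip that starts
# from the given image.
def run_i2v(start_image: np.ndarray, num_frames: int) -> np.ndarray:
    return np.stack([start_image] * num_frames)

def chrono_edit(start_image: np.ndarray, num_frames: int = 5) -> np.ndarray:
    """ChronoEdit-style editing: generate a short clip, keep the last frame."""
    frames = run_i2v(start_image, num_frames)
    return frames[-1]  # the edited image is the final frame of the clip

img = np.zeros((8, 8, 3), dtype=np.uint8)
edited = chrono_edit(img)
print(edited.shape)  # (8, 8, 3)
```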
oh I saw this earlier. Thought it was an image model. Might be worth a try to see if it works for a few frames ;-)
It looks really nice, no image degradation. This is a big plus. I will install later and play with it a little bit.
Seems to work out of the box after converting the model from diffusers, though I have no idea if this is the correct way to use it:
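For anyone curious what "converting the model from diffusers" involves, it usually comes down to remapping state-dict key prefixes. This is only a hypothetical sketch; the prefixes used here (`transformer.` and `model.diffusion_model.`) are assumptions for illustration, not the actual ChronoEdit layer names.

```python
# Hypothetical diffusers -> single-checkpoint key remap. Real conversions
# may also need per-layer renames; this only shows the prefix swap.
DIFFUSERS_PREFIX = "transformer."          # assumed diffusers prefix
TARGET_PREFIX = "model.diffusion_model."   # assumed target prefix

def remap_keys(state_dict: dict) -> dict:
    out = {}
    for key, tensor in state_dict.items():
        if key.startswith(DIFFUSERS_PREFIX):
            key = TARGET_PREFIX + key[len(DIFFUSERS_PREFIX):]
        out[key] = tensor
    return out

sd = {"transformer.blocks.0.attn.q.weight": "dummy-tensor"}
print(remap_keys(sd))
# {'model.diffusion_model.blocks.0.attn.q.weight': 'dummy-tensor'}
```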
@kijai could you try feeding the output back into the input several times with different tweaks, and check whether any degradation actually occurs?
Does changing the scene work? https://research.nvidia.com/labs/toronto-ai/chronoedit/assets/video_examples/21.mp4
And camera control: https://research.nvidia.com/labs/toronto-ai/chronoedit/assets/video_examples/9.mp4
Thought those two use cases looked pretty good. But I'm sure it's cherry-picked as well ;-)
Speaking of camera control, I saw some candy on Kijai's Hugging Face as well ;-)
Don't have time right now to test, but I uploaded the converted models:
fp16 and the distill lora:
https://huggingface.co/Kijai/WanVideo_comfy/blob/main/ChronoEdit/
fp8_scaled:
https://huggingface.co/Kijai/WanVideo_comfy_fp8_scaled/tree/main/ChronoEdit
> Don't have time right now to test, but I uploaded the converted models:
oh nice.. will give it a try ;-)
Cool, thanks, man.
Think it might work as advertised ;-) I could somewhat reproduce some of their examples, but only did a few tests.
> It looks really nice, no image degradation. This is a big plus. I will install later and play with it a little bit.
Got me curious, even though it's an image model... Not sure if it's the best way or not, but I did try a long run with a regular context window. Seems to hold up pretty well... a VACE-ish workflow could be a better test.
https://github.com/user-attachments/assets/65282587-05c1-4278-96f9-67ba99981c3e
(The only thing I noticed was that the movements seem a bit rapid, but I only did a few tests.)
> Seems to work out of the box after converting the model from diffusers, though I have no idea if this is the correct way to use it:
Is there a way to generate only 2 frames? Since frames 2-5 are practically no different, but they take time
> Is there a way to generate only 2 frames? Since frames 2-5 are practically no different, but they take time
Since two is the minimum for I2V, it should be no problem. Just set the frame count to 2 instead of 5.
> Is there a way to generate only 2 frames? Since frames 2-5 are practically no different, but they take time

> Since two is the minimum for I2V, it should be no problem. Just set the frame count to 2 instead of 5.
Is this possible with Kijai nodes? The last time I tried to install them, they didn't support my video card (2000 series). And the ComfyUI base node "WanImageToVideo" only supports 1 or 5 frames. If the Kijai node supports 2 frames, I'll install it; thanks for the help.
@NarutoHokageSaskeUchihaSuperItachiMan
WanVideo ImageToVideo is the same as the native one in regard to the number of frames: 1, 5, 9, etc.
But I think you need 5... sometimes even a few more, if that's what it takes for Wan to make the requested change. Say, for example, making a person turn around for a different camera view. I didn't try it myself, but some YouTube videos testing the model say that.
That being said, on a 2000-series GPU, you could try the much smaller GGUF version to see if that works better for you:
https://huggingface.co/QuantStack/ChronoEdit-14B-GGUF
(For example, Q4 is pretty good, and everything above it is good.)
(And make sure to use the block swap node to adjust usage between VRAM and regular RAM.)
If I remember correctly, the latent in Wan works in 4-frame "blocks", so to speak. A while ago I was playing with the S2V model, which burns the first frames, and you have to work around that by repeating the first latent block and cutting the corresponding frames after VAE decoding. I was curious what the output would be if I decoded only that first block. The result was a 4-frame movie (+ 1 empty frame that made VirtualDub act strange). So the model can't do less than that. Perhaps it can do 1 frame (meaning it outputs the same image as the start image), but the next available number is 1+4.
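The 4-frame block rule above (also reflected in the 1, 5, 9, ... counts mentioned earlier) can be sketched as a small check. This is an assumption based on how the thread describes Wan's temporal compression, not a verified spec of the VAE.

```python
# Frame counts of the form 1 + 4*k: the first frame gets its own latent
# block, and every additional latent block covers 4 more pixel frames.

def is_valid_frame_count(n: int) -> bool:
    """True if n pixel frames align with Wan's assumed 4-frame latent blocks."""
    return n >= 1 and n % 4 == 1

def latent_frames(n: int) -> int:
    """Number of latent frames needed for n pixel frames (assumed rule)."""
    return 1 + (n - 1) // 4

print([n for n in range(1, 14) if is_valid_frame_count(n)])  # [1, 5, 9, 13]
print(latent_frames(5))  # 2
```

Under this rule, asking for 2 frames would round up to the next valid count (5), which matches the "1, 5, 9, etc." limitation of the WanImageToVideo node discussed above.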