
Add support for CogVideoX following ComfyUI standards

glide-the opened this issue 1 year ago • 4 comments

Refactor CogVideoXWrapper to ComfyUI standards

GitHub: https://github.com/THUDM/CogVideo

@article{yang2024cogvideox,
  title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
  author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
  journal={arXiv preprint arXiv:2408.06072},
  year={2024}
}
@article{hong2022cogvideo,
  title={CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers},
  author={Hong, Wenyi and Ding, Ming and Zheng, Wendi and Liu, Xinghan and Tang, Jie},
  journal={arXiv preprint arXiv:2205.15868},
  year={2022}
}

Here's an updated commit message that includes all the new additions:

  • Introduced foundational support for 3D image operations (T, W, H, C) in ComfyUI.

  • Added CogVideoX Latents encoding and decoding nodes.

  • Enhanced compatibility for T5-related CLIP operations, integrating them into the existing ComfyUI CLIP workflow.

  • Refactored CogVideoXPipeline operations to align with standard latents processing.

  • Updated node class mappings and implemented necessary changes for seamless integration.

  • Refactored nodes to ensure compatibility with ComfyUI standards, including:

    • CogVideoModelLoader
    • CogVideoPipeExtra
    • CogVideoEncodePrompt
    • CogVideoImageEncodeSampler
    • CogVideoSamplerDecodeImages
    • CogVideoProcessor
  • Redesigned code nodes based on kijai/ComfyUI-CogVideoXWrapper to adhere to ComfyUI standards. @kijai
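For context on the node refactor above: ComfyUI discovers custom nodes through a module-level `NODE_CLASS_MAPPINGS` dict, and each node class exposes `INPUT_TYPES`, `RETURN_TYPES`, and `FUNCTION`. The following is a minimal sketch of how one of the listed nodes might be registered; the class body and its `model_path` parameter are illustrative placeholders, not the actual implementation from this PR.

```python
# Minimal sketch of ComfyUI node registration.
# Only the CogVideoModelLoader name comes from the PR; the body is hypothetical.

class CogVideoModelLoader:
    """Loads CogVideoX weights and returns a model handle."""

    @classmethod
    def INPUT_TYPES(cls):
        # ComfyUI introspects this dict to build the node's input widgets.
        return {"required": {"model_path": ("STRING", {"default": "THUDM/CogVideoX-2b"})}}

    RETURN_TYPES = ("MODEL",)
    FUNCTION = "load"
    CATEGORY = "CogVideo"

    def load(self, model_path):
        # Real code would load the 3D transformer / VAE weights here.
        return ({"path": model_path},)

# ComfyUI scans custom_nodes modules for these two mappings.
NODE_CLASS_MAPPINGS = {"CogVideoModelLoader": CogVideoModelLoader}
NODE_DISPLAY_NAME_MAPPINGS = {"CogVideoModelLoader": "CogVideo Model Loader"}
```

The same registration pattern applies to the other nodes in the list (`CogVideoEncodePrompt`, `CogVideoSamplerDecodeImages`, and so on).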

workflow.json


  • Introduced the denoise_strength parameter, ranging from 0.1 to 1.0, to control the fidelity and diversity of generated images in CogVideoSampler.
  • A denoise_strength value of 0.9 is recommended for better alignment with video content, balancing detail preservation and creative flexibility.
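In diffusers-style video-to-video sampling, a strength parameter of this kind typically controls how far into the noise schedule the source latents are pushed, and therefore how many denoising steps are actually run. A minimal sketch under that assumption follows; the helper name is hypothetical, and the linear blend stands in for the scheduler's `add_noise` call.

```python
import torch

def prepare_v2v_latents(video_latents, num_inference_steps, denoise_strength,
                        generator=None):
    """Hypothetical helper: pick the start step and initial latent for
    video-to-video sampling, in the style of diffusers img2img pipelines."""
    # Higher strength -> more denoising steps run -> more deviation from the source.
    init_steps = min(int(num_inference_steps * denoise_strength), num_inference_steps)
    t_start = num_inference_steps - init_steps

    noise = torch.randn(video_latents.shape, generator=generator)
    # Simple linear blend as a stand-in for scheduler.add_noise(latents, noise, t).
    noised = (1.0 - denoise_strength) * video_latents + denoise_strength * noise
    return noised, t_start
```

With the recommended `denoise_strength=0.9` and 50 inference steps, this would skip the first 5 steps and run the remaining 45, which is what preserves coarse structure from the source video while still allowing creative deviation.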

workflow_v2v.json

feat: Implement image-to-video latents using a noise-interpolation technique

  • Applied the noise interpolation method from the paper "Improving Image Fidelity in Image-to-Video Generation using Noise Interpolation" (https://arxiv.org/pdf/2403.02827.pdf) to achieve high source image fidelity in I2V generation.
  • Integrated the technique into our pipeline to enhance source image information during early denoising steps, leading to improved video output quality.
  • Referenced and adapted the approach from the implementation available at https://noise-rectification.github.io/. workflow_i2v.json
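The core of the noise-interpolation idea is a convex combination of freshly sampled noise with noise carrying source-image information, applied only during the early denoising steps. The sketch below illustrates that mechanism; the function name, the `switch_frac` cutoff, and the `alpha` weight are illustrative placeholders, not the paper's exact schedule.

```python
import torch

def interpolate_noise(random_noise, image_noise, step, total_steps,
                      switch_frac=0.4, alpha=0.6):
    """Hypothetical sketch of the noise-interpolation idea (arXiv:2403.02827):
    blend random noise with noise rectified toward the source image during
    the first `switch_frac` of denoising steps, then use plain noise."""
    if step < int(total_steps * switch_frac):
        # Convex combination keeps early latents consistent with the source image.
        return alpha * image_noise + (1.0 - alpha) * random_noise
    return random_noise
```

Injecting source-image information only early on is the key design point: early steps fix global layout and appearance, while later steps are left unconstrained so motion can develop freely.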


glide-the avatar Aug 29 '24 12:08 glide-the

I think this should probably be a PR on the CogVideo nodes, as this is a diffusers wrapper. I'm not sure what you mean by Comfy standards in this case.

melMass avatar Aug 31 '24 14:08 melMass

While I appreciate any efforts to contribute and improve upon code I've worked on, I have to say I don't understand the point of this PR. From my point of view, many of the changes to my code move it further away from Comfy standards, for example reverting back to the diffusers T5, while my way (not saying it's perfect) of using ComfyUI's native T5 has proven to work effectively. Or does it have some issues I'm not aware of?

The main thing however is that this still relies too much on Diffusers to be merged into ComfyUI anyway. To properly implement CogVideoX to ComfyUI would include support for the VAE and support for using ComfyUI sampling instead of Diffusers, the latter being something I personally still don't have enough knowledge to do.

kijai avatar Aug 31 '24 15:08 kijai

probably

Thank you for the feedback, and I understand the concerns raised. You are correct that the 3DVAE and 3DTransformer components should be rewritten to fully integrate with the ComfyUI ecosystem rather than relying on the existing implementations from the Diffusers community. This will ensure compatibility and maintain the integrity of the project.

Regarding the T5 encoder, I acknowledge that I added some additional wrappers, but it currently supports native local functionality. As for the scheduler, you're right—it was mostly a wrapper with no significant modifications.

I appreciate your input, and I will address all these issues. I will rewrite the necessary components (scheduler, 3DVAE, 3DTransformer) to align with the project's standards. This PR can be treated as a draft revision rather than being merged at this stage.

Thank you for your understanding, and I'll work on the necessary changes.

glide-the avatar Aug 31 '24 15:08 glide-the


That's fine, but I'm pretty sure Comfy will never merge anything that depends on diffusers, so I think all work towards a PR should go towards implementing everything natively.

kijai avatar Aug 31 '24 16:08 kijai