Wensong Song
I want to use the Multi-Task Facial Landmark (MTFL) dataset to train a DDPM. I use the code below. ``` python from denoising_diffusion_pytorch import Unet, GaussianDiffusion, Trainer model = Unet( dim =...
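For context, a minimal sketch of this kind of setup with `denoising_diffusion_pytorch` is shown below. The dataset path, image size, and hyperparameters are placeholders (the issue's own values are truncated), so treat them as assumptions.

```python
# Minimal sketch, assuming MTFL images have been exported as individual
# files under ./mtfl/images; path and hyperparameters are placeholders.
from denoising_diffusion_pytorch import Unet, GaussianDiffusion, Trainer

model = Unet(dim=64, dim_mults=(1, 2, 4, 8))

diffusion = GaussianDiffusion(
    model,
    image_size=128,   # MTFL crops are resized to a fixed square size
    timesteps=1000,
)

trainer = Trainer(
    diffusion,
    './mtfl/images',             # folder of training images
    train_batch_size=32,
    train_lr=8e-5,
    train_num_steps=100000,
    gradient_accumulate_every=2,
    ema_decay=0.995,
)
trainer.train()
```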
What is the difference between finetuning the UNet's image layers and training the motion modules? Suppose I want to train AnimateDiff on a small new dataset (about 72 minutes of video...
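The two regimes amount to choosing which parameter groups receive gradients. Below is a hypothetical sketch, assuming the AnimateDiff convention that motion-module parameters carry "motion_modules" in their names (verify against your checkout); `unet` is any `torch.nn.Module`.

```python
# Hypothetical sketch: toggle between training the temporal motion modules
# and finetuning the pretrained image layers of an AnimateDiff-style UNet.
import torch.nn as nn

def select_trainable(unet: nn.Module, mode: str) -> None:
    for name, param in unet.named_parameters():
        is_motion = "motion_modules" in name
        if mode == "motion":    # train only the inserted temporal layers
            param.requires_grad = is_motion
        elif mode == "image":   # finetune only the original image layers
            param.requires_grad = not is_motion
        else:
            raise ValueError(f"unknown mode: {mode}")
```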
Thanks to the author for this work! When will the training code for SparseCtrl be released?
When I show Video-LLaVA a short video, given inp = 'Could you please provide a detailed description for this video? Your comprehensive video caption should allow listeners to visualize the...
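For reproducibility, a minimal inference sketch using the Hugging Face port of Video-LLaVA is shown below. The checkpoint name and "USER: ... ASSISTANT:" prompt template follow the HF model card; the frame-sampling helper (8 uniformly spaced frames) and the video path are illustrative assumptions, not the issue author's code.

```python
# Minimal sketch: captioning a short video with the HF port of Video-LLaVA.
import av
import numpy as np
from transformers import VideoLlavaProcessor, VideoLlavaForConditionalGeneration

def read_frames(path, num_frames=8):
    # Decode all frames, then keep num_frames uniformly spaced ones.
    container = av.open(path)
    frames = [f.to_ndarray(format="rgb24") for f in container.decode(video=0)]
    idx = np.linspace(0, len(frames) - 1, num_frames).astype(int)
    return np.stack([frames[i] for i in idx])

model = VideoLlavaForConditionalGeneration.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")
processor = VideoLlavaProcessor.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")

prompt = ("USER: <video>\nCould you please provide a detailed description "
          "for this video? ASSISTANT:")
inputs = processor(text=prompt, videos=read_frames("sample.mp4"), return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=200)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```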
Excellent job! I have three questions that are not clear to me. 1. I have some problems understanding the relation between Video-LLaVA and LanguageBind. Does Video-LLaVA use the video encoder of LanguageBind?...
```python
import torch
from languagebind import LanguageBindVideo, LanguageBindVideoTokenizer, LanguageBindVideoProcessor

pretrained_ckpt = 'LanguageBind/LanguageBind_Video_FT'  # also 'LanguageBind/LanguageBind_Video'
model = LanguageBindVideo.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindVideoTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
video_process = LanguageBindVideoProcessor(model.config, tokenizer)

model.eval()
data = ...
```
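The snippet is cut off at the `data = ...` line. An assumed continuation, following the usage shown in the LanguageBind README (the video path and text are placeholders, not the issue author's actual inputs), would look like:

```python
# Assumed continuation per the LanguageBind README: the processor takes a list
# of video paths and a list of texts; the model returns CLIP-style embeddings.
data = video_process(['your/video.mp4'], ['a textual description'], return_tensors='pt')
with torch.no_grad():
    out = model(**data)
print(out.text_embeds @ out.image_embeds.T)  # video/text similarity matrix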
In the [Reason-Edit evaluation benchmark](https://drive.google.com/drive/folders/1QGmye23P3vzBBXjVj2BuE7K3n8gaWbyQ), why does each image come with a corresponding mask when the mask is not used in test/DS_SmartEdit_test.py?
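One common role for such masks in editing benchmarks is scoring background preservation outside the edited region. The sketch below is a hypothetical mask-restricted metric illustrating that idea, not SmartEdit's actual evaluation code.

```python
# Hypothetical sketch: PSNR computed only where mask == 0, i.e. over the
# region that an edit is supposed to leave unchanged. Not from the SmartEdit
# repository; mask is HxW, images are HxWx3 uint8 arrays.
import numpy as np

def masked_psnr(edited: np.ndarray, original: np.ndarray, mask: np.ndarray) -> float:
    keep = mask == 0
    diff = edited[keep].astype(np.float64) - original[keep].astype(np.float64)
    mse = np.mean(diff ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(255.0 ** 2 / mse)
```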