
About video fragments

Open sameerKgp opened this issue 1 year ago • 11 comments

Hi, thanks for providing the code for your work. In the code, what is video_fragment? Is it for the breakpoint mode? How are these fragments created? Also, in src/video_fragment you have provided a clip from a different video (GOT) than the Cooking_cake one.

sameerKgp avatar May 14 '24 13:05 sameerKgp

video_fragment stores the video clip read by the sliding window, and it is created and updated automatically. Also, I couldn't find the GOT video; can you point out the exact path? We didn't upload Cooking_cake since it is too big to upload to GitHub.
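
For intuition, here is an illustrative sketch (not the exact repository code) of how the sliding window advances, assuming each fragment spans video_length / n_samples seconds as in the released implementation:

def fragment_bounds(video_length: float, n_samples: int):
    # Each sliding window covers an equal share of the video.
    per_fragment = video_length / n_samples
    for n_stage in range(n_samples):
        start = n_stage * per_fragment
        yield start, start + per_fragment

for start, end in fragment_bounds(video_length=60.0, n_samples=8):
    print(f"window: {start:.1f}s -> {end:.1f}s")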

Espere-1119-Song avatar May 15 '24 02:05 Espere-1119-Song

Thanks for the reply. I got the Cooking_cake video from the link provided in the 15th issue. The GOT video is src/video_fragment/output.mp4.

sameerKgp avatar May 15 '24 10:05 sameerKgp

I still don't know how to create the video fragment if I use my own video. There is no such function that I can find in the "Chat" class. Maybe in global mode the video fragment is also the original video? Does that mean I need to store the same video at the video fragment path as at the video path?

HTD1016 avatar Jul 09 '24 12:07 HTD1016

You just need to choose one video as the initial video fragment at the beginning; the other video fragments will be created automatically.
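
For example (the paths below are hypothetical, not part of the official API), you could seed the fragment path with any short, decodable clip before running inference; the sliding-window loop then overwrites it at every stage:

import shutil

# Hypothetical setup: seed fragment_video_path with any short, decodable clip.
# The sliding-window loop rewrites this file on each stage.
init_clip = "some_short_clip.mp4"
fragment_video_path = "src/video_fragment/output.mp4"
shutil.copy(init_clip, fragment_video_path)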

Espere-1119-Song avatar Jul 09 '24 22:07 Espere-1119-Song

Thanks for the reply. I used the MovieChat package from PyPI (version 0.6.3) and carefully checked the code in the package. In /anaconda/envs/MovieChat/lib/python3.9/site-packages/MovieChat/models/chat_model.py:

for i in range(num_frames): 
    print(f"current processed frames: {i+1} / {num_frames}")
    video_fragment = self.parse_video_fragment(video_path=video_path, video_length=video_length, n_stage=i)         
    video_fragment, msg = self.load_video(
        video_path=fragment_video_path,
        n_frms=4, 
        height=224,
        width=224
    )
    video_fragment = self.vis_processor.transform(video_fragment) 
    video_fragment = video_fragment.unsqueeze(0).to(self.device)

where the function self.parse_video_fragment() is used to create the video fragment, so that the next function, self.load_video(), can read the video fragment from fragment_video_path. It follows that self.parse_video_fragment() should save the video fragment locally. Now take a look at the self.parse_video_fragment() function:

def parse_video_fragment(self, video_path, fragment_video_path, video_length, n_stage = 0):
    decord.bridge.set_bridge("torch")
    per_video_length = video_length / self.n_samples
    fragment_video = self.capture_video(video_path, per_video_length, n_stage)
    fragment_video.write_videofile(fragment_video_path)  # This code was added by me, as well as the parameter "fragment_video_path"
    return fragment_video

So I think there is a line of code missing here. After I added it, the code works normally. I also noticed that the authors' repository provides a local version of MovieChat which includes this line. However, because of the time MoviePy takes to write videos, the inference time of the whole pipeline becomes very long.
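
For reference, here is a rough sketch of what that cut-and-write step amounts to with MoviePy (the slicing inside capture_video is my assumption; the write_videofile call is the line I added, and the disk write is what dominates runtime):

from moviepy.editor import VideoFileClip  # MoviePy 1.x API

def write_fragment(video_path, fragment_video_path, video_length, n_samples, n_stage):
    # Cut out the n_stage-th window and write it to disk so load_video can read it back.
    per_video_length = video_length / n_samples
    start = n_stage * per_video_length
    clip = VideoFileClip(video_path).subclip(start, start + per_video_length)
    clip.write_videofile(fragment_video_path, logger=None)  # slow: re-encodes every window
    clip.close()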

HTD1016 avatar Jul 10 '24 01:07 HTD1016

Thank you very much for discovering this issue. We will recheck our code and update the MovieChat package as soon as possible to resolve this problem.

Espere-1119-Song avatar Jul 11 '24 04:07 Espere-1119-Song

for i in range(num_frames):
    print(f"current processed frames: {i+1} / {num_frames}")
    video_fragment = self.parse_video_fragment(video_path=video_path, video_length=video_length, n_stage=i)
    video_fragment, msg = self.load_video(
        video_path=fragment_video_path,
        n_frms=4,
        height=224,
        width=224
    )
    video_fragment = self.vis_processor.transform(video_fragment)
    video_fragment = video_fragment.unsqueeze(0).to(self.device)

I noticed that the video_fragment variable is assigned a value in line 3, but then immediately overwritten in line 4. It seems like the assignment in line 3 might be redundant since its value is not used before it's reassigned.
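
In other words, the first call seems to matter only for its side effect of (re)writing fragment_video_path, so the loop could just as well read like this (a sketch, not a tested patch):

for i in range(num_frames):
    print(f"current processed frames: {i+1} / {num_frames}")
    # Called only for its side effect of writing the fragment to fragment_video_path;
    # the returned clip object is never used.
    self.parse_video_fragment(video_path=video_path, video_length=video_length, n_stage=i)
    video_fragment, msg = self.load_video(
        video_path=fragment_video_path,
        n_frms=4,
        height=224,
        width=224
    )
    video_fragment = self.vis_processor.transform(video_fragment)
    video_fragment = video_fragment.unsqueeze(0).to(self.device)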

ywh187 avatar Sep 02 '24 09:09 ywh187

I understand what you mean. During implementation, we found that some versions of ffmpeg may not support initializing a blank video fragment, so we used an unrelated video clip for initialization.

Espere-1119-Song avatar Sep 02 '24 09:09 Espere-1119-Song

@HTD1016 You are just amazing!!!

allent4n avatar Oct 22 '24 14:10 allent4n


Hi, I have two small questions about these two hyperparameters in run_inference_qa_msvd.py:

MAX_INT = 8
N_SAMPLES = 32

If I understand correctly, does N_SAMPLES specify how many fragments (or sliding windows) are created for each video, and does MAX_INT specify how many frames are used as LLM input when encoding each fragment/sliding window?

oximi123 avatar Oct 29 '24 08:10 oximi123

Sorry for the confusion. N_SAMPLES specifies how many fragments (or sliding windows) will be created for each video. However, MAX_INT is not utilized in the current implementation. In our code, the number of frames included within each sliding window corresponds to the length of the short-term memory window used for encoding.
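
As a concrete illustration (the video length below is just an example value):

N_SAMPLES = 32   # number of sliding windows created per video
MAX_INT = 8      # defined in the script but not used by the current implementation

video_length = 120.0                      # seconds; example value
per_fragment = video_length / N_SAMPLES   # 3.75 s of video per sliding window
print(f"each of the {N_SAMPLES} windows covers {per_fragment:.2f} s")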

Espere-1119-Song avatar Oct 29 '24 08:10 Espere-1119-Song