Questions about data preprocessing
Looking at `run_detect_segment`, it appears to require an annotation file for the video, consisting of a start time, an end time, and a text prompt. (The text prompt does not seem to be used in the code.)
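For reference, here is a minimal sketch of what such an annotation file might look like. The field names and JSON layout are my own guesses from reading the code, not a documented schema:

```python
import json

# Hypothetical annotation format inferred from run_detect_segment:
# one entry per segment, with start/end times (in seconds) and a text
# prompt. Field names are assumptions; check the actual loader for
# the expected schema.
annotations = [
    {"start": 12.5, "end": 18.0, "text": "person opens the fridge"},
    {"start": 18.0, "end": 24.3, "text": "person pours milk into a cup"},
]

with open("video_0001.json", "w") as f:
    json.dump(annotations, f, indent=2)
```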
I wonder if these annotations are created manually, or if they can be created automatically.
Also, when extracting features through the CLIP encoder in `run_clip_filtering`, what text input is required?
Finally, when will the pre-training dataset be released?
Thank you
Hi @asm3242, thanks for your interest!
> I wonder if these annotations are created manually, or if they can be created automatically.
We leverage the annotations from the original datasets, e.g., the segment timestamps and the corresponding language narrations or annotations. For a portion of the video data, we used GPT-4o to generate more fine-grained action descriptions, as we did not have enough budget to annotate everything.
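As a rough illustration of this step, the sketch below expands a coarse narration into a more fine-grained action description via the OpenAI API. The prompt here is purely illustrative and is not the prompt the authors used (which isn't stated in this thread):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def refine_narration(narration: str) -> str:
    """Expand a coarse narration into a fine-grained action description.

    The system prompt below is a placeholder, NOT the authors' prompt.
    """
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Rewrite the narration as a short, fine-grained "
                        "description of the physical actions performed."},
            {"role": "user", "content": narration},
        ],
    )
    return response.choices[0].message.content

print(refine_narration("#C C opens the drawer"))
```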
> Also, when extracting features through the CLIP encoder in `run_clip_filtering`, what text input is required?
For this, we used either the text annotations associated with the original data or the texts generated by GPT-4o.
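In case it helps other readers, here is a minimal sketch of CLIP-based filtering with those texts, using the Hugging Face CLIP implementation: a (frame, text) pair is kept only if the image-text similarity clears a threshold. The model checkpoint and threshold are assumptions, not taken from `run_clip_filtering`:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative CLIP filtering; model choice and threshold are guesses.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def keep_pair(image: Image.Image, text: str, threshold: float = 0.25) -> bool:
    """Return True if the CLIP cosine similarity between the frame and
    its text annotation is at least `threshold`."""
    inputs = processor(text=[text], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).item() >= threshold
```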
> Finally, when will the pre-training dataset be released?
Still working on this and trying to squeeze out more time for it. Hopefully it will be out early next week!
@jwyang Thanks for the answer. If possible, could you tell me what prompt you used with GPT-4o?