Ximing Lu

Results 2 comments of Ximing Lu

1. The text part is the dialogue text (subtitles). 2. For each [images, context_i, question_i, answer_i], we feed it into the model and the MLP, and take the max over the N logits. Basically,...
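As a rough illustration of the max-over-logits step, here is a minimal sketch. It assumes each of the N candidates yields one scalar logit; `score_fn` is a hypothetical stand-in for the model plus MLP, and `pick_answer` is a name made up for this example, not something from the original code.

```python
from typing import Callable, Sequence


def pick_answer(candidates: Sequence[dict], score_fn: Callable[[dict], float]) -> int:
    """Return the index of the candidate with the highest logit.

    Each candidate stands for one [images, context_i, question_i, answer_i]
    tuple. score_fn is a hypothetical placeholder for "model + MLP" and is
    assumed to return a single scalar logit per candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    # One logit per candidate, N logits in total.
    logits = [score_fn(c) for c in candidates]
    # Take the max over the N logits; on ties, the first maximum wins.
    return max(range(len(logits)), key=logits.__getitem__)
```

With a real model, `score_fn` would run the forward pass for one candidate; the selection itself is just an argmax over the N outputs.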

1. Yes, we extract the frames corresponding to the ground-truth timestamps. 2. We use all subtitles, and truncate them if they are longer than 732 tokens.
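The subtitle cut-off could look something like the sketch below. The 732-token limit is from the comment; keeping the first 732 tokens (rather than the last) is an assumption, and `truncate_subtitles` is a hypothetical helper that works on an already tokenized sequence.

```python
from typing import List, Sequence

MAX_SUBTITLE_TOKENS = 732  # limit mentioned in the comment above


def truncate_subtitles(token_ids: Sequence[int],
                       max_len: int = MAX_SUBTITLE_TOKENS) -> List[int]:
    """Keep at most max_len subtitle tokens.

    Assumption: tokens past the limit are dropped from the end; the
    comment does not say which side is cut.
    """
    return list(token_ids[:max_len])
```

Sequences shorter than the limit pass through unchanged.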