Ximing Lu comments




Ximing Lu

Fine-tune on TVQA dataset

1. The text part is the dialogue text (subtitles).
2. For each [images, context_i, question_i, answer_i], we feed it into the model and an MLP, and take the max over the N logits. Basically,...
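As a rough illustration of the "max over the N logits" step, here is a minimal sketch. It is not the repository's code: `pick_answer` and `toy_score` are hypothetical names, and `toy_score` is only a stand-in for the real model plus MLP head, which would return one logit per candidate.

```python
def pick_answer(images, contexts, questions, answers, score_fn):
    """Score each [images, context_i, question_i, answer_i] and pick the max logit.

    score_fn is an assumed callable standing in for "model + MLP";
    it must return a single float logit for one candidate.
    """
    logits = [
        score_fn(images, context, question, answer)
        for context, question, answer in zip(contexts, questions, answers)
    ]
    # The prediction is the index of the candidate with the largest logit.
    best_index = max(range(len(logits)), key=lambda i: logits[i])
    return best_index, logits


def toy_score(images, context, question, answer):
    # Hypothetical scorer: counts answer words that also appear in the context.
    context_words = set(context.lower().split())
    return float(sum(word in context_words for word in answer.lower().split()))
```

With a trained model, `score_fn` would wrap the forward pass; during fine-tuning the N logits would presumably feed a cross-entropy loss over the candidates, though the comment does not spell that out.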

Fine-tune on TVQA dataset

1. Yes, we extract the frames corresponding to the ground-truth timestamps.
2. We use all subtitles, and truncate them if they are longer than 732 tokens.
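The subtitle cut-off could look roughly like the sketch below. Two things here are assumptions rather than facts from the comment: tokens are approximated by whitespace splitting (the real tokenizer is not named), and the cut keeps the first 732 tokens (the comment does not say which end is dropped). `truncate_subtitles` is a hypothetical helper name.

```python
MAX_SUBTITLE_TOKENS = 732  # limit quoted in the comment above


def truncate_subtitles(subtitle_lines, max_tokens=MAX_SUBTITLE_TOKENS):
    """Join all subtitle lines and keep at most max_tokens tokens.

    Assumption: whitespace tokens stand in for the model's real tokenizer,
    and the tail is what gets dropped.
    """
    tokens = " ".join(subtitle_lines).split()
    return tokens[:max_tokens]
```

Short clips pass through unchanged; only subtitle text over the limit is cut.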