Fine-tune on TVQA dataset
Thank you very much for your work. May I ask whether you could release the code for fine-tuning on the TVQA dataset?
There are some data-processing details that are not clear to me; I would be very grateful for your help.
1. The paper mentions that for TVQA, 6 frames are sampled evenly from each video. What is the text content paired with each frame? If it is dialogue text, how is the corresponding dialogue selected for each frame? If not, what does the text contain?
2. Do the question and the answers form five hypotheses, where each hypothesis's CLS_TOKEN is passed through an MLP and then concatenated with the image CLS_TOKEN? Or is it done some other way?
I really hope to get confirmation of these details. Thank you very much!
- The text part is the dialogue text (subtitles).
- For each [images, context_i, question_i, answer_i] tuple, we feed it into the model and an MLP, and take the max over the N logits. Basically, we copy the image part N times so that it can be concatenated with each of the N candidates separately (see the sketch after this reply).
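
For concreteness, here is a minimal sketch of that candidate-scoring scheme. All names here (`joint_encoder`, `mlp_head`, `score_candidates`) are hypothetical placeholders and the shapes are assumptions; this is only an illustration of the pairing-and-max idea, not the released MERLOT code:

```python
import torch

def score_candidates(joint_encoder, mlp_head, image_feats, text_inputs):
    """Score the N answer candidates for a single TVQA example.

    `joint_encoder` and `mlp_head` are hypothetical stand-ins for the actual
    MERLOT modules, and `text_inputs` is a list of N already-tokenized
    [context_i, question_i, answer_i] sequences.
    """
    logits = []
    for text in text_inputs:                        # N = 5 candidates in TVQA
        # The same image features are reused ("copied") for every candidate.
        joint = joint_encoder(image_feats, text)    # joint vision-language encoding
        logits.append(mlp_head(joint[:, 0]))        # MLP on the CLS position -> one logit
    logits = torch.cat(logits, dim=-1)              # shape: [batch_size, N]
    return logits, logits.argmax(dim=-1)            # prediction = max over the N logits
```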
Let us know if you have further questions!
This is awesome work! Do you plan to release the pre-trained model for TVQA+ and TVQA?
I also have some questions about TVQA fine-tuning, as I am trying to reproduce your results.
- Do you use the ground-truth timestamps of the question, provided by the TVQA dataset, to select frames from the video?
- How exactly do you select the subtitles? They are pretty long (about 260 tokens on average), so I can't fit them all into the input sequence.

It would be very helpful if you could give more detail on what the input to the model looks like for TVQA. Thanks!
- Yes, we extract the frames corresponding to the ground-truth timestamps.
- We use all subtitles, and truncate them if they are longer than 732 tokens (a rough sketch of both steps follows below).
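
A minimal sketch of these two preprocessing steps, assuming hypothetical helpers (`select_frames`, `truncate_subtitles`) and that the timestamps come from the TVQA localization annotations; the exact indexing here is a guess for illustration, not the authors' preprocessing code:

```python
import numpy as np

def select_frames(frame_timestamps, ts0, ts1, num_frames=6):
    """Pick `num_frames` frames spaced evenly over the ground-truth span [ts0, ts1].

    `frame_timestamps` is a sorted array of timestamps (seconds) for the
    extracted video frames; ts0/ts1 come from the TVQA annotations.
    """
    targets = np.linspace(ts0, ts1, num_frames)
    idx = np.searchsorted(frame_timestamps, targets)
    return np.clip(idx, 0, len(frame_timestamps) - 1)

def truncate_subtitles(subtitle_tokens, max_len=732):
    """Keep all subtitle tokens unless the sequence exceeds `max_len`, then cut."""
    return subtitle_tokens[:max_len]
```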