
Fine-tuning on the TVQA dataset

Curry-AI opened this issue · 5 comments

Thank you very much for your work. Could you release the code for fine-tuning on the TVQA dataset?

Curry-AI · Jun 19 '21

Some details about the data processing are not clear to me. If you can help, I would be very grateful.

1. The paper mentions that for TVQA, 6 frames are sampled evenly from each video. What is the text content paired with each frame? If it is dialogue text, how do you select the corresponding dialogue for each frame? If not, what does the text contain?

2. Do the question and each of the five candidate answers form five hypotheses that are passed through an MLP, with each hypothesis's CLS token concatenated with the image CLS token? Or is it done some other way?

I would really appreciate confirmation of these details. Thank you very much!

Curry-AI · Jun 25 '21

  1. The text part is the dialogue text (subtitles).

  2. For each [images, context_i, question_i, answer_i], we feed the tuple through the model and an MLP head, then take the max over the N logits. Basically, we copy the images part N times so it can be concatenated with the N candidates separately (see the sketch below).
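Here is a minimal sketch of that N-way scoring, in case it helps. It assumes a hypothetical `model(images, text)` callable that returns a single logit from the MLP head; the names and shapes are illustrative, not the actual MERLOT API.

```python
import numpy as np

def score_candidates(model, images, context, question, answers):
    """Score [images, context, question, answer_i] for each of the N candidates.

    images:  the 6 sampled frames, reused (copied) for every candidate
    answers: list of N candidate answer strings (N = 5 for TVQA)
    """
    logits = []
    for answer in answers:
        # Only the text side (context + question + answer_i) changes
        # between the N forward passes; the image side stays the same.
        text = f"{context} {question} {answer}"
        logits.append(model(images, text))  # one scalar logit per hypothesis
    logits = np.asarray(logits)
    return int(np.argmax(logits)), logits  # prediction = max over N logits
```

At inference the prediction is simply the argmax over the N logits, matching the "max over the N logits" above; at training time a softmax cross-entropy over those N logits would be the usual choice.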

Let us know if you have further questions!

GloriaXimingLu · Jul 14 '21

This is awesome work! Do you have plans to release the model checkpoints for TVQA+ and TVQA?

Lee-Ft · Jul 21 '21

I also have some questions about TVQA fine-tuning, as I am trying to reproduce your results.

  1. Do you use the ground-truth timestamps provided with the TVQA dataset to select frames from the video?

  2. How exactly do you select the subtitles? The subtitles are quite long (about 260 tokens on average), so I can't fit them all into the input sequence.

It would be very helpful if you could give more detail on what the input to the model looks like for TVQA. Thanks!

simon-ging · Aug 19 '21

  1. Yes, we extract the frames corresponding to the ground-truth timestamps.

  2. We use all the subtitles, and truncate them if they are longer than 732 tokens (see the sketch below).
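For concreteness, a minimal sketch of this preprocessing, assuming hypothetical helpers. It combines the two answers above: 6 frame timestamps sampled evenly inside the ground-truth window, and subtitle tokens cut at 732. This is illustrative, not the released pipeline.

```python
import numpy as np

def sample_frame_times(ts_start, ts_end, num_frames=6):
    """Evenly spaced frame timestamps inside the ground-truth window."""
    return np.linspace(ts_start, ts_end, num_frames)

def truncate_subtitle_tokens(token_ids, max_tokens=732):
    """Keep at most the first 732 subtitle tokens, dropping the tail."""
    return token_ids[:max_tokens]
```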

GloriaXimingLu · Aug 30 '21