Questions about data construction

Open zengxijuan opened this issue 1 year ago • 1 comments

Hello, thank you for your excellent work. I have a few questions about data construction:

  1. How do different data sets allocate the proportion to generate QA pairs? For example, how does AudioSet data determine which audio segments are used to generate Classification data and which audio segments are used to generate Acoustic Features data?
  2. For the question construction of closed set data, since it is generated by GPT, will there be repeated questions? Do you generate a set of problems and then randomly select them, or do you call the interface for each segment?
  3. When processing data sets, how to deal with the case of data intersection between different data sets?

Look forward to your reply, thank you!

zengxijuan, Jul 28 '23 08:07

hi there,

> How do different data sets allocate the proportion to generate QA pairs? For example, how does AudioSet data determine which audio segments are used to generate Classification data and which audio segments are used to generate Acoustic Features data?

Usually, we generate all possible QA pairs for each sample. For AudioSet, for example, almost every sample gets one classification question and one acoustic-feature question, so there is no split of the audio segments between the two tasks.
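To illustrate the "all possible QA pairs per sample" idea, here is a minimal sketch. It is not taken from the LTU codebase: the `generate_all_qa` function, the `task_spec` mapping, and the `labels` / `features` field names are all hypothetical, chosen only to show that a sample contributes one pair for every task it has an annotation for.

```python
def generate_all_qa(sample, task_spec):
    """Return one QA pair for every task whose annotation exists on the sample.

    `task_spec` maps a task name to (annotation_field, question_text).
    All names here are illustrative assumptions, not the authors' schema.
    """
    qa_pairs = []
    for task, (field, question) in task_spec.items():
        answer = sample.get(field)
        if answer:  # skip tasks for which this sample has no annotation
            qa_pairs.append({
                "audio_id": sample["audio_id"],
                "task": task,
                "question": question,
                "answer": answer,
            })
    return qa_pairs


# Hypothetical task specification for an AudioSet-like sample.
TASK_SPEC = {
    "classification": ("labels", "What sound events are in this clip?"),
    "acoustic_feature": ("features", "Describe the acoustic features of this clip."),
}

sample = {"audio_id": "clip_001", "labels": "Dog; Bark", "features": "sharp, high-pitched"}
pairs = generate_all_qa(sample, TASK_SPEC)
```

With this structure the same clip appears once per task rather than being assigned to a single task.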

> For the question construction of closed set data, since it is generated by GPT, will there be repeated questions? Do you generate a set of problems and then randomly select them, or do you call the interface for each segment?

For closed-ended questions, yes, there are (many) repeated questions. Our closed-ended data is on the order of millions of samples, so it is neither feasible nor necessary to write a different question for every sample of each closed-ended task. In practice, we paraphrase each closed-ended question ten to a hundred times using GPT.
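One way to picture this is a fixed pool of paraphrases per task, with a question drawn from the pool for each sample. The sketch below is only an assumption about how such a pool could be used: the thread does not say how a paraphrase is assigned to a sample, and the pool contents, `build_closed_ended_qa`, and the field names are hypothetical.

```python
import random

# Hypothetical paraphrase pool. In the thread, the paraphrases come from GPT
# (ten to a hundred per closed-ended question); these three are placeholders.
CLASSIFICATION_QUESTIONS = [
    "What sound events can be heard in this clip?",
    "Identify the audio events in the recording.",
    "Which sounds are present in this audio?",
]


def build_closed_ended_qa(samples, question_pool, seed=0):
    """Attach a randomly chosen paraphrase from the pool to each sample."""
    rng = random.Random(seed)  # seeded so the dataset build is reproducible
    return [
        {
            "audio_id": s["audio_id"],
            "question": rng.choice(question_pool),
            "answer": s["labels"],
        }
        for s in samples
    ]


samples = [{"audio_id": f"clip_{i:03d}", "labels": "Dog"} for i in range(10)]
qa = build_closed_ended_qa(samples, CLASSIFICATION_QUESTIONS)
```

Because the pool is far smaller than the dataset, repeated questions are expected, and no GPT call is needed per audio segment.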

> When processing data sets, how to deal with the case of data intersection between different data sets?

There are some overlapping audio clips (not many), but it is not a big issue. When the same clip comes from different datasets, it usually carries different types of annotation, so we simply treat the copies as independent audio samples.
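As a small illustration of treating overlaps as independent samples, the sketch below merges datasets without deduplication and only counts how many clip IDs recur. The `merge_datasets` and `count_overlapping_clips` helpers, the `clip_id` field, and the toy data are hypothetical and not part of the LTU code.

```python
from collections import Counter


def merge_datasets(datasets):
    """Concatenate all datasets, tagging each sample with its source.

    No deduplication: the same clip from two datasets yields two entries.
    """
    merged = []
    for name, samples in datasets.items():
        for s in samples:
            merged.append({**s, "dataset": name})
    return merged


def count_overlapping_clips(merged):
    """Count clip IDs that appear more than once in the merged list."""
    counts = Counter(s["clip_id"] for s in merged)
    return sum(1 for c in counts.values() if c > 1)


# Toy data: clip "x" appears in both datasets with different annotation types.
datasets = {
    "dataset_a": [{"clip_id": "x", "labels": "Dog"}],
    "dataset_b": [{"clip_id": "x", "caption": "a dog barks"},
                  {"clip_id": "y", "caption": "rain falls"}],
}
merged = merge_datasets(datasets)
n_overlap = count_overlapping_clips(merged)
```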

-Yuan

YuanGongND, Jul 28 '23 20:07