Question: Why are the prompts for training and inference for audio event classification different?
Hi, I find that the prompts used for training and testing in audio event classification are different in the code. In the training task "cla_label", one example question is "Identify the audio's noise? Produce solely audio identifiers.", and these questions are all directly related to the classification task. But in inference, all audio event classification questions are asked in the form of audio captioning, for example "Write an audio caption describing the sound?". May I ask why different questions are used during training and testing? Why not use the same type of prompt as during training? Won't this affect the test results? Thanks!
hi there, thanks for the question.
We mentioned this in the paper, page 6, Section 5.1, under the subsection "audio classification":
We tested two prompts, “classify the sound events in the audio clip” and “write an audio caption describing the sound”; while both led to good results in our subjective evaluation, the latter led to better text embeddings for the automatic evaluation framework and is used for benchmarking.
> May I ask why different questions are used during training and testing? Why not use the same type of prompt as during training?

Training also includes both classification and captioning tasks; we simply benchmark with the captioning prompt because it performs better.
> Won't this affect the test results?

As we mentioned in the paper, the captioning prompt leads to better performance because it encourages the model to say more about the sound, whereas the classification prompt typically yields only a concise class name. Since LTU is an open-ended model, it often answers with a synonym of the class name, which makes it hard to benchmark fairly with exact-match metrics.
-Yuan
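For readers wondering what "better text embeddings for the automatic evaluation framework" means in practice, here is a minimal sketch of embedding-based scoring: the model's free-form answer is embedded and matched against embeddings of the candidate class names, so a synonym-rich caption can still be credited to the correct class. The embedding model and label set below are illustrative assumptions, not the paper's exact evaluation setup.

```python
# Minimal sketch: map a free-form answer to the nearest class name via
# text embeddings. The embedding model (all-MiniLM-L6-v2) and the label
# set are assumptions for illustration only.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

labels = ["dog barking", "siren", "speech", "rain"]  # hypothetical candidate classes
label_emb = embedder.encode(labels, normalize_embeddings=True)

# A free-form answer, e.g. produced by the captioning-style prompt.
answer = "A dog is barking loudly while a vehicle passes by."
answer_emb = embedder.encode([answer], normalize_embeddings=True)

# Cosine similarity (dot product of normalized vectors) against every
# class name; the highest-scoring label is taken as the prediction.
scores = answer_emb @ label_emb.T
print(labels[int(np.argmax(scores))])  # expected: "dog barking"
```

A concise answer such as "dog" would also embed close to "dog barking", which is why this style of evaluation tolerates the open-ended, synonym-heavy outputs discussed above.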
I understand, thanks!