ConferencingSpeech2021

Missing data in Audioset

Open Emrys365 opened this issue 3 years ago • 6 comments

Hello,

I was trying to run the simulation with the given selected_list, but I found that some of the AudioSet IDs are no longer accessible. Below is a partial list (I haven't checked all of the sample IDs):

HKTIe6piDOI
M7GmqUqVQEA
Hm20kZ7QzO0
oz3LrVaXMb4
6-kHUulyCog
TGd5kPDdN_I
IjoePLT_cFw
dKK-JaIzwS4
Cmhpj4MJ_hQ
NbBM82N1Xos
2JoJ_1agmTk
8YIELHXpf3g
AdLiRtpI01s
AgVZ65Hr9rw
4fh52mLYBYw
KKoTQfro920
L6DFGW6jeV8
X61ftZ590Uc
pK1ucosjoRo
Lpzx6N2aCMY
lnWP_zWFpBg
mg2rhu_HHR0

For example, https://www.youtube.com/watch?v=6-kHUulyCog says the video is unavailable, and https://www.youtube.com/watch?v=Lpzx6N2aCMY says the video is private.

Could you release the unavailable samples in Audioset directly, or just change the selected list for Audioset?
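
A batch of IDs can be checked without opening each page by probing YouTube's oEmbed endpoint. The following is a minimal sketch, not a definitive check: it assumes that any non-200 response means the video is unavailable, which may over-report (the endpoint can also refuse videos that only disallow embedding) and does not detect region blocks. `oembed_status` and `find_unavailable` are hypothetical helper names.

```python
import urllib.error
import urllib.parse
import urllib.request

OEMBED = "https://www.youtube.com/oembed"


def oembed_status(ytid, timeout=10.0):
    """Return the HTTP status of the oEmbed endpoint for one video ID."""
    query = urllib.parse.urlencode(
        {"url": f"https://www.youtube.com/watch?v={ytid}", "format": "json"}
    )
    try:
        with urllib.request.urlopen(f"{OEMBED}?{query}", timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Deleted or private videos are expected to land here (assumption).
        return err.code


def find_unavailable(ytids, status_of=oembed_status):
    """Return the IDs whose probe did not answer 200, in input order."""
    return [ytid for ytid in ytids if status_of(ytid) != 200]
```

Calling `find_unavailable(["6-kHUulyCog", "Lpzx6N2aCMY"])` performs one request per ID; network failures other than HTTP errors are not caught and will raise. The `status_of` parameter lets a different probe be swapped in.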

Emrys365, Feb 22 '21 04:02

Same issue here. I found that the ytids of the selected AudioSet samples in selected_list all come from the balanced_train and eval_segments splits of the original AudioSet, and in my case 1915 ytids in selected_list are not available. They have been moved, deleted, set to private, or cannot be accessed from the US.
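
The origin of each ytid can be cross-checked against the AudioSet segment CSVs. This is a rough sketch under stated assumptions: the CSV text has already been read from files such as balanced_train_segments.csv and eval_segments.csv, header lines start with `#`, and the first column is the YTID. `load_segment_ids` and `split_of_origin` are hypothetical helper names.

```python
import csv
import io


def load_segment_ids(csv_text):
    """Collect YTIDs from an AudioSet segments CSV, skipping '#' header lines."""
    rows = (line for line in io.StringIO(csv_text) if not line.startswith("#"))
    # skipinitialspace handles the ", " separator before quoted label fields.
    return {row[0] for row in csv.reader(rows, skipinitialspace=True) if row}


def split_of_origin(ytids, splits):
    """Map each YTID to the first split name containing it, or None."""
    return {
        ytid: next((name for name, ids in splits.items() if ytid in ids), None)
        for ytid in ytids
    }
```

Here `splits` is a dict such as `{"balanced_train": load_segment_ids(text_a), "eval": load_segment_ids(text_b)}`; any ytid mapped to `None` comes from neither file.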

km4sh, Feb 22 '21 09:02

> Same issue here. I found that the ytids of the selected AudioSet samples in selected_list all come from the balanced_train and eval_segments splits of the original AudioSet, and in my case 1915 ytids in selected_list are not available. They have been moved, deleted, set to private, or cannot be accessed from the US.

Thank you for your comment.

I have now finished checking the list and found 1076 unavailable ytids. I have attached the list of unavailable wav files here: missing.txt
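
A list like this can be produced by comparing the selected list against whatever was actually downloaded. The sketch below assumes the selected list holds one wav file name per line (as the audioset.name file appears to) and that downloaded clips sit in a single directory under the same names; `missing_entries` is a hypothetical helper, not part of the challenge scripts.

```python
from pathlib import Path


def missing_entries(selected_names, downloaded_dir):
    """Return the selected wav names that are absent from downloaded_dir, sorted."""
    present = {path.name for path in Path(downloaded_dir).glob("*.wav")}
    return sorted(name for name in selected_names if name not in present)
```

The names can be read with `Path(list_file).read_text().split()` and the result written out one per line to reproduce a file such as missing.txt.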

Emrys365, Feb 22 '21 14:02

YouTube content changes over time, so we cannot fully avoid this issue...

Anyway, I have already reported this issue to the main organizers. I recommend that you contact [email protected]. They will deal with this issue.

sw005320, Feb 22 '21 14:02

> YouTube content changes over time, so we cannot fully avoid this issue...
>
> Anyway, I have already reported this issue to the main organizers. I recommend that you contact [email protected]. They will deal with this issue.

Thank you, Shinji. I will contact them.

Emrys365, Feb 22 '21 15:02

Yes, this is unfortunately a common problem with AudioSet. Some videos have been taken down, and some were removed by their original uploaders...

popcornell, Feb 23 '21 21:02

The sample 0N0C0Wbe6AI_30.000.wav in https://github.com/ConferencingSpeech/ConferencingSpeech2021/blob/49d3b2fc47/selected_lists/train/audioset.name#L22677 seems to be wrong: the video https://www.youtube.com/watch?v=0N0C0Wbe6AI is only 25 seconds long.
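
This kind of mismatch can be flagged automatically once video durations are known. The sketch below assumes that the number after the last underscore in the file name is the segment start time in seconds; that reading of the naming scheme is an assumption, and `parse_clip_name` and `starts_past_end` are hypothetical helpers. Splitting on the last underscore matters because ytids may themselves contain underscores (for example TGd5kPDdN_I).

```python
def parse_clip_name(name):
    """Split '<ytid>_<start>.wav' into (ytid, start_seconds)."""
    stem = name[:-4] if name.endswith(".wav") else name
    ytid, start = stem.rsplit("_", 1)  # ytids may contain '_' themselves
    return ytid, float(start)


def starts_past_end(name, duration_s):
    """True if the clip's assumed start time is at or beyond the video's end."""
    _, start = parse_clip_name(name)
    return start >= duration_s
```

With the 25-second duration reported above, `starts_past_end("0N0C0Wbe6AI_30.000.wav", 25.0)` evaluates to `True`.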

Emrys365, Feb 24 '21 11:02