GigaSpeech Is XL subset the 33000hr unlabeled data?

Open: mct10 opened this issue 2 years ago • 1 comment

Hi, as mentioned in the README, GigaSpeech contains "33,000+ hours for unsupervised/semi-supervised learning". I am trying to use this unlabeled data, and I have already downloaded the XL subset. But after I summed up the duration of each audio in GigaSpeech.json, the total is only around 25,000 hours. So my question is: is the entire XL subset the 33,000-hour data, or are additional steps needed to retrieve the 33,000 hours of data? Many thanks!

May 23 '22 12:05 mct10

There are 33,000+ hours of audio files in total under the GigaSpeech directory.
GigaSpeech.json contains 10,000 hours of audio segments with transcriptions for supervised training.

Jun 07 '22 13:06 dophist
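
For anyone repeating the check from the question, here is a minimal sketch that totals hours two ways: per audio file and per transcribed segment. It assumes the metadata has a top-level `audios` list whose entries carry a `duration` in seconds and a `segments` list with `begin_time` and `end_time`; those field names are assumptions, so verify them against your copy of GigaSpeech.json. The helper names `summarize_hours` and `summarize_file` are hypothetical.

```python
import json


def summarize_hours(meta):
    """Return (audio_hours, segment_hours) for a parsed metadata dict.

    Assumed layout (check against your file):
      {"audios": [{"duration": <sec>,
                   "segments": [{"begin_time": <sec>, "end_time": <sec>}]}]}
    """
    audio_sec = 0.0
    segment_sec = 0.0
    for audio in meta.get("audios", []):
        # Full length of the audio file, transcribed or not.
        audio_sec += float(audio.get("duration", 0.0))
        # Only the spans that are listed as segments.
        for seg in audio.get("segments", []):
            segment_sec += float(seg["end_time"]) - float(seg["begin_time"])
    return audio_sec / 3600.0, segment_sec / 3600.0


def summarize_file(path):
    """Load the metadata file and summarize it.

    json.load reads the whole file into memory, which can be slow
    for a large metadata file.
    """
    with open(path, "r", encoding="utf-8") as f:
        return summarize_hours(json.load(f))
```

Comparing the two totals shows how much of the listed audio is covered by segments. Audio files that do not appear in the metadata at all would not be counted by either number.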
