GigaSpeech
Is the XL subset the 33,000-hour unlabeled data?
Hi,
As mentioned in the README, GigaSpeech contains "33,000+ hours for unsupervised/semi-supervised learning". I am trying to use this unlabeled data, and I have already downloaded the XL subset. But after I summed up the duration of each audio in GigaSpeech.json, the total is only around 25,000 hours.
So my question is: is the entire XL subset the 33,000-hour data, or are additional steps needed to retrieve the 33,000 hours of data?
Many thanks!
- There are 33,000+ hours of audio files in total under the GigaSpeech directory.
- GigaSpeech.json contains 10,000 hours of audio segments with transcriptions for supervised training.
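To check these totals yourself, here is a minimal sketch that reports two numbers separately: the summed duration of whole audio files and the summed duration of transcribed segments. It assumes the metadata has a top-level "audios" list, that each entry has a "duration" in seconds, and that each entry has a "segments" list with "begin_time" and "end_time" fields. Those field names are assumptions, so please verify them against your copy of GigaSpeech.json. The helper names `load_meta` and `summarize_hours` are made up for this example.

```python
import json


def load_meta(path):
    """Load the metadata JSON from disk (the real file may be large)."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def summarize_hours(meta):
    """Return (audio_hours, segment_hours) for a parsed metadata dict.

    Assumed layout (verify against your file):
      meta["audios"][i]["duration"]                  -> seconds
      meta["audios"][i]["segments"][j]["begin_time"] -> seconds
      meta["audios"][i]["segments"][j]["end_time"]   -> seconds
    """
    audio_seconds = 0.0
    segment_seconds = 0.0
    for audio in meta.get("audios", []):
        # Duration of the whole audio file, transcribed or not.
        audio_seconds += float(audio.get("duration", 0.0))
        # Duration covered by the listed segments only.
        for seg in audio.get("segments", []):
            segment_seconds += float(seg["end_time"]) - float(seg["begin_time"])
    return audio_seconds / 3600.0, segment_seconds / 3600.0


# Tiny in-memory sample using the assumed layout, so the sketch runs as-is.
sample = {
    "audios": [
        {
            "duration": 7200.0,
            "segments": [
                {"begin_time": 0.0, "end_time": 1800.0},
                {"begin_time": 3600.0, "end_time": 5400.0},
            ],
        },
        {"duration": 3600.0, "segments": []},
    ]
}

audio_hours, segment_hours = summarize_hours(sample)
print(f"audio files: {audio_hours:.1f} h, segments: {segment_hours:.1f} h")
```

On the real data you would call `summarize_hours(load_meta("GigaSpeech.json"))`, adjusting the path to wherever your copy lives. Summing per-audio durations and summing per-segment durations can give quite different totals, so it helps to print both when comparing against the figures quoted in the README.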